Å and more than 80% under RMSD=3.0 Å (966c).
Two structures of AChE (4 cross docking experiments) and 10 structures of trypsin (100 cross docking experiments) with their respective inhibitors demonstrate the capabilities of ISE-dock to deal with protein side chain flexibility. In both cases, high quality docking solutions are obtained in terms of RMSD of all movable atoms from their experimental positions. Docking populations for AChE contain solutions with RMSD≤0.37 Å, and in the worst case, RMSD≤0.85 Å. In 94 cases, the entire docking sets contain solutions with RMSD<2.0 Å and all docking sets contain solutions with RMSD<3.0 Å.
This work shows that ISE-dock is superior in many aspects to the currently well established docking programs Glide, GOLD and AutoDock in flexible ligand rigid protein docking. It has also been shown that ISE-dock deals successfully with various degrees of protein flexibility. In order to handle flexible proteins to the full extent, the scoring scheme needs to be redesigned. The latter task is beyond the scope of this work.
Protein flexibility is an important aspect of a protein-ligand docking program. Other degrees of freedom that were not accounted for in this work, but that can be introduced into ISE-dock relatively easily, are modeling of structurally important water molecules and protonation and tautomeric states of the interacting molecules.
Contents
1 Introduction
1.1 Current drug discovery process
1.2 Flexibility in molecular interactions
1.3 Energy and thermodynamic potentials
1.4 Common energy components
1.5 Force fields and scoring functions
1.5.1 Force field based energy functions
1.5.2 Approximate energy functions
1.5.3 Statistical potentials
1.5.4 Geometric and chemical complementarity functions
1.6 Energy funnels
1.7 Multiple binding modes
1.8 Docking techniques
1.8.1 Flexibility in docking programs
1.8.2 Search algorithms
1.8.3 Evaluating docking programs
1.9 Open problems and issues
2 Methods
2.1 Energy function
2.2 AutoDock docking program
2.2.1 Lamarckian Genetic Algorithm
2.2.2 Problem representation
2.3 ISE-dock program
2.3.1 Iterative Stochastic Elimination algorithm
2.3.2 Problem representation
2.3.3 Protein flexibility
2.4 Rigid protein docking
2.4.1 LGA docking
2.4.2 The data set
2.4.3 Comparisons and their analysis
2.4.4 Paired t-test
2.4.5 Comparing CPU time
2.4.6 Energy funnels
2.5 Flexible protein docking
2.5.1 Protein backbone flexibility
2.5.2 Flexibility of a single side chain
2.5.3 Flexibility of several side chains
2.5.4 Comparisons and their analysis
3 Flexible ligand rigid protein docking
3.1 Top scoring poses
3.2 Top 20 poses
3.3 Solution space coverage
3.4 Time performance
3.5 Multiple binding modes
3.6 PDB data supports distinct funnels
4 Flexible Ligand Flexible Protein Docking
4.1 Protein backbone flexibility
4.2 Flexibility of a single side chain
4.3 Flexibility of several side chains
4.4 Discussion on protein flexibility
5 Conclusions
Appendices (submitted separately)
A Results published in a peer reviewed journal
B ISE-dock and AutoDock parameters and their values
B.1 AutoDock parameters and their default values
B.2 ISE-dock parameters and their default values
C Detailed Results
C.1 Flexible ligand rigid protein docking results
C.2 Flexible ligand rigid protein docking energy landscapes
D Flexible ligand flexible protein docking. Trypsin data set
List of Figures
List of Tables
Acknowledgments
Bibliography
Hebrew abstract
Chapter 1
Introduction
1.1 Current drug discovery process
Since the dawn of history, humankind has been searching for ways to fight diseases and improve the quality of life. Modern science has undergone tremendous developments and has successfully developed a great variety of medicines. Nevertheless, the constant search for better drugs that reduce side effects, cure more diseases, and extend life expectancy and quality has never stopped. Drugs have traditionally been discovered by experimental methods, but more recently, computerized (virtual) drug discovery methods have been devised and have proved helpful in the process of discovering and designing drugs. Figure 1.1 presents an overview of current methods for designing and discovering drugs. Roughly, the systematic search for new active molecules can be divided into three categories: classical chemistry drug discovery, high throughput screening and virtual high throughput screening.
Figure 1.1: Schematic diagram of the main methods in the drug discovery process. Arrows designate process flow. Black asterisks mark steps that may involve molecular docking. Abbreviations: SAR, structure-activity relationship; QSAR, quantitative SAR; ADME-Tox, absorption, distribution, metabolism, elimination, toxicity.
Classical chemistry drug discovery. During the classical drug design process, medicinal chemists use their personal experience, combined with rationalizing the knowledge of active compounds and the suspected drug target. The process involves iterations of data evaluation, synthesis and purification, and assessment of biological activity. Only a few compounds can be processed simultaneously using this approach, which is still labor-intensive, slow, and expensive, requiring costly materials and techniques.
High throughput screening. In several large and medium sized pharmaceutical companies, high throughput screening (HTS), which robotically scans the activities of hundreds of thousands of compounds, has become a major method. The targets for screening can be single molecules, colonies of bacteria, fungi, or animal cells. In this kind of experiment, the effect is recorded using fast and, sometimes, non-specific parameters such as color change, conductivity of electric current, particle count, etc. HTS experiments are frequently conducted without exact knowledge of the target structure or of the mechanism of action. While faster than the first approach, HTS often suffers from ambiguity in the interpretation of results and may still require expensive materials and equipment.
Virtual high throughput screening (V-HTS). In order to save time and reduce costs, virtual HTS is designed to mimic the HTS task in silico and is expected to indicate which compounds are worth testing in wet experiments. Instead of screening real compounds against real targets, virtual computer libraries of existing and not yet existing chemicals are used. Naturally, this process is much cheaper and, usually, faster than the two former ones. On the other hand, the V-HTS process relies heavily on the construction and validation of the underlying computational methods and on the interpretation of the results. The availability of fast, validated and accurate computational screening methods is usually the major bottleneck of the V-HTS approach. The main tool of V-HTS is molecular docking, in which a ligand or potential drug is "driven" in order to find a good "parking place" on the biological target.
Docking programs are computational tools that model the structure and the nature (affinity) of molecular complexes. These programs aim to predict the geometry of inter- and intra-molecular interactions and to rank the various possibilities. The main advantage of computational techniques in general, and of docking programs in particular, is that they are much cheaper and faster than the corresponding wet techniques. Docking programs are used as a primary screening tool during the virtual high throughput process. They also assist biologists, biochemists, and medicinal chemists in designing novel molecules and in interpreting experiments that assess the activity of already existing ones.
Two main goals of docking tools are (1) to assist in designing novel chemical compounds, and (2) to study the nature of interactions between biological targets and ligands. These may include endogenous molecules such as hormones, or external ones such as drugs or toxic compounds.
Every docking program requires that the three dimensional (3D) structure of the target molecule be known to some extent. The Protein Data Bank (PDB)[5] is a publicly available repository that contains more than 42,000 3D structures of biological macromolecules solved at various resolutions. Since 1999, the U.S. National Institute of General Medical Sciences (NIGMS) has sponsored a large scale project called the Protein Structure Initiative (PSI)[75]. The main goal of this initiative is to enlarge the number of solved 3D structures of proteins, which would enable better coverage of the existing drug targets and the discovery of new ones. Since the establishment of the PSI, the project has yielded more than 1,800 solved protein structures (as of June 2006), with a current estimated rate of more than 500 solved structures per year[67].
Despite the progress that the field of docking has undergone in the last few years, several problems still exist. One of the major problems is energy calculation. Another major problem is accounting for the many degrees of freedom of the docking problem. These include flexibility of the molecules, protonation and tautomeric states, etc. Considering all these degrees of freedom results in a tremendous combinatorial space that each docking tool has to search. Due to the flexible nature of molecules, it is important not to limit the scope of the docking solution to a single structure but, instead, to predict a collection (ensemble) of multiple low energy conformations that contribute to the biological activity.
In this work, I present ISE-dock, a protein-ligand docking tool that successfully overcomes the huge combinatorial space problem while accounting for ligand and, to a lesser extent, protein flexibility, and that is capable of producing arbitrarily large docking populations without a substantial increase in CPU time.
1.2 Flexibility in molecular interactions
Since 1894, when Emil Fischer proposed the famous "lock and key" model[21], the perception of the nature of binding between biological molecules has undergone several changes. Although evidence that supports the lock and key model exists (see for example [6, 55, 22]), two models that are considered to represent the majority of receptor-ligand interactions are induced fit[48] and equilibrium of multiple pre-existing conformations[63, 52, 76]. Induced fit theory assumes that the conformations of the target and ligand affect each other as they approach an encounter. The conformation of the final complex may not be derived directly from the conformations of the separate molecules. The pre-existing conformations model assumes that the final target and ligand conformations are already probed by the isolated molecules, but that they could be of much higher energy than the most abundant conformation, so that their accessibility is minute in the absence of the partner. It is not uncommon that the most populated unbound states of a protein are not those that are most populated in the bound structure[97, 10]. The same notion is true for ligands: it was found[83] that ligands rarely bind their receptors in the calculated global minimum conformation. Moreover, in 60% of the cases, the bound ligand is not found even in a local energy minimum, and at least 10% of the examined ligands bind with strain energies over 9 kcal/mol.
Many theoretical and experimental studies support either the induced fit or the pre-existing populations model in different cases of binding[37, 55, 63]. From a thermodynamic point of view, the two models are equivalent; however, describing biological systems in terms of pre-existing populations and conformational selection is more useful in the process of drug discovery[97]. Regardless of which of the two models more accurately describes the nature of binding, it is clear that molecular flexibility is involved in complex formation.
The process of binding may result in either an increase or a decrease of flexibility. Decreased flexibility may be attributed to enthalpy-entropy compensation, when more effective binding interactions are gained by "freezing" motion. On the other hand, complex formation may be stabilized by an entropic contribution associated with increased flexibility[116]. It has been suggested[101] that in 13 different MHC receptor-peptide complexes, the flexibility is associated with as much as 50% of the free energy of binding.
Flexibility plays an important role not only in complex formation, but also in the mechanism of action of various complexes. For example, the conformational changes of several enzymes are very important for their activity[10, 40, 51]. Solved structures of protein-ligand complexes frequently show complexes with 70-100% of the ligand's surface area buried. Clearly, conformations of this kind could not be achieved without at least a minimal degree of protein flexibility. Works that analyze bound and apo-proteins show that although there are complexes in which the protein undergoes almost no change upon ligand binding[41], proteins that bind small molecules usually undergo conformational changes[97, 61, 60, 88].
1.3 Energy and thermodynamic potentials
The three most common thermodynamic potentials are: internal energy, enthalpy and Gibbs free energy.
Internal Energy. The internal energy (denoted as U or E) of a thermodynamic system is the total kinetic energy due to the motion of particles and the potential energy associated with the vibrational and electronic energy of atoms, including the energy of chemical bonds. Internal energy does not include the kinetic energy due to the motion of the system as a whole. It does not account for potential energy due to the position of the system in an external gravitational, electric or magnetic field.
The internal energy is essentially defined by the first law of thermodynamics, which states that energy is conserved:
$$\Delta U = Q + W + W' \tag{1.1}$$
Where $\Delta U$ is the change in internal energy of a system during a process, $Q$ is heat added to a system, $W$ is the mechanical work done on a system, and $W'$ […]
$$[\ldots]\ \sum_{i=1} \mu_i N_i \tag{1.7}$$
Where: $U$ is the internal energy; $P$ is pressure; $V$ is volume; $T$ is the temperature; $S$ is the entropy; $\mu_i$ is the chemical potential of the i-th chemical component; $N_i$ is the number of particles (or number of moles) composing the i-th chemical component. It can be shown that
$$\Delta G = \Delta H - T\Delta S \tag{1.8}$$
Where $\Delta S$ is the change in the internal entropy of the system. The value of $\Delta G$ from equation (1.8) is used to determine whether a chemical reaction is favorable or not: reactions with $\Delta G < 0$ will occur spontaneously, while those with $\Delta G \geq 0$ will not.
Binding Affinity. Non-covalent receptor-ligand interactions may be written in the following general form[72]:
$$RL \underset{k_a}{\overset{k_d}{\rightleftharpoons}} L + R$$
Where: R, L and RL are the receptor, ligand and receptor-ligand complex, respectively; $k_d$ and $k_a$ are kinetic constants of dissociation and association, respectively. This reaction describes dissociation of a receptor-ligand complex. The thermodynamic equilibrium constant of this reaction in ideal conditions is defined as:
$$K_d = \frac{[R][L]}{[RL]} \tag{1.9}$$
Where [X] denotes the molar concentration of the component X. The equilibrium constant can be related to the change in the Gibbs free energy (eq. (1.8)) of the above dissociation reaction:
$$\Delta G = \Delta G^0 = -RT \ln K_d \tag{1.10}$$
Here, $R$ is the universal gas constant, and $T$ is the absolute temperature. $\Delta G^0$ is the free energy change at equilibrium under standard conditions (all the chemical components are at 1 M concentration, T = 273.15 K, pressure = 1 atm).
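As a minimal numerical sketch of equation (1.10), the following Python fragment converts a dissociation constant into a free energy. The function names, the choice of kcal/(mol K) for the gas constant and the example temperature of 298.15 K are assumptions made for illustration only; they are not taken from this work.

```python
import math

# Universal gas constant in kcal/(mol*K); the unit choice is an assumption of this sketch.
GAS_CONSTANT_KCAL = 1.987204e-3


def dissociation_free_energy(kd_molar, temperature_k):
    """Free energy change of RL -> R + L following eq. (1.10): -RT ln(Kd).

    kd_molar is Kd expressed relative to the 1 M standard state.
    """
    if kd_molar <= 0.0 or temperature_k <= 0.0:
        raise ValueError("Kd and temperature must be positive")
    return -GAS_CONSTANT_KCAL * temperature_k * math.log(kd_molar)


def binding_free_energy(kd_molar, temperature_k):
    """Association is the reverse reaction, so its free energy has the opposite sign."""
    return -dissociation_free_energy(kd_molar, temperature_k)


if __name__ == "__main__":
    # Illustrative Kd values only (millimolar, micromolar, nanomolar binders).
    for kd in (1e-3, 1e-6, 1e-9):
        dg = binding_free_energy(kd, temperature_k=298.15)
        print(f"Kd = {kd:.0e} M  ->  dG_bind = {dg:.2f} kcal/mol")
```

Under these assumptions, each thousandfold decrease in the dissociation constant lowers the free energy of binding by about 4.1 kcal/mol, which gives a feeling for how small energy errors translate into large errors in predicted affinity.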
In attempts to calculate the change in free energy upon binding (free energy of binding), it is customary to separate the overall energy into distinct components. These components may include entropy loss due to association, entropy gain of water due to binding of the ligand (hydrophobic effect), entropy loss in the receptor and the ligand due to constraints of internal degrees of freedom, interaction between the ligand and the receptor, and changes in the conformational (internal) energy of the molecules upon binding.
The basic assumption of most of the works on experimental or computational determination of binding energy is that different contributions to the binding energy are independent and additive. Thus binding energy may be written as a sum of its components[72]:
$$\Delta G_{bind} = \Delta G_{solvent} + \Delta G_{conf}^{receptor} + \Delta G_{conf}^{ligand} + \Delta G_{int} + \Delta G_{motion} \tag{1.11}$$
One should note that, based on the principles of independence and additivity of energy components, many other variants of this equation may be written. Furthermore, the same assumption of additivity and independence allows the creation of statistical functions that approximate the binding free energy without direct connection to the underlying physical and thermodynamic processes.
1.4 Common energy components
Based on the equation (1.11), energy calculations are divided into distinct
components. In this section I will describe the most commonly used terms
of energy functions. This list is by no means complete, but rather serves as
a brief introduction.
Physically based potentials are mainly divided between bonding and non-bonding expressions. Supplementary expressions for solvation or entropy loss due to restricted rotations are sometimes added.
Non-bonding expressions
It is common to model pairwise interactions between atoms that are separated by at least 4 covalent bonds in terms of electrostatic (Coulomb) and Van der Waals interactions.
Coulomb potential. We use the Coulomb potential to estimate the enthalpy contribution of any two charged particles to the overall potential energy:
$$E_{el} = \frac{Q_1 Q_2}{\varepsilon r} \tag{1.12}$$
Where $Q_1$ and $Q_2$ are the partial charges of the two particles, $r$ is the distance separating them, and $\varepsilon$ is the dielectric constant of the separating medium. In vacuum, $\varepsilon$ equals 1. Figure 1.2 shows a typical shape of the electrostatic potential of charged particles.
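Equation (1.12) can be sketched in a few lines of Python. The conversion factor below, which yields kcal/mol for charges in units of the electron charge and distances in Å, is the value commonly used in molecular mechanics; it, the function name and the parameter names are assumptions of this sketch, since equation (1.12) itself is written without units.

```python
# Conversion factor so that charges in units of e and distances in Angstrom
# give energies in kcal/mol (an assumption of this sketch).
COULOMB_CONSTANT = 332.0637


def coulomb_energy(q1, q2, r, dielectric=1.0):
    """Pairwise electrostatic energy Q1*Q2 / (dielectric * r), as in eq. (1.12)."""
    if r <= 0.0:
        raise ValueError("distance must be positive")
    if dielectric <= 0.0:
        raise ValueError("dielectric constant must be positive")
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


if __name__ == "__main__":
    # Opposite unit charges at 3 Angstrom, in vacuum and in a water-like medium.
    print(coulomb_energy(1.0, -1.0, 3.0))
    print(coulomb_energy(1.0, -1.0, 3.0, dielectric=80.0))
```

The second call illustrates why the choice of dielectric constant matters so much in scoring: the same pair of charges interacts 80 times more weakly when the medium is treated as bulk water rather than vacuum.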
Figure 1.2: Typical shapes of electrostatic interaction energy. The energies of two identical (full line) and opposite (dashed line) charges in vacuum are shown.
Hydrogen bonds. The hydrogen bond (H-bond) effect is closely related to electrostatic interactions. This effect is caused by the interaction of electronegative atoms with hydrogen connected to other electronegative atoms. The nature of H-bonds allows charge transfer along the bond. The strongest H-bond effect is achieved when the three interacting atoms (hydrogen donor, hydrogen atom and hydrogen acceptor) and the mediating lone electron pair lie on a single line. To account for this directionality, many force fields contain explicit terms for the angle of the H-bond. For example, the following is the H-bond component of the MM3 force field[58], which demonstrates an explicit term for the angle between the interacting atoms:
$$E_{HB} = \varepsilon_{HB}\left[1.84\times10^{5}\,e^{-12.0/P} - \frac{2.25\,P^{6}}{D}\left(\frac{l}{l_0}\right)\cos\theta\right] \tag{1.13}$$
Where $l$ and $l_0$ denote the actual and the reference H-bond lengths, respectively, $\varepsilon_{HB}$ is the depth of the energy potential well, $P$ is the sum of the van der Waals radii of the atoms divided by the sum of the effective interatomic distances between them and $D$ is the dielectric constant. The dependence of energetics on the angular relations of H-bonds plays an important role in the specificity of molecular interactions. Figure 1.3 shows examples of typical inter- and intra-molecular hydrogen bonds.
Figure 1.3: Examples of inter- (left) and intra- (right) molecular H-bonds
The majority of existing scoring functions do not include explicit terms for hydrogen bonds[54], but rather rely on Van der Waals or electrostatic interactions.
Van der Waals interactions. Van der Waals (VdW) forces account for both attraction and repulsion of non-bonded atoms. Usually, the Van der Waals enthalpy contribution of atoms is estimated using the Lennard-Jones (LJ) potential:
$$E_{VdW} = \sum_{i=1}^{N-1}\sum_{j=i+1}^{N} 4\varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right] \tag{1.14}$$
Where $\varepsilon_{ij}$ is the depth of the potential well between the atoms i and j, $r_{ij}$ is the distance between the two atoms, $\sigma_{ij}$ is the distance at which the inter-particle potential is zero and $N$ is the number of atoms.
Equation (1.14) is sometimes referred to as the 6-12 LJ potential, as opposed to the 4-10 potential, a smoother estimation with a lower repulsion effect. Figure 1.4 presents the shape of the Van der Waals potential of two identical atoms. Although equation (1.14) is the one most frequently encountered, there are other ways to estimate Van der Waals energy (for example Hill's equation[38]).
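The double sum of equation (1.14) can be illustrated with a short Python sketch. For simplicity it assumes a single atom type, so that one well depth and one zero-crossing distance are shared by all pairs; the function names and the argon-like parameter values in the usage example are illustrative assumptions.

```python
import math
from itertools import combinations


def lennard_jones(r, epsilon, sigma):
    """6-12 Lennard-Jones energy of one atom pair, the summand of eq. (1.14)."""
    if r <= 0.0:
        raise ValueError("distance must be positive")
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def vdw_energy(coordinates, epsilon, sigma):
    """Sum the pair energy over all i < j, assuming a single atom type."""
    return sum(
        lennard_jones(math.dist(a, b), epsilon, sigma)
        for a, b in combinations(coordinates, 2)
    )


if __name__ == "__main__":
    epsilon, sigma = 0.238, 3.4  # illustrative, roughly argon-like (kcal/mol, Angstrom)
    r_min = 2.0 ** (1.0 / 6.0) * sigma  # distance of the energy minimum
    print(lennard_jones(sigma, epsilon, sigma))   # potential crosses zero at sigma
    print(lennard_jones(r_min, epsilon, sigma))   # well depth is reached at r_min
```

The pair energy is zero at the distance sigma and reaches its minimum, equal to minus the well depth, at the slightly larger distance 2^(1/6) sigma, where the inter-particle force vanishes.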
Figure 1.4: Van der Waals interaction energy of argon dimer. Taken from the Wikipedia
[113] under the GNU Free Documentation License
Bonding expressions
The three most common terms that describe the contribution of bonding
interactions to the overall energy are bond stretching, angle bending and
bond rotation (torsion).
Bond stretching. One of the equations that describe the potential energy for a covalent bond is:
$$E_{stretch} = D_e\left[1 - e^{-\alpha(r - r_0)}\right]^2 \tag{1.15}$$
In this equation (which is often referred to as the Morse equation), $D_e$ is the depth of the energy minimum, $r_0$ is the reference bond length, and $\alpha = \omega\sqrt{\mu/2D_e}$, where $\mu$ is the reduced mass and $\omega$ is the bond vibration frequency.
To simplify the energy calculations, a harmonic potential is often applied to bond stretching (Hooke's law). Although less accurate, the harmonic potential is faster to calculate and is accurate enough at the bottom of the potential well.
$$E_{stretch} = \frac{1}{2}k(r - r_0)^2 \tag{1.16}$$
Figure 1.5 presents the shapes of the Morse and Hooke's potentials around the minimum.
Figure 1.5: Comparison of Morse (dashed line) and Hooke's harmonic (full line) potentials of bond stretching energy around the minimum. To construct this graph, all the parameters in equations (1.15) and (1.16) were assigned the value of 1.
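The relation between equations (1.15) and (1.16) can be explored numerically with the sketch below. The function and parameter names are assumptions, and the default value of 1 for every parameter simply follows the convention stated for Figure 1.5. The comment about matching curvatures (a harmonic force constant of 2 D_e α²) follows from the second derivative of the Morse expression at the minimum and is added here for illustration.

```python
import math


def morse(r, d_e=1.0, alpha=1.0, r0=1.0):
    """Morse bond stretching energy, eq. (1.15)."""
    return d_e * (1.0 - math.exp(-alpha * (r - r0))) ** 2


def harmonic(r, k=1.0, r0=1.0):
    """Hooke's law bond stretching energy, eq. (1.16)."""
    return 0.5 * k * (r - r0) ** 2


if __name__ == "__main__":
    # With k = 2 * d_e * alpha**2 the two curves share the same curvature at r0,
    # so they agree near the minimum and diverge for large stretches.
    k_matched = 2.0 * 1.0 * 1.0 ** 2
    for r in (0.9, 1.0, 1.1, 2.0, 5.0):
        print(f"r = {r:4.1f}  Morse = {morse(r):8.4f}  harmonic = {harmonic(r, k=k_matched):8.4f}")
```

Near the reference bond length the two expressions are practically indistinguishable, whereas for a strongly stretched bond the Morse energy levels off at the dissociation energy while the harmonic energy keeps growing.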
Angle bending. The angle bending contribution to the potential energy may be estimated using the following equation:
$$E_{bending} = \frac{1}{2}(\theta - \theta_0)^2\left[1 - k_1(\theta - \theta_0) - k_2(\theta - \theta_0)^2 - k_3(\theta - \theta_0)^3 \ldots\right] \tag{1.17}$$
Where $\theta$ is the angle, $\theta_0$ is the reference angle and $k_1, k_2, \ldots$ are force constants specific to the bonds that form the angle. A good approximation of this general form equation is Hooke's harmonic potential:
$$E_{bending} = \frac{k}{2}(\theta - \theta_0)^2 \tag{1.18}$$
Bond torsion. One of the possible equations that describe the contribution of torsions around chemical bonds is
$$E_{torsion} = \sum_{n=0}^{N} C_n \cos^n(\phi) \tag{1.19}$$
Where $C_n$ is some force constant, $\phi$ is the torsion angle, and $N$ the number of rotating bonds. Although many force field terms of bond torsion contain the above equation, there is sometimes a need for more accurate estimations. On the other hand, many force fields do not contain explicit terms for torsions[54]. In these cases non-bonding terms for Van der Waals and electrostatic interactions are used to achieve the desired potential profile.
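A power series in the cosine of the torsion angle, as in equation (1.19), can reproduce the familiar periodic torsion profiles. The sketch below evaluates the series for an arbitrary list of coefficients; the function name and the coefficient values are illustrative assumptions. The usage example relies on the trigonometric identity cos(3φ) = 4cos³(φ) - 3cos(φ) to build a threefold barrier of height V.

```python
import math


def torsion_energy(phi, coefficients):
    """Evaluate sum_n C_n * cos(phi)**n, as in eq. (1.19); phi is in radians."""
    c = math.cos(phi)
    return sum(c_n * c ** n for n, c_n in enumerate(coefficients))


if __name__ == "__main__":
    barrier = 2.0  # illustrative barrier height V
    # (V/2) * (1 + cos(3*phi)) written as a cosine power series:
    # V/2 - (3V/2) cos(phi) + 2V cos(phi)**3
    threefold = [0.5 * barrier, -1.5 * barrier, 0.0, 2.0 * barrier]
    for degrees in (0, 60, 120, 180):
        phi = math.radians(degrees)
        print(f"phi = {degrees:3d} deg  E = {torsion_energy(phi, threefold):.4f}")
```

With these coefficients the series has maxima at 0° and 120° and minima at 60° and 180°, which is the kind of profile expected for rotation around a bond between two tetrahedral centers.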
Entropy estimation and solvation terms
A solute molecule that leaves the solution in favor of a complex with another molecule produces two main effects on the system's entropy. First, it changes the micro-structure of the water bulk that surrounds the two solute molecules. This change results in more water molecules that are capable of creating hydrogen bonds between themselves. The second effect is the change in the internal degrees of freedom.
Entropy change estimation is one of the most challenging problems in computational research of biological systems. The reason for the complexity of this task may be demonstrated by the Gibbs entropy formula:
$$S = -k_B \sum_{i=1}^{N} p_i \ln p_i \tag{1.20}$$
Where $N$ is the number of possible discrete states of a system, and $p_i$ is the probability of a certain state. Equation (1.20) results in a huge complexity. The large number of possible states of a system leads to very small values of $p_i$, which in turn requires extensive sampling and may lead to a large accumulation of errors. Several additional ways to exactly evaluate the entropy exist, but they do not change the complex nature of the calculations. For a review on entropy calculations in biological systems see ref.[3].
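For a system whose state probabilities are already known, equation (1.20) is straightforward to evaluate, as the following sketch shows; the difficulty described above lies in obtaining reliable probabilities, not in the summation itself. The function name, the input validation and the use of the SI value of the Boltzmann constant are assumptions of this sketch.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant in J/K


def gibbs_entropy(probabilities, k_b=BOLTZMANN_J_PER_K):
    """Gibbs entropy -k_B * sum(p_i * ln p_i), eq. (1.20).

    States with zero probability contribute nothing to the sum.
    """
    if any(p < 0.0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probabilities), 1.0, rel_tol=0.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    return -k_b * sum(p * math.log(p) for p in probabilities if p > 0.0)


if __name__ == "__main__":
    # Entropy in units of k_B for a uniform and a strongly peaked distribution.
    print(gibbs_entropy([0.25, 0.25, 0.25, 0.25], k_b=1.0))
    print(gibbs_entropy([0.97, 0.01, 0.01, 0.01], k_b=1.0))
```

A uniform distribution over N states gives the maximal value k_B ln N, while a system confined to a single state has zero entropy; restricting a ligand to one bound conformation therefore carries an entropic cost.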
1.5 Force fields and scoring functions
During the process of docking, many conformations are searched. The program needs to choose between the different conformations, thus each conformation is given a numerical value which, in most cases, is supposed to represent its relative stability. Computational functions that estimate the energy of the system can be based on the principles of classical physics (force field based functions). Another class of functions combines statistical physics equations with many approximations that are based on known macro-structures. This class of methods is often called "approximate" or "knowledge based" functions[82]. In addition, purely statistical scoring functions exist. Such functions are based on statistical analysis of various patterns, such as the distribution of contacts between different types of atoms[69]. Another approach to estimating the fitness of docking structures is to use shape complementarity.
1.5.1 Force field based energy functions
Force field based scoring functions are based on the equations that were mentioned in Section 1.4. Two major energy functions of this kind are AMBER [14] and CHARMM [65]. These functions differ in atom typing, in the parameters for the various terms and in the basic equations that build them up. The main equation of the AMBER force field reveals the complexity that is common to all the energy functions in this class:
$$
\begin{aligned}
E_{total} ={} & \sum_{bonds} K_r (r - r_0)^2 + \sum_{angles} K_\theta (\theta - \theta_0)^2 + \sum_{dihedrals} \frac{V_n}{2}\left[1 + \cos(n\phi - \mathrm{phase})\right] \\
& + \sum_{i<j}\left[\frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} + \frac{q_i q_j}{\varepsilon r_{ij}}\right] + \sum_{i<j}\left[\frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}}\right]
\end{aligned}
\tag{1.21}
$$
In this equation, the last term is the estimation of hydrogen bond energy. The rest of the terms have already been discussed. A review of CHARMM, AMBER and other common force fields has been recently published[64].
Due to the complexity of force field based scoring functions, they pose a relatively heavy computational load, which results in relatively low calculation speed. Thus, in the case of the docking problem, the full forms of these functions are mostly suitable for structure preparation before docking or for post-docking processing.
1.5.2 Approximate energy functions
As stated before, one of the major drawbacks of force field based scoring functions is their extensive computational cost due to the large number of energy terms and their complexity. Moreover, several terms, such as the solvation effect, the contribution of flexibility to the overall system energy and others, require sampling of multiple conformations in the solution space. To overcome this obstacle, several knowledge based potentials have been proposed. In this class of functions, the number of energy terms and the number of supported atom or bond types are reduced. The general form of the remaining terms resembles that of the force field based functions. The parametrization is done using statistical analysis of known structures of macromolecules. The structures are chosen according to the problem and may include folded proteins, proteins bound to other proteins, small molecules, DNA, etc. It is possible to calibrate the parameters using focused sets of structures (target tailored functions). Studies show that such a strategy improves the accuracy of scoring functions[11, 92]. Because the parametrization of knowledge based scoring functions is done using known macro-structures, they implicitly account for entropic effects such as solvation and changes in internal degrees of freedom. Estimation of entropic and solvation contributions to the overall binding affinity is usually done using one or more of the following terms[109, 49, 70]: hydrophobic match, solvent accessible surface (divided into atom types according to the extent of hydrophobicity/hydrophilicity), and the number of internal degrees of freedom (usually, the count of rotatable bonds). This support of entropic terms is gained without costly computations.
On the other hand, the calibration process does not account for non-native structures. This might lead to meaningless results when one attempts to quantitatively evaluate poses that reside far away from an energy minimum.
Most existing docking programs (for example AutoDock [33, 71, 70], FlexX [49], FlexE [12], Glide [23], GOLD [42] and others) use approximate scoring functions. It is possible to compensate for the relative lack of accuracy of this class of functions by further re-scoring docking candidates with or without an additional simulation step (such as minimization or molecular dynamics). This multistage approach was successfully adopted by several research groups[8, 108]. For example, in one work[108], molecular dynamics combined with MM-PBSA (molecular mechanics Poisson-Boltzmann/surface area) was used to re-rank the solutions suggested by DOCK 4.0. In that work, a conformation within 1.1
[…]
$$\frac{N_i}{N} = \frac{e^{-E_i/kT}}{\sum_j e^{-E_j/kT}} \tag{1.22}$$
In this equation (also known as the Boltzmann or Maxwell-Boltzmann distribution), $N_i$ is the number of molecules at equilibrium temperature $T$ in a state $i$ that has energy $E_i$; $N$ is the total number of molecules in the system and $k$ is the Boltzmann constant, which is replaced by the universal gas constant ($R$) from eq. (1.10) when energies are expressed per mole. If the energy barrier between two minima is low enough, and the temperature is high enough, then the molecules in a system can alternate between multiple states. If the difference between the binding energies (i.e. $\Delta(\Delta G_{bind})$) of two or more conformations is such that transformation of the system between them doesn't effectively compensate for the separating energy barriers, these multiple conformations may exist in the system simultaneously, presenting a phenomenon known as multiple or alternative binding modes.
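To illustrate how small energy differences translate into coexisting states, the sketch below evaluates the Boltzmann distribution of equation (1.22) for a list of state energies. It assumes molar energies in kcal/mol, so that the gas constant takes the place of the Boltzmann constant; the function name, the units and the example energies are assumptions made for illustration.

```python
import math

# Universal gas constant in kcal/(mol*K); molar energy units are an assumption of this sketch.
GAS_CONSTANT_KCAL = 1.987204e-3


def boltzmann_populations(energies, temperature_k):
    """Fractional populations N_i/N of states with the given energies, eq. (1.22)."""
    if not energies:
        raise ValueError("at least one state is required")
    if temperature_k <= 0.0:
        raise ValueError("temperature must be positive")
    rt = GAS_CONSTANT_KCAL * temperature_k
    lowest = min(energies)  # shifting by the minimum avoids overflow and leaves ratios unchanged
    weights = [math.exp(-(e - lowest) / rt) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Two hypothetical binding modes separated by 1 kcal/mol at 300 K.
    for energy, population in zip((0.0, 1.0), boltzmann_populations([0.0, 1.0], 300.0)):
        print(f"E = {energy:.1f} kcal/mol  population = {population:.3f}")
```

Under these assumptions, a second binding mode lying 1 kcal/mol above the best one is still populated by roughly 16% of the molecules at 300 K, which is why a docking tool that reports only the single best pose may miss experimentally relevant alternatives.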
A growing body of data supports the existence of multiple binding modes of ligands to receptors. These may manifest in the form of a ligand that binds the same (or a similar) protein in different distinct modes or, alternatively, ligand molecules that share structural similarity may be observed in different binding modes when bound to the same protein[18, 9, 44, 91, 57]. It is clear that individual conformations of multiple binding modes, if they exist, may have a unique contribution to the binding energies or specificity. The program presented in this work, ISE-dock, is capable of producing arbitrarily large near-optimal populations of docking solutions, resulting in an efficient sampling of the energy hyperspace and increasing the chances of detecting alternative binding modes.
1.8 Docking techniques
1.8.1 Flexibility in docking programs
The structural and energy considerations that were presented above imply that accounting for flexibility in docking programs is a necessary task. Theoretically, accounting for molecular flexibility in a system that contains N atoms will result in 3N degrees of freedom (3 degrees of freedom for translating each atom). This number of degrees of freedom results in a colossal rise in the computational complexity of docking calculations and cannot be treated directly. In order to reduce the size of the solution space, several approaches are taken, either alone or (more frequently) in various combinations. These approaches include explicit flexibility of only small parts of the system, soft potentials and low-resolution docking, and the use of multiple conformations.
Selective flexibility Among all the internal degrees of freedom that theoretically exist in the system, only dihedral torsions are usually taken into consideration. This is due to the substantially lower energy barriers of this type of movement, compared to bond stretching and angle bending[54]. In addition, internal flexibility is usually limited to certain portions of the interacting molecules. Treating ligand flexibility alone, and keeping the protein rigid, dramatically reduces the combinatorial complexity of a protein-ligand docking program. This approach is very popular. In fact, most of the modern protein-ligand docking programs are capable of dealing with full ligand flexibility but not with the conformational changes of a protein[95]. The rigidity of the protein is a reasonable approximation in many cases, and it has led to several successes. Nevertheless, accounting for receptor flexibility is a very important step toward improving the process of docking[4, 46, 49, 12]. Najmanovich et al. have shown that in many cases only a few side chains in the active site of a receptor change their conformations during ligand binding[73]. In other cases, hinge-like movements of large portions of the protein occur[89, 90], while the remaining parts of the system retain relative rigidity. These findings allow the user to partially unfreeze the protein, while keeping a feasible combinatorial size of the problem. Version 4.0.1 of the program AutoDock takes this approach, by allowing the user to specify the flexible parts of the receptor (side chains only). The ISE-dock program that is presented in this work (and was developed before the publication of AutoDock 4.0.1) takes a similar approach. Hinge-based docking studies have also been reported[89, 90, 78].
Soft potentials Allowing partial inter-penetration of molecules by lowering the repulsion penalties of VdW interactions is a way to implicitly account for molecular flexibility in docking simulations. For example, in a work by Ferrari et al.[19], a modified, softer Lennard-Jones potential was used in order to screen large libraries of molecules against T4 lysozyme, a protein that undergoes small conformational changes when binding different ligands. Yet another way to allow intermolecular penetration, and thus to handle protein flexibility implicitly, is to use proteins C
\[ RMSD = \sqrt{\frac{\sum_{i,j}^{N} \left[ (\Delta x_{ij})^2 + (\Delta y_{ij})^2 + (\Delta z_{ij})^2 \right]}{N}} \qquad (1.23) \]
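The RMSD of eq. (1.23) can be sketched in a few lines. This is a minimal illustration that assumes both structures list the same atoms in the same order and that no symmetry correction is needed; the function name and the two-atom coordinates are hypothetical.

```python
import math

def rmsd(reference, predicted):
    """RMSD between two equally ordered lists of (x, y, z) coordinates, in the input units."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = 0.0
    for (xr, yr, zr), (xp, yp, zp) in zip(reference, predicted):
        # squared displacement of one matched atom pair
        squared += (xr - xp) ** 2 + (yr - yp) ** 2 + (zr - zp) ** 2
    return math.sqrt(squared / len(reference))

# Two-atom toy example: one atom is displaced by 2 units, the other is unchanged.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
pose = [(0.0, 0.0, 2.0), (1.5, 0.0, 0.0)]
deviation = rmsd(crystal, pose)
```

Because every atom pair enters with the same weight, a large displacement of a single atom and a moderate displacement of many atoms can give the same value.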
Lack of specificity, inability to differentiate between more and less important regions in a complex, and the need for a reference structure are several pitfalls of this measure[13]. Nevertheless, RMSD is the measure of choice in the vast majority of docking techniques. It is widely accepted to treat solutions with RMSD values below 2.0
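\[
\begin{aligned}
\Delta G ={} & \Delta G_{vdW} \sum_{i,j} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right) \\
& + \Delta G_{hbond} \sum_{i,j} E(t) \left( \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} + E_{hbond} \right) \\
& + \Delta G_{elect} \sum_{i,j} \frac{q_i q_j}{\varepsilon(r_{ij})\, r_{ij}} \\
& + \Delta G_{sol} \sum_{i_C,j} S_i V_j \, e^{(-r_{ij}^{2}/2\sigma^{2})} \\
& + \Delta G_{tor} N_{tor}
\end{aligned}
\qquad (2.1)
\]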
The five ΔG terms in this equation are empirically determined using linear regression analysis, correlating a set of 30 protein-ligand complexes with known binding constants and solved 3D structures. The first and the third terms of the above equation are standard expressions for VdW and electrostatic interactions, respectively. In the second (H-bond) term, E(t) is a directional weight based on the H-bond angle t, and E_hbond is the estimated average energy of hydrogen bonding between water molecules and a polar atom. The unfavorable entropy effect of ligand binding (the fifth term) is a function of the number of sp3 bonds, N_tor. The solvation term of eq. (2.1) considers fragmental volumes of only carbon atoms in the ligand (i) and all atom types in the receptor (j). Parametrization of the carbon atoms distinguishes between aliphatic and aromatic atom types. The constant coefficients in equation (2.1) (A_ij, B_ij, C_ij and D_ij) are specific for each pair of atom types.
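To make the role of the pair-specific coefficients concrete, the sketch below evaluates the 12-6 term and the torsional term of eq. (2.1) for a single atom pair. It uses the standard Lennard-Jones identity that places the minimum of depth ε at the equilibrium distance; the numerical values, the weight and the function names are hypothetical and are not the AutoDock parameters.

```python
def lj_coefficients(r_eq, well_depth):
    """A_ij and B_ij for which the 12-6 term has its minimum, -well_depth, at r_eq."""
    a_ij = well_depth * r_eq ** 12
    b_ij = 2.0 * well_depth * r_eq ** 6
    return a_ij, b_ij

def pair_energy_12_6(a_ij, b_ij, r_ij):
    """Unweighted 12-6 contribution of one atom pair at distance r_ij."""
    return a_ij / r_ij ** 12 - b_ij / r_ij ** 6

def torsional_term(weight_tor, n_tor):
    """Entropic penalty that grows linearly with the number of rotatable bonds."""
    return weight_tor * n_tor

# Hypothetical pair: equilibrium distance 4.0 (in Å) and well depth 0.15 (in kcal/mol).
a, b = lj_coefficients(r_eq=4.0, well_depth=0.15)
energy_at_minimum = pair_energy_12_6(a, b, 4.0)
```

In the full function each such contribution is summed over all atom pairs and multiplied by its regression weight.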
Methods 37
During the docking process, the program evaluates any position of the ligand by interpolating over those grids for the protein-ligand interaction of each atom of the ligand according to its current position, and adding the internal conformational energy of the ligand. By default, the docking box has the dimensions 22.5 Å × 22.5 Å × 22.5 Å […] between grid points. The version of AutoDock used in this work (3.0.5) supports eight atom types: C (aliphatic carbon), A (aromatic carbon), N (nitrogen), O (oxygen), S (sulfur), H (hydrogen), X and M (spare types for additional atoms such as metal, halogen, phosphorus etc.). It is customary[106, 98, 17] to replace the original AutoDock parameters for zinc. We used the following parameters, which lead to more accurate energy calculations[39]: (radius: 0.87
\[ \ldots_{i}\; n_i \qquad (2.2) \]
\[ PD = \max_{i}^{N_{variables}} (n_i) \qquad (2.3) \]
where N_variables is the number of variables and n_i is the number of possible values for the i-th variable.
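The quantities that describe the size of the problem can be illustrated with a short sketch. It computes the total number of configurations as the product of all n_i, together with the largest n_i. Reading these as the quantities of eqs. (2.2) and (2.3) is an assumption on my part, and the pool below is hypothetical.

```python
import math

def combinatorial_size(values_per_variable):
    """Number of distinct configurations: the product of all n_i."""
    return math.prod(values_per_variable)

def largest_value_count(values_per_variable):
    """Largest number of values held by any single variable."""
    return max(values_per_variable)

# Hypothetical pool: 3 translations with 20 values each, 3 rotations with 12 values
# each and 2 torsions with 36 values each.
n = [20, 20, 20, 12, 12, 12, 36, 36]
size = combinatorial_size(n)
deepest = largest_value_count(n)
```

Even this small pool holds about 1.8 × 10^10 configurations, which is why values are eliminated before an exhaustive search is attempted.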
During the first phase (referred to as the elimination phase), a large number of conformations is generated. Each conformation is generated by randomly picking, for every variable, a single value from the pool and assigning it to that variable.
Algorithm 2.4 Iterative Stochastic Elimination Algorithm
Require: problem represented as a set of variables and possible discrete values
    generate pool
 2: initialize population
    while size(pool) > threshold do
 4:   generate sample S of s random configurations
      for all i ∈ S do
 6:     perform local optimization with probability P
        score_i = evaluate(i)
 8:     if (size(population) < outputSize) or (score_i < score_max) then
          add i to population
10:       truncate population to outputSize
        end if
12:   end for
      sort S
14:   L = low energy part of S
      H = high energy part of S
16:   for all variable var ∈ pool do
        for all value val ∈ pool_var do
18:       observedLow = number of occurrences of (var, val) ∈ L
          ratio = expectedLow(val)/observedLow(val)
20:       if ratio > threshold then
            rank = ratio/threshold
22:         mark pool_var,val for elimination with rank
          end if
24:       observedHigh = number of occurrences of (var, val) ∈ H
          ratio = observedHigh/expectedHigh
26:       if ratio > threshold then
            rank = threshold/ratio
28:         mark pair (var, val) for elimination with rank
          end if
30:     end for
        eliminate up to e% values with highest rank from pool_var
32:   end for
    end while
34:
    perform exhaustive search of pool, add best scored configurations to population
36: return population
The randomly generated conformations have a certain probability (0.06 by default) to undergo local optimization. The main purpose of the local optimization step is to resolve clashes and unfavorable conformations that are caused by the discrete nature of the variable values (translation, rotation and torsions). Unlike local optimization by the Lamarckian Genetic Algorithm in AutoDock, local optimization does not affect the variables in the possibilities matrix, only the energy values that are associated with them. The sample is evaluated and sorted. The sorted sample is divided into three uneven parts: subsets of lowest, highest and intermediate energy conformations. The intermediate subset is not used in the analysis. A particular value of a variable may be discarded from the pool of values if one of the two following criteria is met. The first criterion is the occurrence of a value in the higher energy subset with a significantly higher frequency than is expected under the random distribution assumption. Alternatively, a value may be eliminated if it appears in the lower energy subset with lower than random frequency. Not more than a user-specified portion of values may be discarded at each iteration (the default value is 10%). The elimination process is performed iteratively until the number of possible conformations enables exhaustive search in a feasible time. During the exhaustive phase, the solution candidates have a probability of P = 0.6 (default value) to undergo local optimization. Note that the local optimization probability in this exhaustive phase is ten times larger than the probability of local optimization during the elimination phase. During the whole process, a list of the best seen conformations is kept and updated.
The local optimization steps, the limit of discarded values per iteration and the fact that the best seen conformations are collected during the elimination phase are new to this implementation of ISE and were not present in previously published ones[30, 31].
The sample size and the sizes of the lower- and higher-energy subsets depend on the current pool depth (eq. 2.2) and are user configurable, as is the required ratio between the expected and the observed occurrences of a value (Algorithm 2.4, lines 20 and 26). The maximal fraction of eliminated values for each variable and the probability of local search during the elimination and exhaustive phases are also determined by the user.
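The elimination loop described above can be sketched on a toy problem. This is a simplified illustration and not the ISE-dock implementation: local optimization is omitted, the two frequency criteria are merged into a single rank, and all names, defaults and the toy energy function are assumptions made for this example.

```python
import itertools
import random

def ise_search(pool, score, sample_size=200, tail_fraction=0.2, ratio_threshold=2.0,
               max_eliminated=0.1, size_threshold=500, exhaustive_limit=25000,
               output_size=10, seed=0):
    """Toy ISE: return the best (energy, configuration) pairs seen, lowest energy first."""
    rng = random.Random(seed)
    pool = [list(values) for values in pool]
    population = {}  # configuration -> energy

    def pool_size():
        size = 1
        for values in pool:
            size *= len(values)
        return size

    def keep(config):
        population[config] = score(config)
        if len(population) > output_size:  # drop the worst entry to keep the list short
            del population[max(population, key=population.get)]

    while pool_size() > size_threshold:
        sample = [tuple(rng.choice(values) for values in pool) for _ in range(sample_size)]
        for config in sample:
            keep(config)
        ordered = sorted(sample, key=score)
        n_tail = max(1, int(tail_fraction * sample_size))
        low, high = ordered[:n_tail], ordered[-n_tail:]
        eliminated = False
        for var, values in enumerate(pool):
            expected = n_tail / len(values)  # count expected under random sampling
            ranks = {}
            for value in values:
                observed_low = sum(1 for c in low if c[var] == value)
                observed_high = sum(1 for c in high if c[var] == value)
                # rare among the low energies, or frequent among the high energies
                ratio = max(expected / max(observed_low, 0.5), observed_high / expected)
                if ratio > ratio_threshold:
                    ranks[value] = ratio
            limit = min(int(max_eliminated * len(values)) or 1, len(values) - 1)
            doomed = sorted(ranks, key=ranks.get, reverse=True)[:limit]
            if doomed:
                pool[var] = [v for v in values if v not in doomed]
                eliminated = True
        if not eliminated:
            break  # nothing stands out any more; stop rather than loop forever

    if pool_size() <= exhaustive_limit:  # exhaustive phase over the reduced pool
        for config in itertools.product(*pool):
            keep(config)
    return sorted((energy, config) for config, energy in population.items())

TARGET = (3, 7, 1, 8, 2, 5)  # hypothetical optimum of the toy energy function

def toy_energy(config):
    return sum((value - target) ** 2 for value, target in zip(config, TARGET))

best = ise_search([list(range(10))] * 6, toy_energy, seed=1)
```

The early exit when no value is marked is a safeguard of this sketch only; the loop of Algorithm 2.4 has no such exit.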
2.3.2 Problem representation
As in AutoDock, the ligand's configuration is encoded by real values that define its position in space (translation), its orientation (rotations about axes), and the internal rotations around single bonds. Unlike in AutoDock, we have decided to use three degrees of freedom to describe the rotations of the ligand around the principal axes. In our implementation, the rotation is defined by sequential rotations of the molecule around the X, Y and Z axes (in this order).
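A minimal sketch of such a sequential rotation is given below. It assumes right-handed rotations about fixed axes through the origin and angles in radians; in practice the coordinates would first be centered on the ligand. The function name is hypothetical.

```python
import math

def rotate_xyz(point, angle_x, angle_y, angle_z):
    """Rotate a point about the X axis, then the Y axis, then the Z axis."""
    x, y, z = point
    # rotation about X changes only y and z
    y, z = (y * math.cos(angle_x) - z * math.sin(angle_x),
            y * math.sin(angle_x) + z * math.cos(angle_x))
    # rotation about Y changes only x and z
    x, z = (x * math.cos(angle_y) + z * math.sin(angle_y),
            -x * math.sin(angle_y) + z * math.cos(angle_y))
    # rotation about Z changes only x and y
    x, y = (x * math.cos(angle_z) - y * math.sin(angle_z),
            x * math.sin(angle_z) + y * math.cos(angle_z))
    return (x, y, z)

quarter_turn = math.pi / 2
# The order matters: X first moves (0, 1, 0) onto the Z axis, where the Z rotation leaves it.
rotated = rotate_xyz((0.0, 1.0, 0.0), quarter_turn, 0.0, quarter_turn)
```

Because rotations do not commute, applying the same three angles in a different order generally gives a different orientation.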
2.3.3 Protein flexibility
Accounting for protein flexibility is a very important task, which, until recently, was ignored by the majority of protein-ligand docking programs[34, 97]. Proper inclusion of flexibility (as a set of rotations around side-chain and main-chain bonds) requires extensive changes to the current source code of ISE-dock and is thus beyond the scope of this work. Nevertheless, before further work is done, it is important to assess the ability of ISE to cope with this problem. The docking experiments that account for protein flexibility and are presented in this work serve as a demonstration of ISE capabilities.
Figure 2.1: "Tearing off" atoms to represent side chain flexibility using phenylalanine as an example. Dummy atoms are marked by the letter D in their names. The N, Cα and Cβ atoms on the receptor molecule overlap with their respective dummy counterparts.
Side chain flexibility in ISE-dock
As was previously described in Section 2.1 (page 35), the grid-based energy function implies that the entire protein remains frozen during the docking simulation. To overcome this limitation I have decided to transfer selected atoms from the protein to the ligand, as the ligand may be treated as flexible. This is a technical choice to overcome that limitation of the original program. Figure 2.1 describes the process. First, a set of flexible residues is identified using previous knowledge. Then, for each flexible residue, all the side-chain atoms, except for C […], Cα, Cβ and Cγ define χ1; Cα, Cβ, Cγ and Cδ define χ2, and so on. In order to prevent clash penalties due to the overlap, the common atoms on the ligand's side are marked as dummy atoms. Dummy atoms are ignored during energy calculations. All the atoms that originally belonged to the receptor molecule are excluded from the operations of translation and rotation, thus only the dihedral angles change during the ISE search.
The transfer of atoms from the receptor to the ligand breaks a covalent bond between C […] and C […] is considered a part of another molecule. This means that C […] atom is also marked as dummy. This measure means that C […]s from energy calculations inevitably leads to loss of accuracy. To test the validity of the "tearing off" approach, I have docked only the side chains, with a ligand molecule fixed in its crystallographic position. In these experiments (data not shown), the RMSD of the side chain atoms with respect to their observed position was below 0.3 Å.
Backbone flexibility in ISE-dock
The "tearing off" approach that was undertaken to include side chain flexibility of proteins is not suitable for flexibility of the backbone due to various technical limitations that are posed by the original code of AutoDock. In this work, multiple protein conformations were used as a target bank for the docking process. The multiple conformations of the protein were generated using the Iterative Stochastic Elimination algorithm[76, 86]. The ligand is docked separately to each of the generated protein conformations, each of which is kept frozen as usual. The results are combined according to the energy values.
2.4 Rigid protein docking
2.4.1 LGA docking
LGA docking has been proposed to be superior to other methods in AutoDock [70]. We have used the original (unmodified) AutoDock program to obtain the results for LGA. As already mentioned, we replaced the default pseudo-Solis-Wets local optimization with the original Solis-Wets algorithm. We have also changed the default number of solutions from 10 to 35 in order to allow AutoDock to perform as many energy evaluations (approximately 8.8 × 10^6) as were performed on average by ISE (approximately 8.6 × 10^6).
2.4.2 The data set
We used the public portion of the test set used by Perola et al.[84] in their comparison of docking algorithms. The original test set consisted of 150 pharmaceutically relevant protein-ligand structures, of which 100 are publicly available. The preparation process of these structures was performed by the Perola group[84]. We converted these files to mol2 format. Protein structures were kept in their bound conformation and were assigned charges from the Kollman (United Atoms) force field [111, 112]. In this force field, heavy atoms and the non-polar hydrogen atoms adjacent to them are treated as single (united) spheres, and the only hydrogen atoms that are accounted for individually are the polar ones. Ligands, co-factors and metal ions were assigned charges using the Gasteiger-Hückel method [28], which, unlike the former, treats all the atoms separately. Charge assignments were performed using Sybyl® 7.1. Ligand rotatable bonds were marked by visual examination. After the preparation, any existing co-factors were merged with the protein and treated as part of the appropriate protein model. Atom types were assigned automatically by the appropriate utilities in the AutoDock suite.
Of the 100 complexes, 19 were excluded for the following reasons:
• 1 complex (PDB code: 830c) containing both zinc and calcium
• 6 complexes with a co-factor that contains phosphorus atoms (due to lack of validated parameters): 1aoe, 1dib, 1dlr, 1frb, 1syn, 7dfr
• 8 complexes with ligands that contain more than 8 atom types (this limitation is imposed by AutoDock): 1qwx, 1ls, 1mq5, 1mq6, 1gl9, 1ydt, 2csn
• 4 structures with incomplete protein structure in proximity to the ligand (cutoff: 10
Table 2.1: PDB codes of the 81 complexes in the rigid protein test set.
13gs 1cim 1f0r 1h1s 1k1j 1nhu 1ydr 5std
1a42 1d3p 1f0t 1h9u 1k22 1nhv 1yds 5tln
1a4k 1d4p 1f4e 1hdq 1k7e 1o86 2cgr 7est
1a8t 1d6v 1fcx 1hfc 1k7f 1ppc 2pcp 966c
1afq 1efy 1fcz 1hpv 1kv1 1pph 2qwi
1atl 1ela 1fjs 1htf 1kv2 1qbu 3cpa
1azm 1etr 1fkg 1i7z 1l8g 1qhi 3erk
1bnw 1ett 1fm6 1i8z 1lqd 1qpe 3ert
1bqo 1eve 1fm9 1if7 1m48 1r09 3std
1br6 1exa 1g4o 1iy7 1mmb 1thl 3tmn
1cet 1ezq 1h1p 1jsv 1mnc 1uvt 4dfr
specificity pocket.
Figure 2.2: Structural alignment of 456c and 966c. Backbone traces of the proteins are color coded according to the distance (in Å) between the aligned backbone atoms. RS-130830 (red) and RS-104966 (green) are shown as stick models.
Comparisons and their analysis
Our flexible backbone docking involves initial prediction of loop positions, rigid docking of the ligand to these multiple loops and then combining the results into a single set. The computational effort that is involved in this multistep methodology is much greater than the computational cost of rigid-protein docking. Because several programs must be applied in order to obtain a set of final results, the effect of the additional investment of CPU time can be neither assessed nor isolated. Therefore we do not compare flexible-backbone docking to rigid protein docking.
Protein backbone conformations of fragments or loops are produced by applying ISE to the structure of the protein in the protein-ligand complex, without the presence of the ligand. To evaluate the results, we compare the fragment conformations to the original loop/fragment conformation in the complex by measuring the deviations of the backbone atoms (using RMSD). For the ligand, its predicted position is compared to the one observed crystallographically using RMSD of heavy atoms. Ligand RMSD of the top scored conformation, best RMSD in the top 20 and in all available solutions are reported. Ideally, RMSD of all movable atoms (protein backbone, side chain atoms and the ligand) needs to be calculated. To calculate RMSD over this set of atoms, one needs to take into account the numerous local axes of symmetry present in any protein-ligand complex. Phenyl rings, carboxylate and guanidine groups are examples of substructures that contain such axes. Correct accounting for symmetry axes is a complex combinatorial problem with an exponential complexity. Due to the preliminary nature of the flexible protein docking experiments and in order to simplify the process of evaluation, I decided to use two values simultaneously: RMSD of the ligand and RMSD of protein backbone atoms.
2.5.2 Flexibility of a single side chain
Test case of acetylcholinesterase
General
Acetylcholinesterase (AChE) plays an important role in regulating the functions of the central and peripheral nervous systems. This enzyme cleaves acetylcholine, which is secreted by neuron vesicles into the synapse that separates the vesicle and the membrane of the next cell in line. Acetylcholine encounters receptors on that membrane and activates the continuation of the neuronal transmission. AChE cleaves acetylcholine in a two-step reaction into choline and acetate, thus terminating the signal. The catalysis occurs in a very deep, electron-rich binding pocket, which is also called the "gorge" (see Figure 2.3).
Figure 2.3: Cross section of AChE complexed with acetylcholine (PDB code: 2ace), colored (A) by partial charge of the atoms and (B) by residue type (colored by PyMol): hydrophobic (GILMPV), white; aromatic (FWY), magenta; semipolar (C), yellow; polar (HNQST), cyan; positive (KR), blue; negative (DE), red. Acetylcholine is colored blue in both panes.
The protein structures of AChE complexed with Huperzine A (PDB code: 1vot) and with Aricept (PDB code: 1eve) differ mainly in the position of the side chain of one residue, Phe 330 (Figure 2.4)[97]. When AChE is complexed with Huperzine A (1vot), Phe 330 adopts the conformation that keeps the binding gorge closed. When, on the other hand, the bulkier Aricept molecule is present in the complex (1eve), Phe 330 adopts a conformation that allows the entry of this bigger ligand to the binding pocket. The difference between the two conformations is in the χ1 angle (1eve: 105.3°; 1vot: 58.9°).
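A side chain χ1 angle such as the one quoted above is the dihedral angle defined by the atoms N, Cα, Cβ and Cγ. The sketch below computes a signed dihedral angle from four points with the usual atan2 formulation; the coordinates are hypothetical and are not taken from 1eve or 1vot.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for the chain p0-p1-p2-p3."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Hypothetical positions standing in for N, CA, CB and CG of one residue.
n_atom, ca, cb, cg = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)
chi1 = dihedral(n_atom, ca, cb, cg)
```

Applying the same function to the four atoms of Phe 330 in two superimposed structures would give the pair of χ1 values that is compared in the text.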
Comparisons and their analysis
Figure 2.4: AChE complexed with Huperzine A (PDB code: 1vot, light gray) and with Aricept (PDB code: 1eve, dark gray). The ligands and Phe 330 side chains from both complexes are highlighted using sticks.
To assess the performance of ISE-dock, results of rigid-protein docking and cross-docking to AChE (1eve and 1vot) are compared to those obtained by flexible docking. A total of 4 cross docking experiments are performed with each method. The comparison is done using RMSD of the ligand only (heavy atoms) due to the very strong similarity between the backbones of 1eve and 1vot, differing by only RMSD 0.2
          Top scored pose               Best of top 20 poses          All solutions
          ISE    LGA    Glide  GOLD     ISE    LGA    Glide  GOLD     ISE    LGA
Minimum   0.52   0.39   0.3    0.41     0.25   0.31   0.3    0.34     0.2    0.31
Maximum   5.99   5.95   10.63  10.19    2.46   3.65   10.36  6.35     1.64   2.74
Mean      1.73   1.9    2.57   3        0.98   0.99   1.49   1.56     0.73   0.89
Median    1.33   1.55   1.63   2.17     0.84   0.81   1.11   1.1      0.69   0.72
SD        1.14   1.39   2.58   2.44     0.51   0.62   1.44   1.26     0.37   0.5
P(PTT): 0.09 0 0 0.46 0 0 0.01 0.006
Å. In the remaining threshold values, ISE and LGA outperform (to various degrees) Glide and GOLD with respect to the number of structures with top scored solutions below the corresponding threshold. For thresholds above 1.0 Å, there is a slight advantage of ISE over LGA, which increases for larger threshold values. About 70% (65% for LGA) of the top scoring structures are found by ISE-dock to be under 2.0 Å.
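The percentages quoted for each threshold are simple counts over the test set. A minimal sketch, with made-up RMSD values and a hypothetical function name, is:

```python
def success_rate(rmsd_values, threshold):
    """Percentage of complexes whose RMSD lies below the threshold (same units as the input)."""
    hits = sum(1 for value in rmsd_values if value < threshold)
    return 100.0 * hits / len(rmsd_values)

# Illustrative top-pose RMSD values (in Å) for five complexes; not data from this work.
top_pose_rmsd = [0.6, 1.2, 1.9, 2.4, 5.1]
rates = {t: success_rate(top_pose_rmsd, t) for t in (0.5, 1.0, 2.0, 3.0)}
```

Evaluating the same list at several thresholds gives one curve per program, which is how the threshold comparisons in this chapter are read.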
The mean and median RMSD values for the top scored poses, as well as the standard deviations, are better with ISE than with LGA, Glide or GOLD. The PTT values for ISE results vs the others are: LGA: P=0.09, Glide: P=0.002 and GOLD: P<0.001. P is the probability that the difference between the algorithms is random, as calculated by PTT.
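Assuming that PTT denotes a paired t-test over the per-complex RMSD values of two programs, the test statistic can be sketched as follows. Only the t value and the degrees of freedom are computed here; converting them into P requires the t distribution, which is left out. The input numbers are illustrative.

```python
import math
import statistics

def paired_t_statistic(sample_a, sample_b):
    """Return (t, degrees of freedom) for paired samples of equal length."""
    if len(sample_a) != len(sample_b) or len(sample_a) < 2:
        raise ValueError("paired samples must have the same length of at least 2")
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    mean_diff = statistics.mean(diffs)
    # standard error of the mean difference
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_error, len(diffs) - 1

# Made-up RMSD values (in Å) of two programs on the same four complexes.
t_value, dof = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 2.5, 3.0])
```

Pairing by complex removes the large variation in difficulty between complexes, so only the per-complex differences between the two programs enter the statistic.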
Top scoring poses are complexes of best interaction energy, and are expected to show the lowest RMSD from the experimental structure. However, they are frequently found to have larger RMSD values due to (1) limited inclusion of flexibility and (2) limitations of the scoring functions, which compromise between speed and quality. Still, these scoring functions are expected to be good enough to identify the best answers among the top results of a docking experiment, and the number 20 was chosen[84] to probe for such best RMSD results.
Flexible ligand rigid protein docking 67
Figure 3.1: Top single docking poses at different RMSD bins with respect to crystal structures, 4 different programs. Results for Glide and GOLD were obtained by Perola et al.[84].
3.2 Top 20 poses
Comparison of the top 20 poses demonstrates that ISE-dock outperforms both Glide and GOLD and shows better or similar performance compared to AutoDock's LGA. The mean and the median RMSD values of the best out of the top 20 poses obtained by ISE are similar to those obtained by LGA and are better than those obtained by the other two algorithms. Pairwise comparison shows that the performances of ISE and LGA on the top 20 poses are statistically indistinguishable (P=0.46). Examination of the best 20 docking poses shows that ISE is clearly better than Glide and GOLD, with a probability P≤0.001 with respect to either of these two (see Figure 3.1). Figure 3.2 demonstrates that LGA and ISE have an advantage over Glide and GOLD for the top 20 poses in all RMSD ranges. ISE results for the 0.5 Å, 2.0 Å and 3.0 Å thresholds are better than those of LGA. ISE alone produced at least one solution of 3.0 Å or better among the top 20 poses for the entire test set (100.0% compared to 97.5%, 90.1% and 87.6% for LGA, Glide and GOLD, respectively). In 98% of the examined molecules, ISE produced solutions that are closer than 2.0 Å to the experimental structure. Examination of the top 20 poses is most meaningful for comparing the programs, as it appears to indicate that the sampling conducted by ISE-dock is indeed more thorough than the sampling of the other programs.
Figure 3.2: Top 20 docking poses, RMSD to corresponding crystal structures. Results for
Glide and GOLD were obtained by Perola et al.[84].
3.3 Solution space coverage
ISE's ability to generate very large populations of near-optimal solutions results in much better coverage of solution space near the (global) minimum. This is borne out by comparing the best RMSD in the full set of solutions obtained by ISE and LGA in similar CPU time (4096 and 35 solutions, respectively). The population obtained in standard runs of ISE is more than 100-fold larger than that obtained by LGA. This significantly increases the chance of finding docking poses with lower RMSD values. It is reasonable to compare populations that differ that much in size, as we show in the discussion of alternative binding modes in the results section. I could not compare extended docking populations for Glide and GOLD, as no such data were reported. It should be emphasized that ISE's 4096 solutions in this case, and any number of solutions in other cases, are not merely poses encountered during the random search, but are the best ones following the probing of the whole space. The PTT probability value for comparison of the two docking sets is P=0.006. ISE results are better with respect to all five summary statistics (minimum, maximum, average, median and standard deviation) of the best RMSD in the entire solution set (Figure 3.1). When examining the percentage of complexes with at least one solution below a certain threshold, as depicted in Figure 3.3, the most prominent difference between the algorithms is at 0.5 Å. LGA succeeded in docking all the complexes with at least one solution below 3.0 Å. These findings suggest that populations docked by ISE, combined with a more accurate scoring technique, may lead to better detection and identification of relevant docking results.
The ISE docking population (comparing by CPU time, the 4096 top solutions of ISE vs 35 of LGA) is much more diverse in its poses than that produced by LGA. We clustered the poses using the Sequential Leader Clustering algorithm[36], with a default distance criterion of 1.0 Å. The average number of clusters for the 81 molecules is 1870 for ISE and 14 for LGA.
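The clustering step can be illustrated with a generic sequential leader scheme: each pose joins the first cluster whose leader lies within the distance criterion, and otherwise starts a new cluster. This sketch is not the implementation used in this work; the function names and the single-atom poses are assumptions made for the example.

```python
import math

def pose_rmsd(pose_a, pose_b):
    """RMSD between two poses given as equally ordered lists of (x, y, z) tuples."""
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b))
    return math.sqrt(squared / len(pose_a))

def leader_clustering(poses, threshold=1.0, distance=pose_rmsd):
    """Return clusters as lists of pose indices; the first member of each is its leader."""
    leaders, clusters = [], []
    for index, pose in enumerate(poses):
        for cluster, leader in zip(clusters, leaders):
            if distance(pose, leader) <= threshold:
                cluster.append(index)  # joins the first leader that is close enough
                break
        else:
            leaders.append(pose)  # no leader within the threshold: start a new cluster
            clusters.append([index])
    return clusters

# Four single-atom "poses" that form two well separated groups.
toy_poses = [[(0.0, 0.0, 0.0)], [(0.5, 0.0, 0.0)], [(5.0, 0.0, 0.0)], [(5.4, 0.0, 0.0)]]
toy_clusters = leader_clustering(toy_poses)
```

The result depends on the order of the input, so poses are usually presented in order of increasing energy, which makes the best scored pose of each cluster its leader.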
Figure 3.3: Top available docking poses produced in equal CPU times, RMSD to corre-
sponding crystal structures. The numbers of poses are 4096 (ISE) and 35 (LGA).
3.4 Time performance
Figure 3.4: Number of iterations before switching to exhaustive search as a function of initial combinatorial size (number of initial combinations).
We used the time performance of ISE and LGA in order to choose approximately equal processing times and analyze the number of solutions obtained in that span of time. The average time needed to obtain 4096 docking solutions on an Intel® Xeon™ 3 GHz computer, using ISE with the current settings, was about 7.5 minutes. The average time needed to obtain 35 solutions using LGA was about 8.3 minutes. As mentioned above, the time required by LGA is linear in the number of solutions. Thus, it is expected that more than 16 hours are required to obtain 4096 docking solutions with LGA. For AutoDock, it has been recently suggested to increase the reliability of results by obtaining more solutions and by increasing the number of evaluations[66]. Such an increase takes a substantial toll in computer time, which is absent in ISE. We could not compare the time performance of ISE-dock to those of Glide and GOLD. Results for the quality of the solutions with these programs are reported here as they appear in Perola et al.[84].
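The 16-hour estimate follows directly from the linear scaling assumption. A short check, using only the numbers quoted above and a hypothetical helper name:

```python
def estimated_minutes(reference_minutes, reference_solutions, target_solutions):
    """Linear extrapolation of run time with the number of requested solutions."""
    return reference_minutes * target_solutions / reference_solutions

# 35 LGA solutions took about 8.3 minutes; extrapolate to 4096 solutions.
lga_minutes = estimated_minutes(8.3, 35, 4096)
lga_hours = lga_minutes / 60.0
```

This gives roughly 971 minutes, or a little over 16 hours, compared to about 7.5 minutes for the same number of ISE solutions.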
The initial number of total possible combinations for ISE docking ranged from 10^12 to 10^34, depending on the number of ligand rotatable bonds, which ranged between 2 and 14. The number of iterations (between 50 and 76 for different molecules) needed to reduce the size of the problem below the threshold (10^5 combinations for switching to exhaustive computations) is approximately linear with respect to the logarithm of the initial problem size. The graph that describes this relationship is shown in Figure 3.4. Based on that linearity, it should be possible to extend the number of variables and values to include protein side chains, main chain angles as well as additional degrees of freedom.
3.5 Multiple binding modes
A growing body of data supports the existence of multiple binding modes of ligands to receptors[18, 9, 44, 91, 57, 27, 35, 81]. In order to learn about multiple binding modes from ISE-dock, the shape of the energy landscapes around minima in energy vs RMSD graphs of ISE results is examined. These plots may be roughly divided by visual examination into three groups: those with one distinct funnel, those with multiple funnels and those with no distinct funnel. It has been suggested[62] that the existence of a single canyon at the bottom of the energy landscape corresponds to a stable structure, multiple minima might indicate the existence of multiple binding modes, and rugged and unshaped energy vs RMSD plots may be the result of looser or non-specific binding, induced fit phenomena or domain swapping.
Figure 3.5A shows a representative of a few complexes whose energy vs RMSD plots show a single funnel-like region (PDB code 1yds). As expected, in this case, the docking solutions are structurally close to the crystallographic pose and to one another (Figure 3.5B). Figure 3.6A demonstrates an energy vs RMSD plot with two funnels (PDB code 1bqo), while Figure 3.7A shows such a plot with no distinct funnel (docking results of 1hpv). As one may see from Figure 3.7B, there are at least two predicted binding modes for this complex, which is in agreement with our previous suggestion. In Figure 3.7B, the ligand positions are spread over a large conformational variation. Energy vs RMSD plots of the entire data set of 81 complexes after a single docking run are presented in Figures C.1–C.7 (Appendix C.2).
Figure 3.5: A: Energy vs RMSD plot for docking populations of the complex 1yds obtained with ISE, showing a single distinct funnel. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The first 35 solutions (dark lines) docked by ISE vs the ligand in the crystal (gray sticks). Receptor residues with at least one atom within 5.5 Å of the ligand are shown as light gray cartoon. All structures in this work were visualized using PyMol[15].
Figure 3.6: A: Energy vs RMSD plot for docking populations of the complex 1bqo obtained
with ISE, showing two distinct funnels. B: the same plot for 35 solutions obtained by LGA.
The plots are shown using the same scale. C: The crystal structure of the ligand (gray
sticks) and the first 35 solutions (dark lines) docked by ISE.
In 27 cases (33%), the span of energy for 4096 solutions between the
global minimum (GM) and the docking solution of highest energy is less than 5
kcal/mol. In 50 cases (62%), all 4096 solutions are within 5-15 kcal/mol
from the GM, and in only 4 cases (5%), the energy spread is larger than 15
kcal/mol. Figure 3.8 shows the cumulative percentage of solutions (for 81
complexes, each with 4096 poses) with increasing energy gaps from the GM,
thus clarifying that most conformations are close to the GM. These 4 plots
with high energy minima (1fm9, 1hpv, 1qbu, 3std) have (as in Figure 3.7A) no distinct
funnel. The docking poses of these 4 complexes have no single binding mode,
but are dispersed. The main feature of these complexes is that the ligands are
deeply buried in the binding pockets (data shown for 1hpv, Figure 3.7).
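The energy-span statistics described above can be illustrated with a minimal Python sketch. It assumes only that each docking population is available as a list of energies in kcal/mol; the function names and the toy population are hypothetical and are not taken from ISE-dock.

```python
from typing import List


def energy_span(energies: List[float]) -> float:
    """Gap between the global minimum and the highest-energy pose."""
    return max(energies) - min(energies)


def span_class(span: float) -> str:
    """Assign a span (kcal/mol) to one of the three ranges used in the text."""
    if span < 5.0:
        return "<5"
    if span <= 15.0:
        return "5-15"
    return ">15"


def cumulative_fraction(energies: List[float], thresholds: List[float]) -> List[float]:
    """Fraction of poses whose gap from the global minimum is <= each threshold."""
    gm = min(energies)
    gaps = [e - gm for e in energies]
    return [sum(g <= t for g in gaps) / len(gaps) for t in thresholds]


# Hypothetical toy population (not thesis data)
toy = [-10.5, -10.1, -9.8, -9.1, -8.8, -7.0, -4.0]
print(span_class(energy_span(toy)))
print(cumulative_fraction(toy, [1.0, 2.0, 5.0, 10.0]))
```

Applied to every complex, the first two functions would give the per-complex classification, and the third gives one curve of the kind accumulated in Figure 3.8.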
Figure 3.7: A: Energy vs RMSD plot for docking populations of the complex 1hpv obtained
with ISE, showing a scatter of the results. B: the same plot for 35 solutions obtained by
LGA. The plots are shown using the same scale. C: The crystal structure of the ligand
and the first 35 solutions docked by ISE.
Figure 3.8: Cumulative fractions (Y-axis) of 81 ISE docking complexes with an energy
span between the global minimum of each (pose number 1) and the other 4095 poses,
below the given threshold (X-axis).
3.6 PDB data supports distinct funnels
Twenty four plots with multiple distinct funnels are found in our test set
(1azm, 1bqo, 1cim, 1eve, 1f4e, 1fm6, 1h1p, 1h9u, 1hdq, 1if7, 1iy7, 1jsv, 1k7e,
1kv1, 1qhi, 1qpe, 1r09, 1uvt, 1ydr, 3cpa, 3std, 4dfr and 5std). Ligands of
two of the twenty four complexes are present in the PDB in complexes with
other proteins (5-acetamido-1,3,4-thiadiazole-2-sulfonamide from 1azm in 9
complexes; 6-O-cyclohexylmethyl guanine from 1h1p in 2 complexes) but
display similar binding modes in all of them. One complex (3cpa) contains
glycyl-tyrosine as a ligand, which is not searchable in the PDB as it is not
recognized as a hetero compound. Two complexes have related structures in the
PDB: the same or similar proteins with different ligands (1f4e, 1kv1). Of these two,
I would like to concentrate on p38 MAP kinase, which was crystallized with an
inhibitor (PDB code: 1kv1; ligand HET ID: BMU)[80]. Another structure
of the same protein exists in the PDB bound to a structurally different ligand
(PDB code: 1kv2, ligand HET ID: B96)[80]. Figure 3.9 demonstrates
that those ligands bind in two different modes. The ligand in 1kv2 is much
larger (527 g/mol) than the ligand in 1kv1 (306 g/mol). An additional
noticeable difference between the two ligands is that the toluyl group of 1kv2
is positioned in the place of the CH2 pyrrole group of the ligand in 1kv1.
The energy vs RMSD plot for the 1kv1 complex (Figure 3.10) displays
three distinct funnels with solutions ranked 1, 222 and 270 at their bottom
(marked d1, d222 and d270). These three poses are summarized in Figure 3.2.
As may be seen in Figure 3.11, the top scored pose is close to the crystal
structure position (RMSD of 1.37 Å) [...]

Rank   Energy (kcal/mol)   RMSD (Å)
1      -10.51              1.37
222    -9.13               3.95
270    -8.84               4.69

[...] binding mode. Figure 3.14A shows the energy vs RMSD plots for ISE and
LGA docking solutions of 2rox.
Figure 3.13: ISE-dock solution ranked 270 for 1kv1 (sticks). The crystal structures
of 1kv1 and 1kv2 ligands are shown for comparison (lines). The coloring scheme is identical
to that in Figure 3.11.
In AutoDock, the lower number of solutions supplied by LGA compared
to ISE in similar CPU time provides fewer suggestions for ligand binding
modes. This is further emphasized by the smaller number of clusters of LGA
docking compared to ISE-dock, which covers solution space better than
LGA in a similar CPU time. Large ISE populations may thus improve upon
the imperfections in the energy functions.
Figure 3.14: Energy vs. RMSD plot for docking populations of the complex 2rox, obtained
by ISE (A) and LGA(B). The best single ISE solutions at each of the two funnels have
ranks 1 and 2 and are marked with arrows. C: Antiparallel docking solutions ranked 1 and
2 for 2rox (green and magenta sticks respectively). The carbons in the crystal structure
of thyroxine are shown as thin sticks colored cyan. The backbone of the residues closest
to the ligand (within 5.5 Å) is shown in PyMol cartoon representation colored cyan.
Chapter 4
Flexible Ligand Flexible Protein
Docking
4.1 Protein backbone flexibility: test case of collagenase
The coordinates of MMP13 and MMP1 (456c and 966c) were obtained
from the PDB. All the water molecules and metal ions, except for the
catalytic zinc, were removed. The ligands were separated from the protein and
saved in a separate file. As the 456c structure contains two identical chains,
only one of them (A) was used. Alternate positions of the conformationally
flexible loops (residues 248-253 for 456c and residues 244-247 for
966c) were produced by ISE. As any ISE implementation produces multiple
near-optimal solutions, only the conformations that differ from the best
scored one (global minimum) by not more than 5 kcal/mol were chosen for
the next step. For 966c, there were 31 such solutions and RMSD of backbone
atoms with respect to the crystallographic structure (over the flexible region
only) ranged between 0.09 Å and 0.33 Å [...] and 0.61 Å (456c-966c)(a) [...]
and 2.20 Å [...] and 3.49 Å for 456c and 1.18 Å was [...]

(a) In this work, the names of cross docking experiments follow the
[ligand name]-[receptor name] template.
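The 5 kcal/mol selection window applied to the ISE loop conformations can be sketched as follows. This is an illustrative assumption of how such a filter might look, not the ISE-dock code; the conformation names and energies are invented.

```python
from typing import List, Tuple


def near_optimal(solutions: List[Tuple[str, float]],
                 window: float = 5.0) -> List[Tuple[str, float]]:
    """Keep solutions whose energy is within `window` kcal/mol of the best one."""
    best = min(energy for _, energy in solutions)
    return [(name, energy) for name, energy in solutions
            if energy - best <= window]


# Hypothetical loop conformations with made-up energies (kcal/mol)
loops = [("loop_a", -42.0), ("loop_b", -40.5), ("loop_c", -36.9), ("loop_d", -37.0)]
print(near_optimal(loops))
```

Here `loop_c` lies 5.1 kcal/mol above the global minimum and is dropped, while `loop_d`, exactly 5.0 kcal/mol above it, is kept, matching the "not more than 5 kcal/mol" wording.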
Table 4.1: Collagenase data set, best ligands RMSD (Å) [...]

[...] Å for 1vot.
The best RMSD values among the top 20 solutions were 0.63 Å and 0.86 Å
for 1eve and 1vot, respectively. When no protein flexibility was allowed,
cross docking experiments, as expected, gave worse results than the native
(bound) docking. A decrease in the quality of the results was observed when
Aricept (1eve), the larger of the two ligands, was cross-docked into the protein
structure that was solved in complex with Huperzine A (1vot). The RMSD
value for the top ranked solution in that case was 2.91 Å.
Table 4.2: Results of acetylcholinesterase cross docking experiments (RMSD [Å]). The
results are reported for the best scored solution (Top 1) and the best RMSD values out
of the top 20 and out of all the available solutions (Top 4096). The ligand structures are
listed in rows and the protein structures are listed in columns.

                    Rigid docking    Flexible docking
                                     Ligands position   All movable atoms
          Ligand    1eve   1vot      1eve   1vot        1eve   1vot
Top 1     1eve      1.85   2.91      2.17   2.12        1.95   1.85
          1vot      1.09   0.86      2.60   0.72        2.28   0.70
Top 20    1eve      0.63   1.97      1.87   1.59        1.55   1.19
          1vot      1.03   0.81      2.47   0.70        2.14   0.68
Top 4096  1eve      0.39   1.43      1.29   1.40        0.48   0.85
          1vot      0.65   0.54      0.45   0.24        0.52   0.37

Flexible protein docking: Cross docking. When protein side chain (Phe330)
flexibility was allowed, cross docking of Aricept resulted in minor
improvements of RMSD values in the three tested parameters. On the other
hand, in the cross docking of Huperzine A, the top 1 and the top 20 solutions
had worse RMSD values, compared to those obtained by rigid cross
docking. However, a ligand pose much closer to the experimental one was found for
Huperzine A among the entire docking population, with an RMSD of 0.45 Å,
compared to 0.65 Å for 1vot-1eve cross-docking. On the other hand, the top
solution and the top 20 solutions in the cross-docking cases are of relatively
high RMSD. Figure 4.1 demonstrates the results of unbound docking for the
AChE data set.
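The RMSD values reported in this chapter compare docked coordinates with their experimental positions. The following Python sketch shows one plain way to compute such a value. It assumes that both structures list the same atoms in the same order and that no superposition is applied, since docking is carried out in the receptor frame; whether ISE-dock also accounts for ligand symmetry is not stated here, and the sketch ignores it. The coordinates are invented.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """RMSD between two equally ordered coordinate sets, without superposition."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    sq = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
             for (xa, ya, za), (xb, yb, zb) in zip(a, b))
    return math.sqrt(sq / len(a))


# Hypothetical coordinates (Å): every atom is displaced by 1 Å along z
crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
docked = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(docked, crystal))
```

For an "all movable atoms" value, the ligand atoms and the atoms of the flexible side chain would simply be concatenated into one coordinate list before calling `rmsd`.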
Figure 4.1: The best available docking solution for (A) 1eve-1vot and (B) 1vot-1eve in
unbound (cross-) docking experiments. The docking solutions for all the movable atoms
are shown as lines and the crystal structures are shown as sticks. The protein structures
are shown as backbone trace.
Flexible protein docking: Bound docking. When flexibility of Phe330
was included, the quality of bound docking results for Aricept (1eve-1eve)
was worse, compared to that obtained without protein flexibility. Ligand
RMSD values for the top scored solution, the best out of the top 20 and the
best available solution were respectively 2.17 Å, 1.87 Å and 1.29 Å. In the case
of Huperzine A bound docking (1vot-1vot), there was a slight improvement
in the prediction of ligand position: 0.72 Å vs 0.86 Å for the top scored
solution, 0.70 Å vs 0.81 Å for the best out of the top 20 and 0.24 Å vs 0.54 Å
for the best available solution (Table 4.2). The decrease in quality of bound
docking results upon the introduction of flexibility (as was observed in the
case of 1eve-1eve) can be related to the increase in problem complexity. On
the other hand, Phe330 flexibility during docking of Huperzine A into a closed
pocket (1vot-1vot) may have resolved minor clashes and, as a result, given
better results. Figure 4.2 illustrates the results of bound docking for the
AChE data set.
Figure 4.2: The best available docking solution for (A) 1eve-1eve and (B) 1vot-1vot in
bound docking experiments. The docking solutions for all the movable atoms are shown
as lines and the crystal structures are shown as sticks. The protein structures are shown
as backbone trace.
4.3 Flexibility of several side chains: Test case of trypsin
The RMSD values of torsional angles of the three residues that were treated
as flexible in this work are listed in Table 4.3. The structural differences
between the proteins along the data set (in terms of torsional RMSD values)
range from 2.7° (1tng vs 1tnh) to 62.1° (1ppc vs 3ptb).
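A torsional RMSD between two structures can be sketched in Python as below. The wrap-around of angle differences into [-180°, 180°) is an assumption made for this illustration; how periodicity was treated when the values in this work were computed is not described in this section. The angle values are invented.

```python
import math
from typing import Sequence


def angle_diff(a: float, b: float) -> float:
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def torsion_rmsd(t1: Sequence[float], t2: Sequence[float]) -> float:
    """RMSD (degrees) between two equally ordered lists of torsion angles."""
    if len(t1) != len(t2) or not t1:
        raise ValueError("torsion lists must be non-empty and of equal length")
    return math.sqrt(sum(angle_diff(x, y) ** 2 for x, y in zip(t1, t2)) / len(t1))


# Hypothetical side chain torsions (degrees) of the same residue in two structures
print(torsion_rmsd([179.0, -60.0], [-179.0, -60.0]))
```

Without the wrap-around, the first pair of angles would contribute a difference of 358° instead of 2°, which would badly overstate the structural difference.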
Cross docking of the 10 PDB structures resulted in 100 different docking
experiments. The detailed results of all the experiments are listed in
Appendix D. RMSD of top scoring poses, the best RMSD in top 20 poses and
the best RMSD of all the available poses are reported and analyzed in
Table 4.4 and Figure 4.3. These results are assigned to RMSD threshold bins.
The bins are identical to the ones that were used in the rigid protein docking
experiments (Section 4.2, page 87).
The overall results of cross docking over the trypsin data set are good.
Contrary to the intuitive expectation, the RMSD values over the diagonals
of Table 4.4 (bound docking) are frequently not the minimum ones. The
ligand from the 1tng complex is docked to all the protein structures with
lower RMSD values, compared to the remaining ligands. On the other hand,
the ligand from 1tpp has the highest RMSD values. The detailed docking
results for the trypsin data set are listed in Table D.1 in the Appendix.

Table 4.3: Torsion RMSD (in degrees) of flexible residues in the trypsin data set
     1ppc 1pph 1tng 1tnh 1tni 1tnj 1tnk 1tnl 1tpp 3ptb
1ppc 0.0 43.4 42.6 42.1 43.0 40.7 40.3 41.0 61.5 62.1
1pph 43.4 0.0 35.5 34.5 36.7 34.0 33.3 34.3 58.3 61.0
1tng 42.6 35.5 0.0 2.7 4.3 4.8 5.0 4.6 48.1 37.6
1tnh 42.1 34.5 2.7 0.0 3.5 4.5 3.7 3.3 48.6 37.9
1tni 43.0 36.7 4.3 3.5 0.0 6.9 6.1 4.2 48.0 36.9
1tnj 40.7 34.0 4.8 4.5 6.9 0.0 2.8 4.4 50.6 40.9
1tnk 40.3 33.3 5.0 3.7 6.1 2.8 0.0 3.3 48.8 39.6
1tnl 41.0 34.3 4.6 3.3 4.2 4.4 3.3 0.0 49.2 39.8
1tpp 61.5 58.3 48.1 48.6 48.0 50.6 48.8 49.2 0.0 31.7
3ptb 62.1 61.0 37.6 37.9 36.9 40.9 39.6 39.8 31.7 0.0
Color map: 0 6 12 18 24 30 36 42 48 54 60

Figure 4.3: Top docking poses at different RMSD bins with respect to crystal structures
No protein-ligand combination could be docked with a top scored solution
below an RMSD of 0.5 Å [...]

Table 4.4: [...] (Å), color coded
Receptor
Ligand 1ppc 1pph 1tng 1tnh 1tni 1tnj 1tnk 1tnl 1tpp 3ptb
Top 1
1ppc 1.7 3.4 2.8 3.0 2.6 2.0 2.7 3.0 3.4 2.5
1pph 3.9 4.7 4.6 4.3 4.5 3.9 4.3 2.8 4.5 3.6
1tng 1.0 1.1 0.5 0.6 1.0 0.8 0.9 0.6 1.0 1.0
1tnh 3.4 4.4 2.8 3.4 2.6 2.1 3.5 2.1 2.1 2.8
1tni 3.0 3.1 2.8 2.3 2.5 2.3 4.1 3.8 4.1 2.7
1tnj 4.0 3.2 3.8 2.5 3.2 2.3 2.8 2.5 2.2 3.5
1tnk 4.7 4.3 2.9 3.6 2.7 2.6 3.5 4.4 3.7 3.1
1tnl 4.5 2.1 3.0 2.6 2.7 1.8 2.7 3.3 2.1 3.4
1tpp 4.8 5.6 5.2 4.6 4.6 4.5 5.5 5.2 4.2 6.0
3ptb 3.0 3.3 3.2 3.7 3.2 2.7 3.5 3.1 3.1 2.8
Top 20
1ppc 0.9 2.5 1.4 1.9 1.5 1.3 1.1 1.3 1.6 1.7
1pph 2.4 2.1 2.0 2.2 2.0 2.0 2.7 2.0 2.5 2.2
1tng 0.6 1.0 0.4 0.5 0.6 0.5 0.6 0.5 0.6 0.8
1tnh 1.6 1.5 1.4 1.4 1.4 1.3 1.4 1.4 1.5 1.8
1tni 2.0 1.9 1.6 1.7 1.7 1.8 1.6 1.5 1.9 1.9
1tnj 2.2 1.8 1.5 1.5 1.7 1.4 1.4 1.5 1.7 1.7
1tnk 1.9 1.8 1.7 1.6 1.6 1.6 1.4 1.7 1.7 1.6
1tnl 1.9 1.5 1.3 1.3 1.3 1.3 1.4 1.3 1.4 1.4
1tpp 3.1 2.7 4.5 3.4 4.4 3.5 4.1 4.1 2.6 4.0
3ptb 1.8 1.6 2.6 2.3 2.4 2.3 1.9 2.4 1.9 2.2
Top 4096
1ppc 0.9 1.4 1.3 1.6 1.1 1.2 1.0 1.3 1.3 1.4
1pph 2.0 1.7 1.7 1.7 1.6 1.6 1.8 1.4 1.9 1.8
1tng 0.4 1.0 0.3 0.4 0.6 0.4 0.5 0.4 0.5 0.7
1tnh 1.3 1.1 1.1 1.2 1.2 1.1 1.1 1.2 1.2 1.3
1tni 1.3 1.4 1.2 1.4 1.3 1.2 1.2 1.1 1.5 1.2
1tnj 1.3 1.4 1.1 1.2 1.3 1.1 1.1 1.0 1.3 1.2
1tnk 1.6 1.4 1.4 1.2 1.4 1.3 1.2 1.4 1.4 1.4
1tnl 1.1 1.2 1.1 1.2 1.1 1.2 1.1 1.0 1.4 1.2
1tpp 2.1 1.9 2.2 1.8 2.5 2.0 2.0 2.7 1.8 2.8
3ptb 1.0 1.1 1.0 1.2 0.9 1.4 1.2 1.1 0.8 1.2
Color map: 0 0.3 0.6 0.9 1.2 1.5 1.8 2.1 2.4 2.7 3
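The observation that the diagonal (bound docking) entries are frequently not the row minima can be checked directly on the "Top 1" block of the table above. The short Python sketch below does so; the values are copied from the table, while the helper function and the choice to count ties as minima are assumptions of this illustration.

```python
codes = ["1ppc", "1pph", "1tng", "1tnh", "1tni", "1tnj", "1tnk", "1tnl", "1tpp", "3ptb"]

# "Top 1" block: rows are ligands, columns are receptors (RMSD in Å)
top1 = [
    [1.7, 3.4, 2.8, 3.0, 2.6, 2.0, 2.7, 3.0, 3.4, 2.5],
    [3.9, 4.7, 4.6, 4.3, 4.5, 3.9, 4.3, 2.8, 4.5, 3.6],
    [1.0, 1.1, 0.5, 0.6, 1.0, 0.8, 0.9, 0.6, 1.0, 1.0],
    [3.4, 4.4, 2.8, 3.4, 2.6, 2.1, 3.5, 2.1, 2.1, 2.8],
    [3.0, 3.1, 2.8, 2.3, 2.5, 2.3, 4.1, 3.8, 4.1, 2.7],
    [4.0, 3.2, 3.8, 2.5, 3.2, 2.3, 2.8, 2.5, 2.2, 3.5],
    [4.7, 4.3, 2.9, 3.6, 2.7, 2.6, 3.5, 4.4, 3.7, 3.1],
    [4.5, 2.1, 3.0, 2.6, 2.7, 1.8, 2.7, 3.3, 2.1, 3.4],
    [4.8, 5.6, 5.2, 4.6, 4.6, 4.5, 5.5, 5.2, 4.2, 6.0],
    [3.0, 3.3, 3.2, 3.7, 3.2, 2.7, 3.5, 3.1, 3.1, 2.8],
]


def diagonal_is_row_min(matrix):
    """For each ligand (row), is the bound-docking entry the lowest RMSD? Ties count."""
    return [row[i] == min(row) for i, row in enumerate(matrix)]


flags = diagonal_is_row_min(top1)
bound_best = [code for code, flag in zip(codes, flags) if flag]
print(bound_best)
```

For the top scored poses, only three of the ten ligands (1ppc, 1tng and 1tpp) reach their lowest RMSD in their own receptor structure.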
4.4 Discussion on protein flexibility
Accounting for protein flexibility introduces additional degrees of freedom,
but is a more realistic representation of biological systems. Until recently,
most major docking programs ignored conformational variations
of side chains and backbone of the receptors[97]. Nevertheless, due to the
advances in docking algorithms and in computational power, four out of
the five most cited docking programs for the year 2005[95] allow some extent of
protein flexibility (Table 4.5). Therefore, any newly proposed protein-ligand
docking program is expected to address protein flexibility. Due to time
constraints, handling protein flexibility by ISE-dock was implemented only
partially as a preliminary step before further development. In order to include
protein flexibility, the scoring function of AutoDock (and thus of
ISE-dock) was extended and applied to conditions that were not accounted
for during its construction and calibration. This application of the scoring
function in cases that differ dramatically from the ones that were used for
its construction and calibration was a trade-off between accuracy and
speed of development in the proof of concept phase, and
has a direct impact on the quality of results. Although limited to small
regions, protein flexibility handling in ISE-dock is successful and is another
demonstration of the ability of ISE to deal with multiple degrees of freedom
in protein-ligand docking problems. Indeed, docking experiments in all the
three test sets succeeded in producing high quality docking poses. The
solutions in the collagenase set contained ligand poses with ligand RMSD values
above 1.18 Å [...] (1tng-1tng).

Table 4.5: Current status of protein flexibility handling in ISE-dock and in five popular
docking programs (sorted according to the number of citations in 2005[95])
ISE-dock  Explicit flexibility of several side chains specified by the user.
          Implicit handling of changes in the backbone using pregenerated
          populations.
AutoDock  No protein flexibility in AutoDock ver.3. Recently released
          ver.4 allows side chain flexibility of selected residues
DOCK      Protein flexibility is not implemented
FlexX     FlexX-Ensemble (formerly known as FlexE): an extension
          of FlexX. The flexibility of the protein is represented
          by an ensemble of structures, combined to a so-called united
          protein description. It is possible to recombine elements from
          different ensemble structures
GOLD      Partial protein flexibility, including protein side chains and
          backbone flexibility for up to ten user-defined residues
ICM       Partial protein flexibility, including protein side chains and
          selected loops
The main pitfall of the flexible ligand - flexible protein docking using
ISE-dock is the scoring function. The original energy function does not
account for changes in the 3D structure of the protein. Implicit protein
flexibility (collagenase data set) involves combining solutions of docking a
ligand into different protein structures. Explicit handling of changes in the
protein 3D structure during the docking process involves transferring atoms
from the protein to the ligand and exclusion of C [...]

[...] Å among the top 20 scored poses (LGA of AutoDock finds 97.5%, Glide finds
90.1% and GOLD finds 87.7%), and with at least one RMSD<2.0 Å within
the entire docking population (LGA finds 96.3%, no information is available
on Glide and GOLD). PTT of top 20 solutions and all the available solutions,
applied to the results of ISE-dock and to the other algorithms, shows
a clear advantage for ISE-dock.
The more significant results of the flexible ligand - rigid protein docking
experiments are provided by the ability of ISE to achieve large near-optimal
populations of solutions without a significant additional CPU effort. These
populations improve the coverage of solution space and may be used to estimate
the shape of energy landscapes near minima and to suggest multiple
binding modes, as was demonstrated in two cases (p38 MAP kinase, 1kv1,
and human transthyretin, 2rox). The ability to analyze energy landscapes
accessible to ligands in a pocket has thus been shown to be useful. However,
the accuracy of that analysis cannot be fully assessed yet due to the lack
of experimental data. Although, theoretically, such an analysis of very large
docking populations is possible with other docking programs, to the best of
our knowledge, the energy (score) vs RMSD plots of docking solutions, although
known previously, were not used to visualize and estimate the energy
landscape of a protein-ligand complex.
Accounting for protein flexibility introduces additional degrees of freedom,
but gives a more realistic representation of biological systems. Handling
of protein flexibility was introduced into ISE-dock in a partial way. Even in
this premature implementation, ISE-dock was shown to successfully dock
flexible ligands into partially flexible protein structures, in which a few
side chains or backbone loops are flexible. In all the cases, the docking
populations obtained by ISE-dock contained good to excellent solutions.
In the collagenase data set (Section 4.1), flexible ligands were successfully
docked into protein structures with partially flexible loops. The accuracy in
predicting the structure of the backbone is very high, with RMSD of backbone
atoms as low as 0.13 Å [...] 2.49 Å [...] 1.76 Å).
Docking experiments with side chain flexibility (AChE, Section 4.2 and
trypsin, Section 4.3) were even more accurate: in the AChE case, the docking
populations contained solutions with RMSD values as low as 0.37 Å and in the
case of trypsin, the best population contained a solution with RMSD=0.30 Å.
The experiments presented in this work show that ISE is capable of solving
very complex problems. In addition to molecular flexibility, such problems
may target protonation and tautomerization states of both the protein
and the ligand, explicit simulation of water molecules etc. The latter task is
of great importance, as it is known (see for example [85, 104]) that including
water molecules improves the quality of docking results. In order to equip
ISE-dock with all these important features, one has to overcome two major
obstacles: (1) adaptation of the grid based scoring function to correctly treat
conformational changes in the protein and (2) docking several molecules (or
any independent entities) simultaneously.
Appendix A
Results published in a peer
reviewed journal
Following is the letter from the editor of the journal PROTEINS: Structure,
Function, and Bioinformatics, notifying that an article based on
this work has been accepted for publication.
Return-path: <onbehalfof@scholarone.com>
Envelope-to: boris@gorelik.net
...
Message-ID:
<439655644.1187888215280.JavaMail.wladmin@mcv3-wl18>
Date: Thu, 23 Aug 2007 12:56:55 -0400 (EDT)
From: PSFBeditor@jhu.edu
To: boris@gorelik.net
Subject: PROTEINS: Manuscript Prot-00274-2007.R1 Accepted
Cc: amiram@vms.huji.ac.il
Errors-To: proteins@jhu.edu, proteinsadmin@wiley.com
PROTEINS: Structure, Function, and Bioinformatics
23-Aug-2007
Dear Mr. Boris Gorelik:
Your manuscript entitled "High quality binding modes in docking
ligands to proteins" has passed all required peer review and has
been recommended to me by the Editorial Board. I am pleased
to accept the paper for publication in the next available issue of
PROTEINS.
You will receive an e-mail immediately following with instructions
for production of your article. I look forward to seeing it in press.
Congratulations on submitting such an excellent study.
Sincerely,
Eaton E. Lattman
Editor-in-Chief
PROTEINS: Structure, Function, and Bioinformatics
The Johns Hopkins University
Department of Biophysics
Baltimore, MD 21218 U.S.A.
Appendix B
ISE-dock and AutoDock
parameters and their values
B.1 AutoDock parameters and their
default values
Following are the default parameters of AutoDock v 3.0.5 with short
descriptions. For more details, see the manual published by the AutoDock
authors.
seed time pid # for random number generator
types CANOSH # atom type names
fld [PROTEIN_NAME].maps.fld # grid data file
map [PROTEIN_NAME].C.map # C-atomic affinity map file
map [PROTEIN_NAME].A.map # A-atomic affinity map file
map [PROTEIN_NAME].N.map # N-atomic affinity map file
map [PROTEIN_NAME].O.map # O-atomic affinity map file
map [PROTEIN_NAME].S.map # S-atomic affinity map file
map [PROTEIN_NAME].H.map # H-atomic affinity map file
map [PROTEIN_NAME].e.map # electrostatics map file
move [LIGAND_NAME].pdbq # small molecule file
about [X],[Y],[Z] # small molecule center
# Initial Translation, Quaternion and Torsions
tran0 random # initial coordinates/A or "random"
quat0 random # initial quaternion or "random"
ndihe 10 # number of initial torsions
dihe0 random # initial torsions
torsdof 0 0.3113 # num. non-Hydrogen torsional DOF & coeff.
# Initial Translation, Quaternion and Torsion Step Sizes
# and Reduction Factors
tstep 2.0 # translation step/A
qstep 50.0 # quaternion step/deg
dstep 50.0 # torsion step/deg
trnrf 1. # trans reduction factor/per cycle
quarf 1. # quat reduction factor/per cycle
dihrf 1. # tors reduction factor/per cycle
# Internal Non-Bonded Parameters
intnbp_r_eps 4.00 0.0222750 12 6 #C-C lj
[LENNARD JONES PARAMETERS FOR EACH PAIR OF ATOM TYPES]
intnbp_r_eps 2.00 0.0029700 12 6 #H-H lj
outlev 1 # diagnostic output level
# Docked Conformation Clustering Parameters for
# "analysis" command
rmstol 1.0 # cluster tolerance (Angstroms)
rmsref [LIGAND_NAME].pdbq # reference structure
# file for RMS calc.
write_all # write all conformations in a cluster
extnrg 1000. # external grid energy
e0max 0. 10000 # max. allowable initial energy,
# max. num. retries
# Genetic Algorithm (GA) and Lamarckian
# Genetic Algorithm (LGA) Parameters
ga_pop_size 50 # number of individuals in population
ga_num_evals 250000 # maximum number of
# energy evaluations
ga_num_generations 27000 # maximum number
#of generations
ga_elitism 1 # num. of top individuals that
# automatically survive
ga_mutation_rate 0.02 # rate of gene mutation
ga_crossover_rate 0.80 # rate of crossover
ga_window_size 10 # num. of generations for
# picking worst individual
ga_cauchy_alpha 0 # ~mean of Cauchy distribution
# for gene mutation
ga_cauchy_beta 1 # ~variance of Cauchy distribution
# for gene mutation
set_ga # set the above parameters for GA or LGA
# Local Search (Solis & Wets) Parameters
# (for LS alone and for LGA)
sw_max_its 300 # number of iterations of
# Solis & Wets local search
sw_max_succ 4 # number of consecutive successes
# before changing rho
sw_max_fail 4 # number of consecutive failures before
# changing rho
sw_rho 1.0 # size of local search space to sample
sw_lb_rho 0.01 # lower bound on rho
ls_search_freq 0.06 # probability of performing local
# search on an indiv.
set_psw1 # set the above pseudo-Solis & Wets parameters
# Perform Dockings
ga_run 10 # do this many GA or LGA runs
# Perform Cluster Analysis
analysis # do cluster analysis on results
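The listing above follows a simple "keyword, arguments, optional # comment" layout. As an illustration only, a file in this layout could be read with a few lines of Python. The sketch below is not AutoDock's own parser, and the function name `parse_dpf` is hypothetical; the sample lines are taken from the listing.

```python
from typing import Dict, List


def parse_dpf(text: str) -> Dict[str, List[List[str]]]:
    """Map each keyword to the list of its argument lists.

    Keywords such as `map` occur several times, so every occurrence is kept.
    """
    params: Dict[str, List[List[str]]] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue  # blank line or comment-only line
        keyword, *args = line.split()
        params.setdefault(keyword, []).append(args)
    return params


sample = """\
ga_pop_size 50 # number of individuals in population
ga_num_evals 250000 # maximum number of
# energy evaluations
map [PROTEIN_NAME].C.map # C-atomic affinity map file
map [PROTEIN_NAME].A.map # A-atomic affinity map file
set_ga # set the above parameters for GA or LGA
"""

params = parse_dpf(sample)
print(params["ga_pop_size"])
print(len(params["map"]))
```

Values are kept as strings on purpose, because the listing mixes integers, floats, file names and bare flags such as `set_ga`.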
B.2 ISE-dock parameters and their
default values
Following are the default parameters of ISE-dock. Parameters that are
common to AutoDock are not listed here.
# ISE docking parameters
ise_sample_size -50 # sample size. negative values mean that
# the size will be the product of current pool depth and
# the absolute value of this parameter
ise_conf_in_h_l -2 # number of conformations in the
# highest- and lowest- energy subsets. negative values
# mean that the size will be the product of current pool
# depth and the absolute value of this parameter
ise_output_size 40 # number of solutions in the final
# docking set
ise_z_value 3.84 # statistical value that determines
# the rigidity of the elimination process
ise_elimination_fraction 0.1 # limit the number of values
# that can be eliminated from any given gene
ise_threshold 1e5 # threshold to switch from the
# stochastic to the exhaustive search
ise_method stochastic # one of the following:
# stochastic exhaustive
ise_pool_file <use_dpf> # if file name is specified,
# read the initial pool from it if <use_dpf>, then
# use the *grid parameters listed below to initialize
# the possibilities pool
ise_t_grid 1.5 # translation grid
ise_r_grid 6 # rotation grid
ise_d_grid 6 # dihedral torsions grid
ise_optimize_solution FALSE # perform local
# optimization on the final docking solution
ise_optimize_on_elimination TRUE # perform local
# optimization during the elimination phase. use the
# value of ls_search_freq parameter for probability
# of performing local search
ise_optimize_on_exhaustive_freq 0.6 # probability
#of local search during the exhaustive phase
set_ise # set the above parameters
# Perform ISE docking
ise_run
# Perform Cluster Analysis
analysis # do cluster analysis on results
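The comments on `ise_sample_size` and `ise_conf_in_h_l` state that a negative value is interpreted as a multiplier of the current pool depth. A minimal Python sketch of that convention follows; the function name and the example pool depth of 12 are hypothetical, and the actual ISE-dock implementation may differ.

```python
def resolve_size(value: int, pool_depth: int) -> int:
    """Positive values are absolute sizes; negative values scale with pool depth."""
    if value < 0:
        return abs(value) * pool_depth
    return value


# Defaults from the listing, with an assumed pool depth of 12
print(resolve_size(-50, 12))  # ise_sample_size
print(resolve_size(-2, 12))   # ise_conf_in_h_l
print(resolve_size(40, 12))   # a positive value is used as given
```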
Appendix C
Detailed Results
C.1 Flexible Ligand Rigid Protein docking results
Table C.1.
       Top scoring pose          Best RMSD, top 20         Best RMSD, all available
CODE   ISE   LGA   Glide  GOLD   ISE   LGA   Glide  GOLD   ISE   LGA
13gs 1.86 2.30 2.81 1.52 0.46 0.72 2.69 1.09 0.25 0.58
1a42 1.65 3.30 1.47 5.28 0.47 0.97 1.47 2.26 0.47 0.79
1a4k 1.88 1.91 2.29 2.33 1.50 1.54 1.38 1.81 0.76 1.46
1a8t 2.27 3.51 1.11 4.69 0.86 0.80 1.11 2.07 0.85 0.71
1afq 2.07 2.93 1.12 1.35 1.06 1.01 0.53 1.35 1.06 1.01
1atl 3.21 3.04 2.10 1.55 0.95 1.22 1.46 1.55 0.92 1.04
1azm 2.33 2.81 2.04 2.60 1.97 2.17 1.24 0.66 0.54 1.97
1bnw 3.93 4.21 4.36 4.88 1.03 3.02 1.35 4.30 0.61 1.12
1bqo 0.92 0.61 1.60 1.55 0.72 0.51 1.60 1.35 0.72 0.48
1br6 1.85 1.85 3.51 1.82 1.64 1.83 1.69 0.63 0.44 1.82
1cet 2.05 4.21 3.05 8.52 1.71 1.88 2.80 5.30 0.75 1.81
1cim 1.16 1.16 1.54 1.30 0.66 0.65 1.34 1.03 0.23 0.58
1d3p 1.32 3.91 2.40 4.03 1.03 0.86 1.61 1.57 0.91 0.85
1d4p 0.98 1.56 2.35 2.69 0.74 0.86 0.74 0.99 0.50 0.79
1d6v 2.31 2.50 4.06 4.08 1.79 2.36 2.01 1.68 0.97 2.19
1efy 2.53 4.45 1.95 2.88 1.98 2.03 0.38 0.69 0.52 1.95
1ela 1.15 1.55 0.75 1.25 1.14 0.87 0.75 1.06 1.08 0.87
1etr 1.71 0.66 1.49 2.60 1.19 0.66 1.15 2.18 1.01 0.66
1ett 2.55 4.59 0.92 4.37 0.85 0.72 0.65 1.29 0.85 0.72
1eve 1.52 2.58 1.94 2.39 0.58 0.59 1.15 1.03 0.51 0.52
1exa 0.52 0.46 0.43 0.41 0.36 0.44 0.43 0.41 0.23 0.41
1ezq 2.65 2.19 10.63 2.25 1.68 1.06 4.30 1.10 1.58 1.02
1f0r 1.53 1.66 8.72 3.19 0.80 0.62 1.90 1.23 0.80 0.62
1f0t 1.24 4.84 2.26 2.12 0.84 0.89 1.60 2.06 0.84 0.89
1f4e 3.92 3.92 1.23 1.75 2.46 1.73 1 1.55 0.56 1.36
1fcx 0.58 0.58 0.48 0.74 0.50 0.55 0.48 0.49 0.20 0.53
1fcz 0.57 0.59 0.77 0.91 0.45 0.54 0.52 0.50 0.24 0.49
1fjs 1.49 1.59 5.04 2.12 1.31 0.73 3.44 1.44 1.31 0.73
1fkg 1.07 1.20 1.75 4.18 0.93 0.93 1.67 4.05 0.93 0.93
1fm6 2.84 0.40 0.64 0.68 0.69 0.35 0.64 0.65 0.69 0.35
1fm9 1.72 1.60 1.74 3.38 1.21 0.85 1.74 1.49 1.17 0.85
1g4o 3.70 3.99 2.15 4.59 2.21 2.92 1.62 0.81 0.58 2.44
1h1p 4.08 3.72 0.65 1.21 1.35 1.35 0.65 0.52 0.38 1.31
1h1s 0.80 0.62 0.97 1.16 0.61 0.42 0.97 1.16 0.58 0.36
1h9u 0.59 0.53 0.82 1.12 0.33 0.47 0.48 1.03 0.33 0.35
1hdq 1.07 1.88 2.16 3.67 0.55 0.84 0.62 0.84 0.37 0.77
1hfc 1.55 4.47 2.37 2.34 1.40 0.98 1 0.61 1.34 0.98
1hpv 1.11 1.73 1.20 9.47 1.01 0.88 1.19 1.38 1.01 0.88
1htf 2.55 1.64 10.12 10.19 1.53 0.59 1.99 3.13 1.49 0.59
1i7z 0.87 1.02 0.60 0.86 0.45 0.82 0.44 0.82 0.45 0.38
1i8z 0.72 1.92 3.82 3.66 0.55 0.74 2.55 2.69 0.39 0.63
1if7 3.65 4.40 1.43 5.42 1.64 3.65 1.34 1.65 0.87 2.74
1iy7 0.96 1.04 1.16 0.91 0.75 0.99 0.99 0.59 0.75 0.77
1jsv 0.88 1.25 5.45 6.94 0.74 0.71 3.40 5.36 0.69 0.71
1k1j 4.11 1.47 5.88 6.54 1.59 1.23 4.48 3.24 1.57 1.23
1k22 1.69 0.55 0.74 1.03 1.06 0.42 0.74 0.72 1.06 0.41
1k7e 0.88 0.74 0.72 0.96 0.56 0.53 0.68 0.53 0.21 0.31
1k7f 0.79 0.77 2.02 0.84 0.69 0.68 0.51 0.76 0.69 0.66
1kv1 1.21 1.21 0.66 0.81 0.70 1.14 0.59 0.56 0.27 0.66
1kv2 0.73 0.78 1.63 0.80 0.58 0.69 0.91 0.74 0.52 0.63
1l8g 1.33 1.60 2.90 2.17 0.74 1.50 1.57 2.17 0.70 1.16
1lqd 0.89 0.39 1.93 0.65 0.74 0.31 1.93 0.45 0.74 0.31
1m48 1.89 1.12 0.68 1.64 1.10 0.55 0.68 1.12 1.10 0.55
1mmb 2.11 2.12 3.18 6.11 1.79 1.32 1.16 1.37 1.64 1.32
1mnc 3.96 0.69 0.36 1.95 1.53 0.60 0.36 1.38 1.21 0.60
1nhu 3.38 3.51 6.07 5.17 1.02 1.07 3.16 3.75 0.69 1.07
1nhv 3.26 4.68 6.57 8.95 1.35 1.76 5.96 4.45 1.04 1.76
1o86 3.46 1.25 1.06 1.85 1.80 1.25 0.97 0.99 1.54 1.25
1ppc 1.60 1.59 1.69 1.76 1.37 1.20 1.62 1.76 1.30 1.20
1pph 3.39 2.38 5.09 4.95 1.36 1.42 1.09 0.88 1.02 1.42
1qbu 0.97 0.72 10.36 2.59 0.86 0.66 10.36 2.59 0.86 0.66
1qhi 0.66 0.69 0.30 0.66 0.51 0.58 0.30 0.41 0.31 0.55
1qpe 0.63 0.67 1.50 0.52 0.44 0.47 0.52 0.34 0.25 0.45
1r09 5.99 5.95 0.82 1.81 1.85 1.50 0.82 0.53 0.49 1.21
1thl 2.88 2.12 8.54 10.08 1.72 1.15 1.78 2.12 1.11 1.15
1uvt 0.85 0.60 0.44 1.47 0.66 0.49 0.44 0.54 0.66 0.49
1ydr 1.51 0.65 1.56 2.52 0.53 0.62 0.67 2.52 0.32 0.57
1yds 0.69 0.66 0.50 0.55 0.54 0.60 0.50 0.55 0.49 0.60
2cgr 0.79 0.85 0.85 6.54 0.62 0.73 0.67 6.35 0.62 0.66
2pcp 1 0.99 0.64 3.89 0.30 0.96 0.62 1.08 0.30 0.95
2qwi 0.56 0.71 0.70 1.30 0.37 0.60 0.70 0.96 0.37 0.51
3cpa 0.84 0.85 0.79 0.73 0.69 0.62 0.53 0.60 0.69 0.61
3erk 0.59 0.72 0.44 1.42 0.25 0.64 0.44 0.63 0.21 0.64
3ert 1.14 1.44 4.66 4.74 0.88 1.03 2.48 2.39 0.88 0.90
3std 0.60 0.56 2.44 0.85 0.40 0.48 2.44 0.85 0.39 0.35
3tmn 0.66 3.09 8.07 7.59 0.54 0.58 3.18 3.90 0.48 0.58
4dfr 1.10 1.01 1.27 1.20 0.74 0.81 1.10 1.18 0.72 0.81
5std 0.52 0.47 0.73 0.86 0.34 0.42 0.73 0.58 0.28 0.40
5tln 1.73 3.82 9.67 6.52 1.11 0.88 1.20 1.01 1.11 0.88
7est 0.84 0.79 1.02 3.76 0.75 0.63 0.82 0.87 0.75 0.63
966c 1.05 0.70 2.44 2.42 0.81 0.55 2.21 2.34 0.81 0.55
Table C.1: Detailed docking results of the flexible ligand - rigid protein
data set. RMSD [Å]
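Success rates at a given RMSD threshold can be derived from a table of this kind. The Python sketch below uses only the first four rows of Table C.1 (top scoring pose columns, read in the order ISE, LGA, Glide, GOLD), so its output describes this small excerpt and not the full 81-complex set; the function name and the 2.0 Å default are choices made for the illustration.

```python
from typing import Dict

# Top scoring pose RMSD (Å) for the first four complexes of Table C.1
rows: Dict[str, Dict[str, float]] = {
    "13gs": {"ISE": 1.86, "LGA": 2.30, "Glide": 2.81, "GOLD": 1.52},
    "1a42": {"ISE": 1.65, "LGA": 3.30, "Glide": 1.47, "GOLD": 5.28},
    "1a4k": {"ISE": 1.88, "LGA": 1.91, "Glide": 2.29, "GOLD": 2.33},
    "1a8t": {"ISE": 2.27, "LGA": 3.51, "Glide": 1.11, "GOLD": 4.69},
}


def success_rate(table: Dict[str, Dict[str, float]], program: str,
                 threshold: float = 2.0) -> float:
    """Fraction of complexes whose RMSD for `program` is below `threshold`."""
    values = [entry[program] for entry in table.values()]
    return sum(v < threshold for v in values) / len(values)


for program in ("ISE", "LGA", "Glide", "GOLD"):
    print(program, success_rate(rows, program))
```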
C.2 Flexible ligand rigid protein docking energy
landscapes
Following are the energy vs RMSD plots for ISE-dock and AutoDock of
all the 81 complexes in the flexible ligand - rigid protein docking set. The
graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.1: Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex. Continued on the following figures.
Figure C.2: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.3: Continued from the previous figure. Energy vs RMSD plots for ISE-dock
(red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking
set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.4: Continued from the previous figure. Energy vs RMSD plots for ISE-dock
(red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking
set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.5: Continued from the previous figure. Energy vs RMSD plots for ISE-dock
(red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking
set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.6: Continued from the previous figure. Energy vs RMSD plots for ISE-dock
(red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking
set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.7: Continued from the previous figure. Energy vs RMSD plots for ISE-dock
(red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking
set. The graphs are sorted alphabetically according to the PDB code of the complex.
Appendix D
Flexible ligand flexible protein
docking. Trypsin data set
Table D.1
Ligand only atoms All movable atoms
ligand protein top1 top20 top4096 top1 top20 top4096
1ppc 1ppc 1.72 0.87 0.87 1.84 1.26 1.23
1ppc 1pph 3.38 2.48 1.42 2.8 2.17 1.6
1ppc 1tng 2.84 1.44 1.27 2.59 1.48 1.48
1ppc 1tnh 3.02 1.91 1.59 2.56 1.83 1.6
1ppc 1tni 2.59 1.53 1.08 2.3 1.53 1.31
1ppc 1tnj 1.99 1.3 1.25 2.21 1.73 1.64
1ppc 1tnk 2.73 1.13 1.02 2.48 1.41 1.4
1ppc 1tnl 3.05 1.34 1.34 2.85 1.74 1.65
1ppc 1tpp 3.44 1.6 1.33 3.02 2.01 1.69
1ppc 3ptb 2.49 1.69 1.45 2.4 1.94 1.84
1pph 1ppc 3.86 2.38 2.04 3.86 2.38 2.04
1pph 1pph 4.66 2.14 1.7 4.66 2.14 1.7
1pph 1tng 4.56 1.97 1.69 4.56 1.97 1.69
1pph 1tnh 4.31 2.21 1.74 4.31 2.21 1.74
1pph 1tni 4.5 1.99 1.57 4.5 1.99 1.57
1pph 1tnj 3.88 2.01 1.59 3.88 2.01 1.59
1pph 1tnk 4.27 2.7 1.79 4.27 2.7 1.79
1pph 1tnl 2.77 1.96 1.42 2.77 1.96 1.42
1pph 1tpp 4.46 2.51 1.9 4.46 2.51 1.9
1pph 3ptb 3.57 2.22 1.76 3.57 2.22 1.76
1tng 1ppc 0.97 0.64 0.43 1.85 1.55 1.03
1tng 1pph 1.12 0.99 0.96 1.92 1.75 1.6
1tng 1tng 0.53 0.43 0.28 1.58 1.27 0.89
1tng 1tnh 0.64 0.54 0.4 1.69 1.32 1.09
1tng 1tni 0.99 0.63 0.63 1.9 1.47 1.25
1tng 1tnj 0.77 0.54 0.42 2.07 1.4 1.01
1tng 1tnk 0.9 0.59 0.55 1.91 1.29 1.08
1tng 1tnl 0.62 0.5 0.38 1.28 1.03 0.85
1tng 1tpp 1.04 0.61 0.53 2.34 1.75 1.58
1tng 3ptb 1 0.78 0.66 2.25 2.04 1.86
1tnh 1ppc 3.36 1.56 1.3 3.36 1.56 1.3
1tnh 1pph 4.36 1.5 1.08 4.36 1.5 1.08
1tnh 1tng 2.82 1.36 1.15 2.82 1.36 1.15
1tnh 1tnh 3.39 1.4 1.18 3.39 1.4 1.18
1tnh 1tni 2.56 1.41 1.17 2.56 1.41 1.17
1tnh 1tnj 2.08 1.31 1.11 2.08 1.31 1.11
1tnh 1tnk 3.51 1.43 1.09 3.51 1.43 1.09
1tnh 1tnl 2.07 1.41 1.23 2.07 1.41 1.23
1tnh 1tpp 2.12 1.45 1.21 2.12 1.45 1.21
1tnh 3ptb 2.78 1.82 1.27 2.78 1.82 1.27
1tni 1ppc 2.95 2 1.29 2.95 2 1.29
1tni 1pph 3.09 1.85 1.39 3.09 1.85 1.39
1tni 1tng 2.83 1.6 1.19 2.83 1.6 1.19
1tni 1tnh 2.32 1.68 1.35 2.32 1.68 1.35
1tni 1tni 2.53 1.71 1.3 2.53 1.71 1.3
1tni 1tnj 2.31 1.81 1.2 2.31 1.81 1.2
1tni 1tnk 4.12 1.64 1.2 4.12 1.64 1.2
1tni 1tnl 3.81 1.5 1.14 3.81 1.5 1.14
1tni 1tpp 4.12 1.89 1.5 4.12 1.89 1.5
1tni 3ptb 2.66 1.85 1.22 2.66 1.97 1.22
1tnj 1ppc 3.99 2.17 1.27 3.99 2.17 1.27
1tnj 1pph 3.24 1.84 1.35 3.24 1.84 1.35
1tnj 1tng 3.85 1.48 1.15 3.85 1.49 1.15
1tnj 1tnh 2.5 1.49 1.19 2.5 1.49 1.19
1tnj 1tni 3.25 1.68 1.29 3.25 1.68 1.29
1tnj 1tnj 2.34 1.44 1.14 2.34 1.44 1.14
1tnj 1tnk 2.77 1.43 1.07 2.77 1.43 1.07
1tnj 1tnl 2.46 1.49 1.04 2.46 1.49 1.04
1tnj 1tpp 2.2 1.72 1.3 2.2 1.72 1.3
1tnj 3ptb 3.53 1.67 1.22 3.53 1.67 1.22
1tnk 1ppc 4.74 1.95 1.62 4.74 1.95 1.62
1tnk 1pph 4.28 1.79 1.42 4.28 1.79 1.42
1tnk 1tng 2.9 1.66 1.39 2.9 1.66 1.39
1tnk 1tnh 3.62 1.56 1.17 3.62 1.56 1.17
1tnk 1tni 2.66 1.61 1.42 2.66 1.61 1.42
1tnk 1tnj 2.6 1.59 1.28 2.6 1.59 1.28
1tnk 1tnk 3.52 1.41 1.25 3.52 1.41 1.25
1tnk 1tnl 4.44 1.73 1.43 4.44 1.73 1.43
1tnk 1tpp 3.67 1.75 1.45 3.67 1.75 1.45
1tnk 3ptb 3.09 1.65 1.37 3.09 1.65 1.37
1tnl 1ppc 4.46 1.86 1.12 4.46 1.86 1.12
1tnl 1pph 2.1 1.5 1.21 2.1 1.5 1.21
1tnl 1tng 3.05 1.34 1.14 3.05 1.34 1.14
1tnl 1tnh 2.56 1.33 1.18 2.56 1.33 1.18
1tnl 1tni 2.67 1.34 1.07 2.67 1.34 1.07
1tnl 1tnj 1.78 1.33 1.23 1.78 1.33 1.23
1tnl 1tnk 2.72 1.38 1.1 2.72 1.38 1.1
1tnl 1tnl 3.29 1.3 1.03 3.29 1.3 1.03
1tnl 1tpp 2.06 1.37 1.37 2.06 1.37 1.37
1tnl 3ptb 3.39 1.38 1.21 3.39 1.38 1.21
1tpp 1ppc 4.85 3.09 2.06 4.84 3.09 2.06
1tpp 1pph 5.58 2.73 1.94 5.58 2.73 1.94
1tpp 1tng 5.15 4.5 2.2 5.15 4.5 2.2
1tpp 1tnh 4.56 3.44 1.77 4.56 3.44 1.77
1tpp 1tni 4.61 4.39 2.49 4.61 4.39 2.49
1tpp 1tnj 4.5 3.54 1.96 4.5 3.54 1.96
1tpp 1tnk 5.53 4.06 1.99 5.53 4.06 1.99
1tpp 1tnl 5.19 4.11 2.74 5.19 4.11 2.74
1tpp 1tpp 4.23 2.61 1.78 4.23 2.61 1.78
1tpp 3ptb 5.97 3.99 2.82 5.97 3.99 2.82
3ptb 1ppc 3.04 1.75 0.98 3.04 1.75 0.98
3ptb 1pph 3.28 1.61 1.11 3.28 1.61 1.11
3ptb 1tng 3.16 2.59 0.97 3.16 2.59 0.97
3ptb 1tnh 3.7 2.3 1.18 3.7 2.3 1.18
3ptb 1tni 3.17 2.35 0.93 3.17 2.35 0.93
3ptb 1tnj 2.7 2.32 1.37 2.7 2.32 1.37
3ptb 1tnk 3.49 1.95 1.2 3.49 1.95 1.2
3ptb 1tnl 3.09 2.38 1.09 3.09 2.38 1.09
3ptb 1tpp 3.14 1.94 0.77 3.14 1.94 0.77
3ptb 3ptb 2.8 2.22 1.24 2.8 2.22 1.24
Table D.1: RMSD [
and C
atoms on
the receptor molecule overlap with their respective dummy
counterparts. . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.2 Structural alignment of 456c and 966c. Backbone traces of
the proteins are color coded according to the distance (in Å)
between the aligned backbone atoms. RS-130830 (red) and
RS-104966 (green) are shown as stick models. . . . . . . . . . 58
List of Figures 124
2.3 Cross section of AChE complexed with acetylcholine (PDB
code: 2ace), colored (A) by the partial charge of the atoms and
(B) by the residue type (colored by PyMol): hydrophobic
(GILMPV) white, aromatic (FWY) magenta, semipolar
(C) yellow, polar (HNQST) cyan, positive (KR) blue,
negative (DE) red. Acetylcholine is colored blue in both
panels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.4 AChE complexed with Huperzine A (PDB code: 1vot, light
gray) and with Aricept (PDB code: 1eve, dark gray). The
ligands and Phe 330 side chains from both the complexes are
highlighted using sticks. . . . . . . . . . . . . . . . . . . . . . 61
2.5 Trypsin data set. 10 superimposed trypsin structures: 1ppc,
1pph, 1tng, 1tnh, 1tni, 1tnj, 1tnk, 1tnl, 1tpp and 3ptb. The
ligand molecules and the residues that are treated as flexible
are shown as sticks. The remaining parts of the proteins are
shown as backbone trace. . . . . . . . . . . . . . . . . . . . . . 63
3.1 Top single docking poses at different RMSD bins with respect
to crystal structures, 4 different programs. Results for Glide
and GOLD were obtained by Perola et al.[84]. . . . . . . . . . 67
3.2 Top 20 docking poses, RMSD to corresponding crystal struc-
tures. Results for Glide and GOLD were obtained by Perola
et al.[84]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.3 Top available docking poses produced in equal CPU times,
RMSD to corresponding crystal structures. The numbers of
poses are 4096 (ISE) and 35 (LGA). . . . . . . . . . . . . . . . 71
3.4 Number of iterations before switching to exhaustive search
as a function of initial combinatorial size (number of initial
combinations). . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.5 A: Energy vs RMSD plot for docking populations of the com-
plex 1yds obtained with ISE, showing a single distinct funnel.
B: the same plot for 35 solutions obtained by LGA. The plots
are shown using the same scale. C: The first 35 solutions (dark
lines) docked by ISE vs the ligand in the crystal (gray sticks).
Receptor residues with at least one atom within 5.5 Å of the
ligand are shown as light gray cartoon. All structures in this
work were visualized using PyMol[15]. . . . . . . . . . . . . . . 74
3.6 A: Energy vs RMSD plot for docking populations of the com-
plex 1bqo obtained with ISE, showing two distinct funnels. B:
the same plot for 35 solutions obtained by LGA. The plots
are shown using the same scale. C: The crystal structure of
the ligand (gray sticks) and the first 35 solutions (dark lines)
docked by ISE. . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.7 A: Energy vs RMSD plot for docking populations of the com-
plex 1hpv obtained with ISE, showing a scatter of the results.
B: the same plot for 35 solutions obtained by LGA. The plots
are shown using the same scale. C: The crystal structure of
the ligand and the first 35 solutions docked by ISE. . . . . . . 76
3.8 Cumulative fractions (Y-axis) of 81 ISE docking complexes
with an energy span between the global minimum of each (pose
number 1) and the other 4095 poses, below the given threshold
(X-axis). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.9 Complexes 1kv1 (light gray) and 1kv2 (dark gray) superimposed
using backbone atoms. The ligands are shown as sticks
and the backbone of the residues closest to the ligand (within 5.5 Å)
is shown as PyMol cartoons. . . . . . . . . . . . . . . . . . 79
3.10 Energy vs RMSD plot for docking populations obtained by
ISE (A) and LGA (B) of the complex 1kv1. The plots are
shown using the same scale. The best single ISE solutions at
each of the three funnels have ranks 1, 222 and 270 and are
marked with arrows. . . . . . . . . . . . . . . . . . . . . . . . 80
3.11 The best ISE-dock solution for 1kv1 (sticks). The crystal
structures of 1kv1 and 1kv2 ligands are shown for compari-
son (lines). 1kv1 is colored according to: C cyan, N blue,
Cl green. 1kv2 is colored according to: C yellow, N blue,
O red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.12 ISE-dock solution for 1kv1, ranked 222 (sticks). The crystal
structures of 1kv1 and 1kv2 ligands are shown for comparison
(lines). The coloring scheme is identical to that of Figure 3.11 81
3.13 ISE-dock solution for 1kv1, ranked 270 (sticks). The
crystal structures of 1kv1 and 1kv2 ligands are shown for comparison
(lines). The coloring scheme is identical to that of
Figure 3.11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.14 Energy vs. RMSD plot for docking populations of the complex
2rox, obtained by ISE (A) and LGA(B). The best single ISE
solutions at each of the two funnels have ranks 1 and 2 and are
marked with arrows. C: Antiparallel docking solutions ranked
1 and 2 for 2rox (green and magenta sticks respectively). The
carbons in the crystal structure of thyroxine are shown as thin
sticks colored cyan. The backbone of the residues closest to the
ligand (within 5.5 Å) is shown in PyMol cartoon representation
colored cyan. . . . . . . . . . . . . . . . . . . . . . . . . 83
4.1 The best available docking solution for (A) 1eve-1vot and (B)
1vot-1eve in unbound (cross-) docking experiments. The dock-
ing solutions for all the movable atoms are shown as lines and
the crystal structures are shown as sticks. The protein struc-
tures are shown as backbone trace. . . . . . . . . . . . . . . . 89
4.2 The best available docking solution for (A) 1eve-1eve and (B)
1vot-1vot in bound docking experiments. The docking solu-
tions for all the movable atoms are shown as lines and the
crystal structures are shown as sticks. The protein structures
are shown as backbone trace. . . . . . . . . . . . . . . . . . . 90
4.3 Top docking poses at different RMSD bins with respect to
crystal structures . . . . . . . . . . . . . . . . . . . . . . . . . 91
C.1 Energy vs RMSD plots for ISE-dock (red) and AutoDock
(green) of complexes in the flexible ligand - rigid protein docking
set. The graphs are sorted alphabetically according to the
PDB code of the complex. Continued on the following figures. 112
C.2 Continued from the previous figure. Energy vs RMSD plots
for ISE-dock (red) and AutoDock (green) of complexes in
the flexible ligand - rigid protein docking set. The graphs are
sorted alphabetically according to the PDB code of the complex. 113
C.3 Continued from the previous figure. Energy vs RMSD plots
for ISE-dock (red) and AutoDock (green) of complexes in
the flexible ligand - rigid protein docking set. The graphs are
sorted alphabetically according to the PDB code of the complex. 114
C.4 Continued from the previous figure. Energy vs RMSD plots
for ISE-dock (red) and AutoDock (green) of complexes in
the flexible ligand - rigid protein docking set. The graphs are
sorted alphabetically according to the PDB code of the complex. 115
C.5 Continued from the previous figure. Energy vs RMSD plots
for ISE-dock (red) and AutoDock (green) of complexes in
the flexible ligand - rigid protein docking set. The graphs are
sorted alphabetically according to the PDB code of the complex. 116
C.6 Continued from the previous figure. Energy vs RMSD plots
for ISE-dock (red) and AutoDock (green) of complexes in
the flexible ligand - rigid protein docking set. The graphs are
sorted alphabetically according to the PDB code of the complex. 117
C.7 Continued from the previous figure. Energy vs RMSD plots
for ISE-dock (red) and AutoDock (green) of complexes in
the flexible ligand - rigid protein docking set. The graphs are
sorted alphabetically according to the PDB code of the complex. 118
List of Tables
2.1 PDB codes of the 81 complexes in the rigid protein test set. . 51
2.2 Affinities to collagenase . . . . . . . . . . . . . . . . . . . . . . 57
3.1 Summary of docking results by ISE, LGA, Glide and GOLD. . 65
3.2 Binding modes of 1-(5-tert-butyl-2-methyl-2h-pyrazol-3-yl)- 3-
(4-chloro-phenyl)-urea (from 1kv1) . . . . . . . . . . . . . . . 81
4.1 Collagenase data set, best ligands RMSD (Å) in top 1, top
20 and all available (4096) solutions. RMSD of the backbone
from the crystal position of the corresponding solution is also
reported. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2 Results of Acetylcholinesterase cross docking . . . . . . . . . . 88
4.3 Torsion RMSD of flexible residues in the trypsin data set . . . 91
4.4 Trypsin data set, RMSD values of top single docking poses
and best docking poses in top 20 and top 4096 solutions . . . 93
4.5 Current status of protein flexibility handling in ISE-dock and in
five popular docking programs (sorted according to the number
of citations in 2005[95]) . . . . . . . . . . . . . . . . . . . 94
C.1 Detailed docking results of the flexible ligand - rigid protein
data set. RMSD [Å] . . . . . . . . . . . . . . . . . . . . . . . . 110
D.1 RMSD [