
Predicting Protein-Ligand Interactions using Iterative Stochastic Elimination Algorithm
Boris Gorelik, B.Pharm, M.Sc
A dissertation submitted to the Hebrew University of Jerusalem
for the degree of Doctor of Philosophy
2007
This work was conducted under the supervision of
Professor Amiram Goldblum
Abstract
Molecular docking is an in silico process for predicting the structure of
receptor-ligand complexes. Such predictions are of great importance in
various fields of life sciences, mainly in drug design efforts. Numerous
methods for solving this problem have been developed, employing a plethora
of algorithms.
Four main difficulties affect docking algorithms: the vast space that
needs to be searched, the scoring of ligand poses, the flexibility of both
partners (protein and ligand), and the introduction of water molecules that
frequently mediate intermolecular interactions.
This work presents ISE-dock, a docking program based on the Iterative
Stochastic Elimination (ISE) algorithm. ISE is a generic optimization
algorithm based on the elimination of values that consistently lead to the
worst results. It constructs large sets of near-optimal solutions with no
additional computational cost compared to producing single poses.
ISE-dock is based on the source code of AutoDock v.3.0.5 and uses its
scoring function. Development of a scoring function is beyond the scope of
this work. Unlike the original AutoDock program, ISE-dock is capable of
dealing with conformational changes of the receptor. Changes in the
receptor's backbone are implemented in an implicit, multistep way: multiple
structures of the protein are generated by an ISE-based program, and the
resulting structures serve as targets for subsequent docking of the ligand.
Explicit handling of changes in the protein 3D structure is made possible by
"tearing off" side-chain atoms from the protein, so that movable protein
atoms are treated as part of the ligand. In the current version of ISE-dock,
such handling of protein flexibility is limited to unconstrained rotations
of protein side-chain atoms. Although not done in this work, it is possible
to use rotamer libraries to decrease the complexity of the problem; thus the
experiments presented here represent the worst-case scenario in terms of
side-chain flexibility.
The ISE algorithm begins by constructing a matrix that contains the set of
possible (discrete) values for each degree of freedom (variable) that
defines the problem (system). If the problem is the prediction of a
molecular conformation and the degrees of freedom are rotatable bonds, then
the angular rotations around each bond are its discrete variables. One value
is picked at random from the set of each variable to determine the full
configuration (conformation) of the system, which is then evaluated by a
scoring function. This step is repeated many times to form a large sample,
usually in the $10^4$ to $10^6$ range. The scores of that sample are
arranged in a virtual histogram in which only a small fraction (1%-10%) of
the worst and of the best results are examined in detail, to assess the
contribution of each variable value to the final scores. A value that
appears among the worst results with a significantly higher frequency than
expected from its random distribution (based on its total appearance in the
full sample), or one that appears among the best results with a
significantly lower frequency than expected, is marked for elimination. The
next iteration of random picking, scoring, sampling and eliminating thus
begins with a smaller number of possible combinations. The elimination
process is performed iteratively until the number of possible conformations
enables an exhaustive search in feasible time. Additions in this work to
previous applications of ISE include:
• Local optimization of a randomly picked fraction of the sampled
  structures during the stochastic search (as performed by LGA in AutoDock
  [Morris et al. J Comp Chem 1998, 19, 1639–62]). During the final,
  exhaustive search step, any screened conformation has a probability of
  60% to undergo optimization. The main purpose of the local optimization
  step is to resolve clashes and unfavorable conformations that are caused
  by the discrete nature of the algorithm.
• Only a limited portion of the values may be discarded for any given
  variable in any iteration.
• Keeping and updating a list of the best encountered conformations. The
  size of the list is user-defined.
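The elimination loop described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the ISE-dock implementation: the names `ise_search`, `toy_score` and `TARGET`, all parameter defaults, and the fixed `enrichment` ratio that stands in for a proper significance test are hypothetical, and the local optimization step is omitted. Lower scores are treated as better.

```python
import heapq
import itertools
import math
import random
from collections import Counter


def ise_search(domains, score, sample_size=2000, tail_fraction=0.05,
               enrichment=1.5, max_drop_fraction=0.5, exhaustive_limit=5000,
               max_iterations=50, keep=20, seed=0):
    """Toy iterative stochastic elimination over discrete variables."""
    rng = random.Random(seed)
    domains = [list(d) for d in domains]
    best_seen = {}  # configuration -> score, the running list of best results
    k = max(1, int(tail_fraction * sample_size))  # size of each examined tail

    for _ in range(max_iterations):
        if math.prod(len(d) for d in domains) <= exhaustive_limit:
            break  # few enough combinations left for an exhaustive search
        sample = [tuple(rng.choice(d) for d in domains)
                  for _ in range(sample_size)]
        # Stable sort on the score only, so ties keep their random order.
        scored = sorted(((score(c), c) for c in sample), key=lambda sc: sc[0])
        best_seen.update((c, s) for s, c in scored[:keep])
        best = [c for _, c in scored[:k]]
        worst = [c for _, c in scored[-k:]]

        for var, values in enumerate(domains):
            total = Counter(c[var] for c in sample)
            in_best = Counter(c[var] for c in best)
            in_worst = Counter(c[var] for c in worst)
            flagged = []
            for v in values:
                expected = total[v] * k / sample_size  # count expected by chance
                if expected == 0:
                    continue  # value was never sampled in this iteration
                over_worst = in_worst[v] > enrichment * expected
                under_best = in_best[v] * enrichment < expected
                if over_worst or under_best:
                    flagged.append((in_worst[v] / expected, v))
            # Only a limited portion of a variable's values may be discarded.
            max_drop = min(int(max_drop_fraction * len(values)), len(values) - 1)
            flagged.sort(reverse=True)
            dropped = {v for _, v in flagged[:max_drop]}
            domains[var] = [v for v in values if v not in dropped]

    if math.prod(len(d) for d in domains) <= exhaustive_limit:
        for c in itertools.product(*domains):  # final exhaustive step
            best_seen[c] = score(c)
    return heapq.nsmallest(keep, ((s, c) for c, s in best_seen.items()))


# Hypothetical test problem: six "torsions" on a 30-degree grid.
TARGET = (90, 180, 270, 0, 30, 150)


def toy_score(conf):
    """Sum of circular angular distances (degrees) from TARGET."""
    return sum(min(abs(a - t), 360 - abs(a - t)) for a, t in zip(conf, TARGET))


if __name__ == "__main__":
    for s, conf in ise_search([range(0, 360, 30)] * 6, toy_score, keep=5):
        print(s, conf)
```

The function returns a list of `(score, configuration)` pairs rather than a single answer, mirroring the idea of keeping a user-sized population of near-optimal solutions.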
ISE-dock was validated using four independent data sets. Flexible ligand -
rigid protein docking was validated using 81 protein-ligand complexes from
the PDB, and ISE-dock performance was compared to that of Glide, GOLD and
AutoDock. Flexible ligand - flexible protein docking was tested using three
additional data sets: collagenase (backbone flexibility, 2 complexes),
acetylcholinesterase, AChE (flexibility of a single side chain, 2 complexes)
and trypsin (flexibility of several side chains, 10 complexes).
When no protein flexibility is allowed, ISE-dock has a better chance than
the other three to find more than 60% top single poses under RMSD = 2.0 Å
and more than 80% under RMSD = 3.0 Å from experimental. ISE alone produced
at least one 3.0 Å or better solution among the top 20 poses in the entire
test set. In 98% of the examined molecules, ISE produced solutions that are
closer than 2.0 Å to the experimental structure. Paired t-tests (PTT) were
used throughout to assess the significance of comparisons between the
performance of the different programs. ISE-dock provides more than 100 times
as many docking solutions as LGA in AutoDock in a similar time frame. The
usefulness of the large near-optimal populations of ligand poses is
demonstrated by showing a correlation between the docking results and
experiments that support multiple binding modes in p38 MAP kinase
[Pargellis, C. et al. Nat Struct Biol 2002, 9, (4), 268–72] and in human
transthyretin [Hamilton JA, Benson MD. Cell Mol Life Sci 2001;
58(10):1491–1521].
Introduction of partial handling of protein flexibility into ISE-dock
requires several changes to the original scoring function, which have a
strong impact on the quality of the top-ranked solutions. Nevertheless, the
complete sets of docking solutions in this work always contain ligand poses
of reasonable to very high quality.
Docking of a flexible ligand into a protein while partially unfreezing the
backbone was tested on two collagenase-inhibitor complexes from the PDB
(PDB codes: 456c, 966c). In this case, the bound docking solutions contain
ligand poses with reasonably low RMSD values of 1.33 Å (456c) and 1.18 Å
(966c).
Two structures of AChE (4 cross-docking experiments) and 10 structures of
trypsin (100 cross-docking experiments) with their respective inhibitors
demonstrate the capability of ISE-dock to deal with protein side-chain
flexibility. In both cases, high-quality docking solutions are obtained in
terms of RMSD of all movable atoms from their experimental positions.
Docking populations for AChE contain solutions with RMSD 0.37 Å and, in the
worst case, RMSD 0.85 Å. In 74 (out of 100) cases in the trypsin data set,
the top 20 docking solutions contain poses with RMSD < 2.0 Å. In 94 cases,
the entire docking sets contain solutions with RMSD < 2.0 Å, and all docking
sets contain solutions with RMSD < 3.0 Å.
This work shows that ISE-dock is superior in many aspects to the currently
well-established docking programs Glide, GOLD and AutoDock in flexible
ligand - rigid protein docking. It has also been shown that ISE-dock deals
successfully with various degrees of protein flexibility. In order to handle
flexible proteins to the full extent, the scoring scheme needs to be
redesigned. The latter task is beyond the scope of this work.
Protein flexibility is an important aspect of a protein-ligand docking
program. Other degrees of freedom that were not accounted for in this work,
but that can be introduced into ISE-dock relatively easily, are the modeling
of structurally important water molecules and the protonation and tautomeric
states of the interacting molecules.
Contents
1 Introduction
1.1 Current drug discovery process
1.2 Flexibility in molecular interactions
1.3 Energy and thermodynamic potentials
1.4 Common energy components
1.5 Force fields and scoring functions
1.5.1 Force field based energy functions
1.5.2 Approximate energy functions
1.5.3 Statistical potentials
1.5.4 Geometric and chemical complementarity functions
1.6 Energy funnels
1.7 Multiple binding modes
1.8 Docking techniques
1.8.1 Flexibility in docking programs
1.8.2 Search algorithms
1.8.3 Evaluating docking programs
1.9 Open problems and issues
2 Methods
2.1 Energy function
2.2 AutoDock docking program
2.2.1 Lamarckian Genetic Algorithm
2.2.2 Problem representation
2.3 ISE-dock program
2.3.1 Iterative Stochastic Elimination algorithm
2.3.2 Problem representation
2.3.3 Protein flexibility
2.4 Rigid protein docking
2.4.1 LGA docking
2.4.2 The data set
2.4.3 Comparisons and their analysis
2.4.4 Paired t-test
2.4.5 Comparing CPU time
2.4.6 Energy funnels
2.5 Flexible protein docking
2.5.1 Protein backbone flexibility
2.5.2 Flexibility of a single side chain
2.5.3 Flexibility of several side chains
2.5.4 Comparisons and their analysis
3 Flexible ligand - rigid protein docking
3.1 Top scoring poses
3.2 Top 20 poses
3.3 Solution space coverage
3.4 Time performance
3.5 Multiple binding modes
3.6 PDB data supports distinct funnels
4 Flexible ligand - flexible protein docking
4.1 Protein backbone flexibility
4.2 Flexibility of a single side chain
4.3 Flexibility of several side chains
4.4 Discussion on protein flexibility
5 Conclusions
Appendices (submitted separately)
A Results published in a peer reviewed journal
B ISE-dock and AutoDock parameters and their values
B.1 AutoDock parameters and their default values
B.2 ISE-dock parameters and their default values
C Detailed Results
C.1 Flexible ligand - rigid protein docking results
C.2 Flexible ligand - rigid protein docking energy landscapes
D Flexible ligand - flexible protein docking. Trypsin data set
List of Figures
List of Tables
Acknowledgments
Bibliography
Hebrew abstract
Chapter 1
Introduction
1.1 Current drug discovery process
Since the dawn of history, humankind has been searching for ways to fight
diseases and improve the quality of life. Modern science has undergone
tremendous developments and has successfully developed a great variety of
medicines. Nevertheless, the constant search for better drugs that reduce
side effects, cure more diseases, and extend life expectancy and quality has
never stopped. Drugs have traditionally been discovered by experimental
methods, but more recently, computerized (virtual) drug discovery methods
have been devised and have proved helpful in discovering and designing
drugs. Figure 1.1 presents an overview of current methods for designing and
discovering drugs. Roughly, the systematic search for new active molecules
can be divided into three categories: classical chemistry drug discovery,
high throughput screening and virtual high throughput screening.
Figure 1.1: Schematic diagram of the main methods in the drug discovery
process. Arrows designate process flow. Black asterisks mark steps that may
involve molecular docking. Abbreviations: SAR, structure-activity
relationship; QSAR, quantitative SAR; ADME-Tox, absorption, distribution,
metabolism, elimination, toxicity.
Classical chemistry drug discovery. During the classical drug design
process, medicinal chemists use their personal experience, combined with
rationalizing the knowledge of active compounds and the suspected drug
target. The process involves iterations of data evaluation, synthesis and
purification, and assessment of biological activity. Only a few compounds
can be processed simultaneously using this approach, which is still
labor-intensive, slow, and expensive, requiring costly materials and
techniques.
High throughput screening. In several large and medium-sized pharmaceutical
companies, high throughput screening (HTS), in which the activities of
hundreds of thousands of compounds are scanned robotically, has become a
major method. The targets for screening can be single molecules, colonies
of bacteria, fungi, or animal cells. In this kind of experiment, the effect
is recorded using fast and sometimes non-specific parameters such as color
change, conductivity of electric current, particle count, etc. HTS
experiments are frequently conducted without exact knowledge of the target
structure or the mechanism of action. While faster than the first approach,
HTS often suffers from ambiguity in the interpretation of results and may
still require expensive materials and equipment.
Virtual high throughput screening (V-HTS). In order to save time and reduce
costs, virtual HTS is designed to mimic the HTS task in silico and is
expected to indicate which compounds are worth testing in "wet" experiments.
Instead of screening real compounds against real targets, virtual computer
libraries of existing and not yet existing chemicals are used. Naturally,
this process is much cheaper and usually faster than the two former ones. On
the other hand, the V-HTS process relies heavily on the construction and
validation of the underlying computational methods and on the interpretation
of the results. The availability of fast, validated and accurate
computational screening methods is usually the major bottleneck of the V-HTS
approach. The main tool of V-HTS is molecular docking, in which a ligand or
potential drug is "driven" in order to find a good "parking place" on the
biological target.
Docking programs are computational tools that model the structure and the
nature (affinity) of molecular complexes. These programs aim to predict the
geometry of inter- and intra-molecular interactions and to rank the various
possibilities. The main advantage of computational techniques in general,
and of docking programs in particular, is that they are much cheaper and
faster than the corresponding "wet" techniques. Docking programs are used as
a primary screening tool during the virtual high throughput process. They
also assist biologists, biochemists, and medicinal chemists in designing
novel molecules and in interpreting experiments that assess the activity of
already existing ones.
Two main goals of docking tools are (1) to assist in designing novel chemi-
cal compounds, and (2) to study the nature of interactions between biological
targets and ligands. These may include endogenous molecules such as hor-
mones, or external ones such as drugs or toxic compounds.
Every docking program requires that the three-dimensional (3D) structure of
the target molecule be known to some extent. The Protein Data Bank (PDB)[5]
is a publicly available repository that contains more than 42,000 3D
structures of biological macromolecules solved at various resolutions. Since
1999, the U.S. National Institute of General Medical Sciences (NIGMS) has
sponsored a large-scale project called the Protein Structure Initiative
(PSI)[75]. The main goal of this initiative is to enlarge the number of
solved 3D structures of proteins, which would enable better coverage of the
existing drug targets and the discovery of new ones. Since the establishment
of the PSI, the project has yielded more than 1,800 solved protein
structures (as of June 2006), with a current estimated rate of more than 500
solved structures per year[67].
Despite the progress that the field of docking has undergone in the last
few years, several problems still exist. One of the major problems is energy
calculation. Another major problem is accounting for the many degrees of
freedom of the docking problem. These include flexibility of the molecules,
protonation and tautomeric states, etc. Considering all these degrees of
freedom results in a tremendous combinatorial space that each docking tool
has to search. Due to the flexible nature of molecules, it is important not
to limit the scope of the docking solution to a single structure but,
instead, to predict a collection (ensemble) of multiple low-energy
conformations that contribute to the biological activity.
In this work, I present ISE-dock, a protein-ligand docking tool that
successfully overcomes the huge combinatorial space problem while accounting
for ligand and, to a lesser extent, protein flexibility, and that is capable
of producing arbitrarily large docking populations without substantial
extension of CPU time.
1.2 Flexibility in molecular interactions
Since 1894, when Emil Fischer proposed the famous "lock and key" model[21],
the perception of the nature of binding between biological molecules has
undergone several changes. Although evidence that supports the "lock and
key" model exists (see for example:[6, 55, 22]), two models that are
considered to represent the majority of receptor-ligand interactions are
"induced fit"[48] and equilibrium of multiple pre-existing conformations[63,
52, 76]. Induced fit theory assumes that the conformations of the target and
the ligand affect each other as they approach an encounter. The conformation
of the final complex may not be derived directly from the conformations of
the separate molecules. The pre-existing conformations model assumes that
the final target and ligand conformations are already probed by the isolated
molecules, but they could be of much higher energy than the most abundant
conformation, and therefore their accessibility is minute in the absence of
the partner. It is not uncommon that the most populated unbound states of a
protein are not those that are most populated in the bound structure[97,
10]. The same notion is true for ligands: it was found[83] that ligands
rarely bind their receptors in the calculated global minimum conformation.
Moreover, in 60% of the cases, the bound ligand is not found even in a local
energy minimum, and at least 10% of the examined ligands bind with strain
energies over 9 kcal/mol.
Many theoretical and experimental studies support either the induced fit or
the pre-existing populations model in different cases of binding[37, 55,
63]. From a thermodynamic point of view, the two models are equivalent;
however, describing biological systems in terms of pre-existing populations
and conformational selection is more useful in the process of drug
discovery[97]. Regardless of which of the two models more accurately
describes the nature of binding, it is clear that molecular flexibility is
involved in complex formation.
The process of binding may result in either an increase or a decrease of
flexibility. Decreased flexibility may be attributed to enthalpy-entropy
compensation, when more effective binding interactions are gained by
freezing motion. On the other hand, complex formation may be stabilized by
an entropic contribution associated with increased flexibility[116]. It has
been suggested[101] that in 13 different MHC receptor-peptide complexes,
flexibility is associated with as much as 50% of the free energy of binding.
Flexibility plays an important role not only in complex formation, but also
in the mechanism of action of various complexes. For example, the
conformational changes of several enzymes are very important for their
activity[10, 40, 51]. Solved structures of protein-ligand complexes
frequently show complexes with 70-100% of the ligand's surface area buried.
Clearly, such conformations could not be achieved without at least a minimal
degree of protein flexibility. Works that analyze bound and apo-proteins
show that, although there are complexes in which the protein undergoes
almost no change upon ligand binding[41], proteins that bind small molecules
usually undergo conformational changes[97, 61, 60, 88].
1.3 Energy and thermodynamic potentials
The three most common thermodynamic potentials are: internal energy, en-
thalpy and Gibbs free energy.
Internal Energy. The internal energy (denoted as $U$ or $E$) of a
thermodynamic system is the total kinetic energy due to the motion of
particles and the potential energy associated with the vibrational and
electronic energy of atoms, including the energy of chemical bonds. Internal
energy does not include the kinetic energy due to the motion of the system
as a whole, nor does it account for potential energy due to the position of
the system in an external gravitational, electric or magnetic field.
The internal energy is essentially defined by the first law of
thermodynamics, which states that energy is conserved:
$$\Delta U = Q + W + W' \tag{1.1}$$
Where $\Delta U$ is the change in internal energy of a system during a
process, $Q$ is heat added to the system, $W$ is the mechanical work done on
the system, and $W'$ is energy added by all other processes.


Most biological interactions occur in a fluid environment. In such an
environment the mechanical work done on the system is related to the
pressure ($P$) and the volume ($V$):
$$\delta W = -P\,dV \tag{1.2}$$
The heat energy is a function of temperature ($T$) and entropy ($S$):
$$\delta Q = T\,dS \tag{1.3}$$
Thus, the internal energy of a system of biological interest may be
expressed:
$$dU = T\,dS - P\,dV \tag{1.4}$$
Enthalpy. Enthalpy or heat content (denoted as $H$ or $\Delta H$) describes
the amount of useful work that may be obtained from a closed thermodynamic
system under constant pressure. In the absence of an external field,
enthalpy is defined as:
$$H = U + PV \tag{1.5}$$
Where $U$, $P$ and $V$ are internal energy, pressure and volume,
respectively. Enthalpy is sometimes referred to as heat content because
under constant pressure and volume:
$$dH = \delta Q = T\,dS \tag{1.6}$$
Thus, the difference in enthalpy is the maximum amount of thermal energy
that may be obtained from a system.
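A short derivation may make the link between enthalpy and heat explicit. It is a sketch that uses only the definitions of internal energy and enthalpy given in equations (1.4) and (1.5), and it assumes a reversible process with pressure-volume work only:

```latex
\begin{align*}
H  &= U + PV \\
dH &= dU + P\,dV + V\,dP \\
   &= (T\,dS - P\,dV) + P\,dV + V\,dP \\
   &= T\,dS + V\,dP \\
dH &= T\,dS = \delta Q \qquad \text{at constant pressure } (dP = 0)
\end{align*}
```

Under these assumptions the $V\,dP$ term is the only one that separates the enthalpy change from the exchanged heat, and it vanishes once the pressure is held constant.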
The total enthalpy of a system cannot be measured directly. $\Delta H$, the
change in enthalpy, is measured instead. In an exothermic reaction at
constant pressure, the change in enthalpy equals the energy released from
the system. Similarly, in endothermic reactions, $\Delta H$ equals the
energy absorbed by the system. If the system is kept under constant pressure
and constant volume, the change in enthalpy equals the amount of heat that
is released from or absorbed by the system.
Gibbs Free Energy. The Gibbs free energy (which is frequently referred to
simply as free energy) is defined as:
$$G = U + PV - TS = H - TS = \sum_{i=1} \mu_i N_i \tag{1.7}$$
Where: $U$ is the internal energy; $P$ is the pressure; $V$ is the volume;
$T$ is the temperature; $S$ is the entropy; $\mu_i$ is the chemical
potential of the $i$-th chemical component; $N_i$ is the number of particles
(or number of moles) composing the $i$-th chemical component. It can be
shown that
$$\Delta G = \Delta H - T\Delta S \tag{1.8}$$
Where $\Delta S$ is the change in the internal entropy of the system. The
value of $\Delta G$ from equation (1.8) is used to determine whether a
chemical reaction is favorable or not: reactions with $\Delta G < 0$ will
occur spontaneously, while those with $\Delta G \geq 0$ will not.
Binding Affinity. Non-covalent receptor-ligand interactions may be written
in the following general form[72]:
$$RL \underset{k_a}{\overset{k_d}{\rightleftharpoons}} L + R$$
Where: $R$, $L$ and $RL$ are the receptor, ligand and receptor-ligand
complex, respectively; $k_d$ and $k_a$ are the kinetic constants of
dissociation and association, respectively. This reaction describes the
dissociation of a receptor-ligand complex. The thermodynamic equilibrium
constant of this reaction in ideal conditions is defined as:
$$K_d = \frac{[R][L]}{[RL]} \tag{1.9}$$
Where $[X]$ denotes the molar concentration of the component $X$. The
equilibrium constant can be related to the change in the Gibbs free energy
(eq. (1.8)) of the above dissociation reaction:
$$\Delta G = \Delta G^0 = -RT \ln K_d \tag{1.10}$$
Here, $R$ is the universal gas constant, and $T$ is the absolute
temperature. $\Delta G^0$ is the free energy change at equilibrium under
standard conditions (all the chemical components are at 1 M concentration,
T = 273.15 K, pressure = 1 atm).
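As a worked illustration of the relation between the dissociation constant and the free energy, the following minimal sketch converts a $K_d$ value into free energies. It assumes the usual sign convention, in which the standard free energy of the dissociation reaction is $-RT \ln K_d$ and the binding (association) free energy is its negative. The function names, the choice of kcal/mol units and the example values (a 1 nM ligand at 298.15 K) are illustrative assumptions, not values taken from this work.

```python
import math

R_KCAL = 1.987204e-3  # universal gas constant in kcal/(mol*K)


def dissociation_free_energy(kd_molar, temperature_k):
    """Standard free energy change of RL -> R + L: -RT ln(K_d), in kcal/mol."""
    return -R_KCAL * temperature_k * math.log(kd_molar)


def binding_free_energy(kd_molar, temperature_k):
    """Free energy of the reverse (association) reaction, in kcal/mol."""
    return -dissociation_free_energy(kd_molar, temperature_k)


if __name__ == "__main__":
    # Hypothetical example: a ligand with K_d = 1 nM at 298.15 K.
    print(round(binding_free_energy(1e-9, 298.15), 2))
```

With these assumptions, every tenfold decrease in $K_d$ lowers the binding free energy by $RT \ln 10$, roughly 1.4 kcal/mol near room temperature.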
In attempts to calculate the change in free energy upon binding (free
energy of binding), it is customary to separate the overall energy into
distinct components. These components usually include entropy loss due to
association, entropy gain of water due to binding of the ligand (hydrophobic
effect), entropy loss in the receptor and the ligand due to constraints of
internal degrees of freedom, interaction between the ligand and the
receptor, and changes in the conformational (internal) energy of the
molecules upon binding.
The basic assumption of most of the works on experimental or computational
determination of binding energy is that the different contributions to the
binding energy are independent and additive. Thus binding energy may be
written as a sum of its components[72]:
$$\Delta G_{bind} = \Delta G_{solvent} + \Delta G_{conf}^{receptor} + \Delta G_{conf}^{ligand} + \Delta G_{int} + \Delta G_{motion} \tag{1.11}$$
One should note that, based on the principles of independence and
additivity of energy components, many other variants of this equation may be
written. Furthermore, the same assumption of additivity and independence
allows the creation of statistical functions that approximate the binding
free energy without direct connection to the underlying physical and
thermodynamic processes.
1.4 Common energy components
Based on the equation (1.11), energy calculations are divided into distinct
components. In this section I will describe the most commonly used terms
of energy functions. This list is by no means complete, but rather serves as
a brief introduction.
Physically based potentials are mainly divided between bonding and non-
bonding expressions. Supplementary expressions for solvation or entropy loss
due to restricted rotations are sometimes added.
Non-bonding expressions
It is common to model pairwise interactions between atoms that are
separated by at least 4 covalent bonds in terms of electrostatic (Coulomb)
and Van der Waals interactions.
Coulomb potential. We use the Coulomb potential to estimate the enthalpy
contribution of any two charged particles to the overall potential energy:
$$E_{el} = \frac{Q_1 Q_2}{\varepsilon r} \tag{1.12}$$
Where $Q_1$ and $Q_2$ are the partial charges of the two particles, $r$ is
the distance separating them, and $\varepsilon$ is the dielectric constant
of the separating medium. In vacuum, $\varepsilon$ equals 1. Figure 1.2
shows a typical shape of the electrostatic potential of charged particles.
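The Coulomb term can be evaluated directly, as in the minimal sketch below. The function name and the unit handling are assumptions for illustration: charges are taken in units of the elementary charge, distances in ångströms, and the factor 332.0637 is the commonly used conversion that yields energies in kcal/mol.

```python
COULOMB_KCAL = 332.0637  # converts e^2/angstrom to kcal/mol (assumed unit system)


def coulomb_energy(q1, q2, r, dielectric=1.0):
    """Pairwise electrostatic energy Q1*Q2 / (epsilon * r) in kcal/mol."""
    if r <= 0:
        raise ValueError("distance must be positive")
    return COULOMB_KCAL * q1 * q2 / (dielectric * r)


if __name__ == "__main__":
    # Opposite unit charges 3 angstroms apart, in vacuum and in a medium
    # with a dielectric constant of 80 (a value often used for bulk water).
    print(coulomb_energy(1.0, -1.0, 3.0))
    print(coulomb_energy(1.0, -1.0, 3.0, dielectric=80.0))
```

The example shows how strongly the assumed dielectric constant scales the result: the same pair of charges interacts 80 times more weakly in the high-dielectric medium.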
Figure 1.2: Typical shapes of electrostatic interaction energy. The energies
of two identical (full line) and opposite (dashed line) charges in vacuum
are shown.
Hydrogen bonds. The hydrogen bond (H-bond) effect is closely related to
electrostatic interactions. This effect is caused by the interaction of
electronegative atoms with hydrogen connected to other electronegative
atoms. The nature of H-bonds allows charge transfer along the bond. The
strongest H-bond effect is achieved when the three interacting atoms
(hydrogen donor, hydrogen atom and hydrogen acceptor) and the mediating lone
electron pair lie on a single line. To account for this directionality, many
force fields contain explicit terms for the angle of the H-bond. For
example, the following is the H-bond component of the MM3 force field[58],
which demonstrates an explicit term for the angle between the interacting
atoms:
$$E_{HB} = \varepsilon_{\#} \left[ 1.84 \times 10^{5}\, e^{-12.0/P} - \frac{2.25\, P^{6}}{D} \left( \frac{l}{l_0} \right) \cos\theta \right] \tag{1.13}$$
Where $l$ and $l_0$ denote the actual and the reference H-bond lengths,
respectively, $\theta$ is the angle between the interacting atoms,
$\varepsilon_{\#}$ is the depth of the energy potential well, $P$ is the
ratio of the sum of the van der Waals radii of the atoms divided by the sum
of the effective interatomic distances between them and $D$ is the
dielectric constant. The dependence of energetics on the angular relations
of H-bonds plays an important role in the specificity of molecular
interactions. Figure 1.3 shows examples of typical inter- and
intra-molecular hydrogen bonds.
Figure 1.3: Examples of inter- (left) and intra- (right) molecular H-bonds
The majority of existing scoring functions do not include explicit terms
for hydrogen bonds[54], but rather rely on Van der Waals or electrostatic
interactions.
Van der Waals interactions. Van der Waals (VdW) forces account for both
attraction and repulsion of non-bonded atoms. Usually, the Van der Waals
enthalpy contribution of atoms is estimated using the Lennard-Jones (LJ)
potential:
$$E_{VdW} = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} 4\varepsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right] \tag{1.14}$$
Where $\varepsilon_{ij}$ is the depth of the potential well between the
atoms $i$ and $j$, $r_{ij}$ is the distance between the two atoms,
$\sigma_{ij}$ is the distance at which the inter-particle potential is zero
and $N$ is the number of atoms.
Equation (1.14) is sometimes referred to as the 6-12 LJ potential, as
opposed to the 4-10 potential, a smoother estimation with a lower repulsion
effect. Figure 1.4 presents the shape of the Van der Waals potential of two
identical atoms. Although equation (1.14) is the most frequently encountered
one, there are other ways to estimate Van der Waals energy (for example,
Hill's equation[38]).
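The double sum over atom pairs can be illustrated with the short sketch below, which uses the standard 12-6 form of the Lennard-Jones potential. It is a simplification under stated assumptions: the function names are hypothetical, and a single well depth and a single zero-crossing distance are used for all pairs instead of pair-specific parameters.

```python
import itertools
import math


def lennard_jones_pair(r, epsilon, sigma):
    """12-6 Lennard-Jones energy of one atom pair at distance r."""
    x = (sigma / r) ** 6
    return 4.0 * epsilon * (x * x - x)


def lennard_jones_total(coords, epsilon, sigma):
    """Sum of pair energies over all unique pairs i < j of 3D coordinates."""
    return sum(
        lennard_jones_pair(math.dist(a, b), epsilon, sigma)
        for a, b in itertools.combinations(coords, 2)
    )


if __name__ == "__main__":
    r_min = 2.0 ** (1.0 / 6.0)  # distance of the energy minimum for sigma = 1
    atoms = [(0.0, 0.0, 0.0), (r_min, 0.0, 0.0), (2.0 * r_min, 0.0, 0.0)]
    print(lennard_jones_total(atoms, epsilon=1.0, sigma=1.0))
```

In this form the pair energy is zero at a separation equal to sigma and reaches its minimum, equal to minus the well depth, at $2^{1/6}$ times sigma.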
Figure 1.4: Van der Waals interaction energy of argon dimer. Taken from the Wikipedia
[113] under the GNU Free Documentation License
Bonding expressions
The three most common terms that describe the contribution of bonding
interactions to the overall energy are bond stretching, angle bending and
bond rotation (torsion).
Bond stretching. One of the equations that describe the potential energy of
a covalent bond is:
$$E_{stretch} = D_e \left[ 1 - e^{-\alpha (r - r_0)} \right]^2 \tag{1.15}$$
In this equation (which is often referred to as the Morse equation), $D_e$
is the depth of the energy minimum, $r_0$ is the reference bond length, and
$\alpha = \omega \sqrt{\mu / 2D_e}$, where $\mu$ is the reduced mass and
$\omega$ is the bond vibration frequency.
To simplify the energy calculations, a harmonic potential is often applied
to bond stretching (Hooke's law). Although less accurate, the harmonic
potential is faster to calculate and is accurate enough near the bottom of
the potential well.
\[
E_{stretch} = \frac{1}{2} k (r - r_0)^2
\tag{1.16}
\]
Figure 1.5 presents the shapes of the Morse and Hooke potentials around the
minimum.
Figure 1.5: Comparison of Morse (dashed line) and Hooke's harmonic (full
line) potentials of bond stretching energy around the minimum. To construct
this graph, all the parameters in equations (1.15) and (1.16) were assigned
the value of 1.
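The relation between the two bond-stretching expressions can be checked numerically. The sketch below is illustrative only; the function names are hypothetical, and `alpha` stands for the coefficient in the exponent of equation (1.15).

```python
import math

def morse(r, d_e=1.0, alpha=1.0, r0=1.0):
    """Morse bond-stretching energy, eq. (1.15)."""
    return d_e * (1.0 - math.exp(-alpha * (r - r0))) ** 2

def harmonic(r, k=1.0, r0=1.0):
    """Harmonic (Hooke) bond-stretching energy, eq. (1.16)."""
    return 0.5 * k * (r - r0) ** 2

def matching_force_constant(d_e, alpha):
    """Force constant that gives the harmonic well the same curvature
    as the Morse well at r0 (second-order Taylor expansion)."""
    return 2.0 * d_e * alpha ** 2
```

Near the minimum the two curves agree when k = 2 D_e alpha^2. Far from it the harmonic energy keeps growing, while the Morse energy levels off at D_e, which corresponds to bond dissociation. If every parameter is simply set to 1, the harmonic well has half the curvature of the Morse well at the minimum.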
Angle bending The angle bending contribution to the potential energy may
be estimated using the following equation:
\[
E_{bending} = \frac{1}{2} (\theta - \theta_0)^2
  \left[ 1 - k_1 (\theta - \theta_0) - k_2 (\theta - \theta_0)^2
       - k_3 (\theta - \theta_0)^3 - \ldots \right]
\tag{1.17}
\]
where $\theta$ is the angle, $\theta_0$ is the reference angle and
$k_1, k_2, \ldots$ are force constants specific to the bonds that form the
angle. A good approximation of this general form is Hooke's harmonic
potential:
\[
E_{bending} = \frac{k}{2} (\theta - \theta_0)^2
\tag{1.18}
\]
Bond torsion One of the possible equations that describe the contribution
of torsions around chemical bonds is
\[
E_{torsion} = \sum_{n=0}^{N} C_n \cos^n(\phi)
\tag{1.19}
\]
where $C_n$ is a force constant, $\phi$ is the torsion angle, and $N$ the
number of rotating bonds. Although many force field terms of bond torsion
contain the above equation, more accurate estimations are sometimes needed.
On the other hand, many force fields do not contain explicit terms for
torsions[54]. In these cases non-bonding terms for Van der Waals and
electrostatic interactions are used to achieve the desired potential profile.
Entropy estimation and solvation terms
A solute molecule that leaves the solution in favor of a complex with another
molecule produces two main effects on the system's entropy. First, it changes
the micro-structure of the bulk water that surrounds the two solute molecules.
This change results in more water molecules that are capable of creating
hydrogen bonds among themselves. The second effect is the change in the
internal degrees of freedom.
Entropy change estimation is one of the most challenging problems in
computational research of biological systems. The reason for the complexity
of this task may be demonstrated by the Gibbs entropy formula:
\[
S = -k_B \sum_{i=1}^{N} p_i \log p_i
\tag{1.20}
\]
where $N$ is the number of possible discrete states of a system, and $p_i$
is the probability of a certain state. Evaluating equation (1.20) is
enormously complex. The large number of possible states of a system leads to
very small values of $p_i$, which in turn requires extensive sampling and may
lead to a large accumulation of errors. Several additional ways to exactly
evaluate the entropy exist, but they do not change the complex nature of the
calculations. For a review on entropy calculations in biological systems see
ref.[3].
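A small Python sketch of equation (1.20) shows where the sampling burden comes from. The helper names are hypothetical, and estimating state probabilities from raw visit counts is an assumption made for illustration only.

```python
import math
from collections import Counter

K_B = 1.380649e-23  # Boltzmann constant, J/K

def gibbs_entropy(probabilities, k_b=K_B):
    """Gibbs entropy, eq. (1.20); states with p = 0 contribute nothing."""
    return -k_b * sum(p * math.log(p) for p in probabilities if p > 0.0)

def entropy_from_samples(samples, k_b=K_B):
    """Estimate the entropy from a list of sampled (hashable) states.

    Assumption: probabilities are estimated as visit frequencies, so any
    state that was never sampled is silently treated as impossible.
    """
    counts = Counter(samples)
    n = len(samples)
    return gibbs_entropy([c / n for c in counts.values()], k_b)
```

For N equally likely states the formula reduces to k_B log N. The frequency-based estimate cannot see states that were never visited, so an under-sampled system yields an entropy that is too low.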
1.5 Force fields and scoring functions
During the process of docking, many conformations are searched. The program
needs to choose between the different conformations, thus each conformation
is given a numerical value which, in most cases, is supposed to represent its
relative stability. Computational functions that estimate the energy of the
system can be based on the principles of classical physics (force field based
functions). Another class of functions combines statistical physics equations
with many approximations that are based on known macro-structures. This class
of methods is often called approximate or knowledge based functions[82]. In
addition, purely statistical scoring functions exist. Such functions are
based on statistical analysis of various patterns, such as the distribution
of contacts between different types of atoms[69]. Another approach to
estimating the fitness of docking structures is to use shape complementarity.
1.5.1 Force field based energy functions
Force field based scoring functions are based on the equations that were
mentioned in Section 1.4. Two major energy functions of this kind are AMBER
[14] and CHARMM [65]. These functions differ in atom typing, in the
parameters for the various terms and in the basic equations that build them
up. The main equation of the AMBER force field reveals the complexity that is
common to all the energy functions in this class:
\[
\begin{aligned}
E_{total} = {} & \sum_{bonds} K_r (r - r_0)^2
  + \sum_{angles} K_\theta (\theta - \theta_0)^2 \\
 & + \sum_{dihedrals} \frac{V_n}{2} \left[ 1 + \cos(n\phi - phase) \right] \\
 & + \sum_{i<j} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}
     + \frac{q_i q_j}{r_{ij}} \right] \\
 & + \sum_{i<j} \left[ \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} \right]
\end{aligned}
\tag{1.21}
\]
In this equation, the last term is the estimation of hydrogen bond energy.
The rest of the terms have already been discussed. A review of CHARMM,
AMBER and other common force fields has been recently published[64].
Due to the complexity of force field based scoring functions, they impose a
relatively heavy computational load, which results in relatively low
calculation speed. Thus, in the case of the docking problem, the full forms
of these functions are mostly suitable for structure preparation before
docking or for post-docking processing.
1.5.2 Approximate energy functions
As stated before, one of the major drawbacks of force field based scoring
functions is their high computational cost, due to the large number of energy
terms and their complexity. Moreover, several terms, such as the solvation
effect, the contribution of flexibility to the overall system energy and
others, require sampling of multiple conformations in the solution space. To
overcome this obstacle, several knowledge based potentials have been
proposed. In this class of functions, the number of energy terms and the
number of supported atom or bond types are reduced. The general form of the
remaining terms resembles that of the force field based functions. The
parametrization is done using statistical analysis of known structures of
macromolecules. The structures are chosen according to the problem and may
include folded proteins, proteins bound to other proteins, small molecules,
DNA, etc. It is possible to calibrate the parameters using focused sets of
structures (target tailored functions). Studies show that such a strategy
improves the accuracy of scoring functions[11, 92]. Because the
parametrization of knowledge based scoring functions is done using known
macro-structures, they implicitly account for entropic effects such as
solvation and changes in internal degrees of freedom. Estimation of entropic
and solvation contributions to the overall binding affinity is usually done
using one or more of the following terms[109, 49, 70]: hydrophobic match,
solvent accessible surface (divided into atom types according to the extent
of hydrophobicity/hydrophilicity), and the number of internal degrees of
freedom (usually, the count of rotatable bonds). This support of entropic
terms is gained without costly computations.
On the other hand, the calibration process does not account for non-native
structures. This might lead to meaningless results when one attempts to
quantitatively evaluate poses that reside far away from an energy minimum.
Most existing docking programs (for example AutoDock [33, 71, 70],
FlexX [49], FlexE [12], Glide [23], GOLD [42] and others) use approximate
scoring functions. It is possible to compensate for the relative lack of
accuracy of this class of functions by re-scoring docking candidates, with or
without an additional simulation step (such as minimization or molecular
dynamics). This multistage approach was successfully adopted by several
research groups[8, 108]. For example, in one work[108], molecular dynamics
combined with MM-PBSA (molecular mechanics Poisson-Boltzmann/surface area)
was used to re-rank the solutions suggested by DOCK 4.0. In that work, a
conformation within 1.1 Å RMSD of an HIV-1 RT inhibitor was predicted before
the 3D structure was published.
1.5.3 Statistical potentials
Another approach that simplifies energy calculations even further is to use
purely statistical potentials. One such potential was proposed by Miyazawa
and Jernigan[69]. In that work, inter-residue contacts in folded proteins
were examined. It was found that residues are found near one another with
different propensities. These findings were used to compare proposed folded
structures in terms of probability. A similar statistical approach was also
used to analyze the distribution of various intermolecular contacts in
protein-protein[29] and protein-ligand[79] complexes. Similar to
semi-empirical scoring functions, statistical potentials account implicitly
for solvation and other entropic effects but, on the other hand, are of
limited validity when analyzing non-native structures.
Generally, statistical potentials provide high calculation speed which,
unfortunately, comes at the expense of accuracy. Preliminary tests during the
early stages of my research, using the probability tables provided in the
work of Glasser et al[29], generated unacceptable results. (These results are
neither shown nor discussed in this work.)
1.5.4 Geometric and chemical complementarity functions
When two molecules bind to each other, a certain degree of shape
complementarity has to exist[74, 43]. This notion serves as the rationale
behind shape complementarity or geometric complementarity scoring functions.
Geometric complementarity was the exclusive scoring scheme in many early
docking programs[74, 20, 25, 7]. Current scoring functions use additional
criteria in order to improve accuracy. For example, in the work by Bohacek
and McMartin[68], the accessible protein surface was divided into
hydrophobic, hydrogen-bond donating, or H-bond accepting zones. Other
criteria for assessing chemical complementarity are based on partial charges,
hydrophobicity/hydrophilicity, atom types, etc.
1.6 Energy funnels
As stated earlier, energy is a complex function that depends on an enormous
number of variables. The multidimensional hypersurface that describes the
energy as a function of all the relevant variables is known as the energy land-
scape. According to eq. (1.8), at equilibrium, any thermodynamic system is
supposed to reside in a minimum (local or global) of such a landscape; other-
wise, the system would spontaneously move until it reaches one. One should
note that most probably, multidimensional energy hyperspace contains many
local minima, as opposed to a single global one.
The existence of funnels in the energy landscape has been proposed for
protein folding[16, 56, 105, 100, 52] and has been further expanded to
protein-protein[99, 115] and protein-ligand recognition[109, 96]. It has been
suggested that the shape of the energy landscape correlates with the nature
of protein folding or of binding between the molecules.
The funnel-shaped energy landscape theory suggests that structures with a
single-minimum energy landscape may represent an extremely stable folded
structure or the lock and key binding mechanism. Several minima at the bottom
of the energy landscape with small barriers between them may be a result of
induced fit or nonspecific interactions. Finally, a rugged landscape with
multiple minima separated by relatively high energy barriers may indicate
domain swapping or the existence of multiple binding modes.
1.7 Multiple binding modes
As was presented previously, biomolecules are flexible and mobile entities.
Molecular thermal motion results in a reality that is dramatically different
from the static picture that is seen in structures solved with X-rays or even
in the multiple structures obtained by NMR. Although eq. (1.8) implies that
any system at thermodynamic equilibrium resides in a single energy minimum,
the real-life situation is quite different. The constant thermal motion and
ever-changing environmental conditions prevent thermodynamic equilibrium, and
energy barriers may rule out the transfer from one conformation to another,
potentially more stable one.
At a non-zero temperature, thermodynamic systems are able to occupy
non-minimal regions of the landscape according to the distribution:
\[
\frac{N_i}{N} = \frac{e^{-E_i/kT}}{\sum_j e^{-E_j/kT}}
\tag{1.22}
\]
In this equation (also known as the Boltzmann or Maxwell-Boltzmann
distribution), $N_i$ is the number of molecules, at equilibrium temperature
$T$, in a state $i$ that has energy $E_i$; $N$ is the total number of
molecules in the system and $k$ is the Boltzmann constant, which is replaced
by the universal gas constant ($R$) from eq. (1.8) when energies are
expressed per mole. If the energy barrier between two minima is low enough,
and the temperature is high enough, then the molecules in a system can
alternate between multiple states. If the differences between the binding
energies (i.e. $\Delta(\Delta G_{bind})$) of two or more conformations are
such that transformation of the system between them doesn't effectively
compensate for the separating energy barriers, these multiple conformations
may exist in the system simultaneously, presenting a phenomenon known as
multiple or alternative binding modes.
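To get a feeling for the numbers behind equation (1.22), the relative populations of a few states can be computed directly. The sketch below is an illustration under stated assumptions: energies are given per mole in kcal/mol, so the gas constant is used in place of the Boltzmann constant, and the function name is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)

def boltzmann_populations(energies, temperature=298.15, k=R_KCAL):
    """Fractions N_i/N of eq. (1.22) for a list of state energies.

    Energies are shifted by their minimum before exponentiation; this
    leaves the ratios unchanged and avoids numerical overflow.
    """
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / (k * temperature)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]
```

For example, `boltzmann_populations([0.0, 1.0])` gives roughly 84% and 16%: a binding mode that is only 1 kcal/mol less favorable is still substantially populated at room temperature, provided the barrier between the two modes can be crossed.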
A growing body of data supports the existence of multiple binding modes of
ligands to receptors. These may manifest in the form of a ligand that binds
the same (or a similar) protein in distinct modes or, alternatively, ligand
molecules that share structural similarity may be observed in different
binding modes when bound to the same protein[18, 9, 44, 91, 57]. It is clear
that individual conformations of multiple binding modes, if they exist, may
have a unique contribution to the binding energies or specificity. The
program presented in this work, ISE-dock, is capable of producing arbitrarily
large near-optimal populations of docking solutions, resulting in an
efficient sampling of the energy hyperspace and increasing the chances of
detecting alternative binding modes.
1.8 Docking techniques
1.8.1 Flexibility in docking programs
The structural and energy considerations that were presented above imply
that accounting for flexibility in docking programs is a necessary task.
Theoretically, accounting for molecule flexibility in a system that contains
N atoms will result in 3N degrees of freedom (3 degrees of freedom for
translating each atom). This number of degrees of freedom results in a
colossal rise in the computational complexity of docking calculations and
cannot be treated directly. In order to reduce the size of the solution
space, several approaches are taken, either alone or (more frequently) in
various combinations. These approaches include explicit flexibility of only
small parts of the system, soft potentials and low resolution docking, and
using multiple conformations.
Selective flexibility Among all the internal degrees of freedom that
theoretically exist in the system, only dihedral torsions are usually taken
into consideration. This is due to the substantially lower energy barriers of
this type of movement, compared to bond stretching and angle bending[54]. In
addition, internal flexibility is usually limited to certain portions of the
interacting molecules. Treating ligand flexibility alone, and keeping the
protein rigid, dramatically reduces the combinatorial complexity of a
protein-ligand docking program. This approach is very popular. In fact, most
of the modern protein-ligand docking programs are capable of dealing with
full ligand flexibility but not with the conformational changes of a
protein[95]. The rigidity of the protein is a reasonable approximation in
many cases, and it has led to several successes. Nevertheless, accounting for
receptor flexibility is a very important step toward improving the process of
docking[4, 46, 49, 12]. Najmanovich et al. have shown that in many cases only
a few side chains in the active site of a receptor change their conformations
during ligand binding[73]. In other cases, hinge-like movements of large
portions of the protein occur[89, 90], while the remaining parts of the
system stay relatively rigid. These findings allow the user to partially
unfreeze the protein, while keeping a feasible combinatorial size of the
problem. Version 4.0.1 of the program AutoDock takes this approach, by
allowing the user to specify the flexible parts of the receptor (side chains
only). The ISE-dock program that is presented in this work (and was developed
before the publication of AutoDock 4.0.1) takes a similar approach.
Hinge-based docking studies have also been reported[89, 90, 78].
Soft potentials Allowing partial inter-penetration of molecules by lowering
the repulsion penalties of VdW interactions is a way to implicitly account
for molecule flexibility in docking simulations. For example, in a work by
Ferrari et al.[19], a modified, softer, Lennard-Jones potential was used in
order to screen large libraries of molecules against T4 lysozyme, a protein
that undergoes small conformational changes when binding different ligands.
Yet another way to allow intermolecular penetration, and thus to handle
protein flexibility implicitly, is to use only the protein's $C_\alpha$ atoms
in the first stages of the docking (low resolution docking)[103, 102].
Multiple conformations Flexibility of the interacting molecules may be
simulated by using multiple structures. The ways to obtain these structures
include utilization of multiple X-ray structures, ensembles of structures
obtained by NMR techniques, and the results of molecular dynamics or other
simulations. Three major ways exist to use multiple molecular structures in
protein-ligand docking studies: separate docking of a ligand into each
individual protein structure[19, 94], identifying the conformational changes
and considering their combinations[12], and using energy functions that
consider an energy-weighted or geometry-weighted average of the multiple
structures[46, 70].
One of the important advantages of using multiple conformations is that,
unlike the rest of the previously mentioned methods, it easily allows the
movement of side chains and the backbone to be considered. In addition,
point mutations or even completely different proteins may be considered in
a single docking study.
1.8.2 Search algorithms
Docking algorithms can be roughly divided into two categories: those
exploring the energy landscape of the system and those (re)constructing the
ligands in the binding pockets of the macromolecule. The first class is
represented by various implementations and combinations of Simulated
Annealing (SA), Genetic Algorithm (GA), molecular dynamics, geometric
complementarity matching, etc. Examples of implementations of this approach
are Dock3.5 [53], AutoDock [33, 71, 70], and GOLD [42].
The second class of algorithms (represented by FlexX[49]) involves placing
one or more base fragments of the ligand into the binding pockets of the
protein and reconstructing the remainder of the molecule according to
predefined criteria. This approach is much faster and gives good results in
cases where the binding site has a deeply buried pocket with the ability to
make hydrogen bonds. However, if the binding pocket is shallow, or the main
contribution to the binding process is made by hydrophobic interactions, the
placement of the base fragments and further reconstruction of the ligand are
doubtful[107].
Genetic algorithms Genetic Algorithms (GAs) are a general-purpose family of
optimization techniques that mimic the process of evolution[45]. During the
optimization process, an instance of the problem is encoded using a linear
representation (chromosome). In the first step, multiple random
configurations (individuals) are generated, and a fitness function is
calculated for each of them. During the subsequent steps, several operators
may be applied to some of the individuals, such as point mutation in the
chromosome or cross-over exchange of the information encoded in the
chromosomes of two individuals. The fitness function is used to decide which
individuals are allowed to survive to the next iteration and to produce
offspring. GOLD (Genetic Optimization for Ligand Docking)[42] was the first
docking program to use GA. GOLD performs automated docking with full acyclic
ligand flexibility, partial cyclic ligand flexibility and partial protein
flexibility in the neighborhood of the binding site. The location of the site
must be provided by the user (with a possibility of using other software).
Another GA-based docking program, AutoDock [70], uses local search
techniques to modify the encoding chromosome, and to propagate the optimized
genetic information to the next generations. For a detailed description of
AutoDock and its algorithm, see Section 2.2 (page 37).
Monte Carlo simulated annealing Monte Carlo Simulated Annealing (SA)
techniques involve random alteration of the system that undergoes
optimization. If the change creates a conformation with a lower (better)
value of the scoring function, then the new structure is accepted for the
next steps. If, on the other hand, the energy increases, the new structure is
accepted with a temperature-dependent probability
$P = e^{(E_{t-1} - E_t)/(k_B T)}$, where $E_{t-1}$ and $E_t$ are the energy
values before and after the random change, $k_B$ is the Boltzmann constant,
and $T$ is the temperature. During the SA process, the temperature $T$ is
reduced according to a predefined scheme (cooling schedule), resulting in
less permissive acceptance criteria. The MCDOCK [1] program uses SA to solve
the docking problem. The conformations are generated using geometry-based
docking and then energy-based docking is performed.
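The acceptance rule and the cooling schedule described above can be sketched as follows. This is a generic illustration, not the MCDOCK or AutoDock implementation: the function names, the geometric cooling factor, the default parameters and the use of reduced units (k_B = 1) are all assumptions.

```python
import math
import random

def accept(e_before, e_after, temperature, k_b=1.0, rng=random):
    """Acceptance test: always accept an improvement, otherwise accept
    with probability exp((E_before - E_after) / (k_B * T))."""
    if e_after <= e_before:
        return True
    return rng.random() < math.exp((e_before - e_after) / (k_b * temperature))

def simulated_annealing(energy, perturb, state, t_start=10.0, cooling=0.95,
                        steps_per_t=50, t_min=1e-3, seed=0):
    """Minimize energy(state); perturb(state, rng) returns a new state."""
    rng = random.Random(seed)
    e = energy(state)
    best, e_best = state, e
    t = t_start
    while t > t_min:
        for _ in range(steps_per_t):
            candidate = perturb(state, rng)
            e_new = energy(candidate)
            if accept(e, e_new, t, rng=rng):
                state, e = candidate, e_new
                if e < e_best:
                    best, e_best = state, e
        t *= cooling  # assumed geometric cooling schedule
    return best, e_best
```

At high temperature almost every move is accepted, which lets the search cross energy barriers. As the temperature drops, uphill moves become increasingly rare and the search settles into a minimum.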
FlexX FlexX is an incremental docking program[85]. It docks flexible ligands
into the binding pockets of a rigid receptor. FlexX involves three steps:
selection of base fragments in the ligand molecule, placement of these
fragments in the active site of the receptor, and incremental reconstruction
of the whole ligand. The reconstruction is made fragment by fragment so that
the energy of the complex is locally minimal. For better sampling, the
algorithm is allowed to diverge to various energetically favorable regions.
This algorithm saves only a limited number of the best scoring partial
solutions to continue to the next round of ligand reconstruction. Since the
greedy algorithm selects only the best partial solutions to continue to the
next round of ligand reconstruction, flexible docking is likely to be more
demanding on the quality of the scoring function used to evaluate (partial)
docking solutions[47]. FlexX-Ensemble (formerly known as FlexE)[12]
introduces a new feature to the FlexX algorithm. FlexX-Ensemble takes into
account flexibility of the receptor by using a predefined ensemble of
receptor conformations. The ensemble may be derived from multiple X-ray
structures or homology modeling, or generated by molecular dynamics
simulations. The protein is dissected into a constant (rigid) part and
several flexible parts. The flexible parts may be combined to create
conformations that are not observed in the original ensemble of structures.
Internal Coordinate Mechanics Internal Coordinate Mechanics (ICM) performs
global optimization of a flexible ligand in the receptor field[2]. This
algorithm is based on a large number of random moves with gradient local
minimization. A history mechanism is used to escape local minima.
Computer vision techniques Image recognition techniques were described in a
review by Nussinov and Wolfson[77]. These methods were applied to rigid and
flexible docking. While the shape complementarity of molecules that were
crystallized together (bound docking) is expected to be good, the docking of
unbound molecules is less trivial.
1.8.3 Evaluating docking programs
Comparing docking programs is not a trivial task. Many criteria to perform
this task have been proposed and used in the literature[34, 13, 110, 44, 25].
The most common criterion to assess the correctness of a docked com-
plex, compared to the experimentally determined structure is to compare
the Cartesian coordinates of the solution and the reference structure. This
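comparison is reported as the Root Mean Squared Deviation (RMSD) of atoms:
\[
RMSD = \sqrt{ \frac{ \sum_{i,j}^{N} \left[ (\Delta x_{ij})^2
  + (\Delta y_{ij})^2 + (\Delta z_{ij})^2 \right] }{N} }
\tag{1.23}
\]
where $\Delta x_{ij}$, $\Delta y_{ij}$ and $\Delta z_{ij}$ are the coordinate
differences between atom $i$ of the docked pose and its counterpart $j$ in
the reference structure, and $N$ is the number of compared atoms.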
Lack of specificity, inability to differentiate between more and less
important regions in a complex, and the need for a reference structure are
several pitfalls of this measure[13]. Nevertheless, RMSD is the measure of
choice in the vast majority of docking techniques. It is widely accepted to
treat solutions with RMSD values below 2.0 Å as successful ones. Other
methods of evaluation include modified deviation functions[1] and accounting
for correct positioning of intermolecular contacts[50]. An additional
approach is to screen a large library of compounds of which only a few are
known to bind efficiently to a molecular target. In this type of test, the
enrichment factor of correctly recognized binding molecules is checked[84,
24, 26].
Theoretically, only the lowest (best) scored docking pose of a ligand needs
to be examined. But there are several factors that require treating multiple
docking solutions. Among them are the mobility of the molecules, the
inaccuracy of scoring functions, and the fact that molecules are not always
found in their global minimum. For these reasons, it is customary to check
the best available deviation from the experimentally known structure among
several solutions that were provided by a docking program. The comparison of
multiple docking solutions is thought to reduce the dependence of the results
on a scoring function, and to better reflect the ability of the docking
algorithm[34].
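The RMSD criterion and the "best of several solutions" evaluation can be sketched as below. The function names are hypothetical. The sketch assumes that the atoms of every pose are listed in the same order as in the reference structure and that both share one coordinate frame, so no superposition or symmetry correction is applied.

```python
import math

def rmsd(pose, reference):
    """Root mean squared deviation between two equally ordered lists of
    (x, y, z) atom coordinates, as in eq. (1.23)."""
    if len(pose) != len(reference):
        raise ValueError("pose and reference must contain the same atoms")
    squared = sum((a - b) ** 2
                  for p, q in zip(pose, reference)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(pose))

def best_rmsd(poses, reference, cutoff=2.0):
    """Lowest RMSD among several docking solutions, and whether it falls
    below the customary success cutoff (in the same units, e.g. angstrom)."""
    best = min(rmsd(p, reference) for p in poses)
    return best, best < cutoff
```

Reporting the best RMSD among several solutions, rather than the RMSD of the top-scored pose only, separates the question of whether the search found a native-like pose from the question of whether the scoring function ranked it first.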
1.9 Open problems and issues
Protein-ligand docking is a valuable tool in the processes of drug discovery
and lead optimization and during the basic study of intermolecular inter-
actions. Protein-ligand docking was successfully used in a wide range of
problems, but despite the plethora of existing solutions, the docking problem
is far from being solved. Many of the existing programs tend to converge
around a certain local minimum or not to converge at all. Sometimes, bio-
logically irrelevant solutions are produced.
Most of the existing programs are capable of proposing multiple docking
poses, but the time needed by many of them to do so increases with the size
of the output population.
Protein flexibility is another and, perhaps, the most difficult and urgent
challenge in the protein-ligand docking field. Other degrees of freedom that
are very important, but are hardly addressed by the existing docking
programs, are the positions of mediating water molecules[85], co-factor
positioning, electron transfer and protonation states of the interacting
molecules.
Any docking program is tightly connected to at least one scoring function.
Although the development of a scoring function is beyond the scope of this
work, one should remember that the choice of such a function has a direct
impact on the docking program performance.
Chapter 2
Methods
2.1 Energy function
An ideal scoring function in a protein-ligand docking program would combine
speed and the ability to distinguish quantitatively between native and non-
native poses. Developing a scoring function is beyond the scope of this work.
ISE-dock uses AutoDock's grid-based scoring function[70]. AutoGrid, a part
of the AutoDock suite, pre-calculates grids of Van der Waals, electrostatic,
and solvation interactions of a biomolecular target, based on atom types.
Following are the terms that construct the scoring function used in AutoDock
and, subsequently, in ISE-dock:
\[
\begin{aligned}
\Delta G = {} & \Delta G_{VdW} \sum_{i,j}
    \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right) \\
 & + \Delta G_{hbond} \sum_{i,j} E(t)
    \left( \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}}
         + E_{hbond} \right) \\
 & + \Delta G_{elect} \sum_{i,j}
    \frac{q_i q_j}{\varepsilon(r_{ij}) r_{ij}} \\
 & + \Delta G_{sol} \sum_{i_C, j} S_i V_j e^{(-r_{ij}^2 / 2\sigma^2)} \\
 & + \Delta G_{tor} N_{tor}
\end{aligned}
\tag{2.1}
\]
The five $\Delta G$ coefficients in this equation are empirically determined
using linear regression analysis, correlating a set of 30 protein-ligand
complexes with known binding constants and solved 3D structures. The first
and the third terms of the above equation are standard expressions for VdW
and electrostatic interactions, respectively. In the second (H-bond) term,
$E(t)$ is a directional weight based on the H-bond angle $t$, and
$E_{hbond}$ is the estimated average energy of hydrogen bonding between water
molecules and a polar atom. The unfavorable entropy effect of ligand binding
(the fifth term) is a function of the number of sp$^3$ bonds, $N_{tor}$. The
solvation term of eq. (2.1) considers fragmental volumes of only carbon atoms
in the ligand ($i$) and all atom types in the receptor ($j$). Parametrization
of the carbon atoms distinguishes between aliphatic and aromatic atom types.
The constant coefficients in equation (2.1) ($A_{ij}$, $B_{ij}$, $C_{ij}$ and
$D_{ij}$) are specific for each pair of atom types.
During the docking process, the program evaluates any position of the ligand
by interpolating over those grids for the protein-ligand interaction of each
atom of the ligand according to its current position and adding the internal
conformational energy of the ligand. By default, the docking box has the
dimensions 22.5 Å × 22.5 Å × 22.5 Å with a resolution of 0.375 Å between
grid points. The version of AutoDock used in this work (3.0.5) supports eight
atom types: C (aliphatic carbon), A (aromatic carbon), N (nitrogen),
O (oxygen), S (sulfur), H (hydrogen), X and M (spare types for additional
atoms such as metal, halogen, phosphorus etc). It is customary[106, 98, 17]
to replace the original AutoDock parameters for zinc. We used the following
parameters, which lead to more accurate energy calculations[39]: radius:
0.87 Å; well depth: 0.35 kcal/mol; and charge: +0.95e.
The use of grid-based scoring functions has two important consequences:
first, the simulation speed is increased significantly and second, it implies
that there can be no variation in the protein structure during the docking
process.
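The grid lookup described above can be illustrated with a short sketch. It assumes trilinear interpolation between the eight grid points that surround an atom, and a hypothetical data layout (one nested list `grid[ix][iy][iz]` per atom type). It is not the AutoDock source code, and the internal energy of the ligand is left out.

```python
import math

def trilinear(grid, origin, spacing, point):
    """Interpolate a precomputed energy grid at an arbitrary point.

    grid[ix][iy][iz] holds the energy at origin + spacing * (ix, iy, iz).
    """
    frac = [(p - o) / spacing for p, o in zip(point, origin)]
    base = [int(math.floor(f)) for f in frac]
    sizes = (len(grid), len(grid[0]), len(grid[0][0]))
    if any(b < 0 or b + 1 >= n for b, n in zip(base, sizes)):
        raise ValueError("point lies outside the docking box")
    dx, dy, dz = (f - b for f, b in zip(frac, base))
    ix, iy, iz = base
    energy = 0.0
    for cx in (0, 1):
        for cy in (0, 1):
            for cz in (0, 1):
                weight = ((dx if cx else 1.0 - dx)
                          * (dy if cy else 1.0 - dy)
                          * (dz if cz else 1.0 - dz))
                energy += weight * grid[ix + cx][iy + cy][iz + cz]
    return energy

def intermolecular_energy(atoms, grids, origin, spacing):
    """Sum grid energies over ligand atoms given as (atom_type, xyz);
    grids maps each atom type to its own precomputed grid."""
    return sum(trilinear(grids[t], origin, spacing, xyz) for t, xyz in atoms)
```

Each evaluation costs eight table lookups per ligand atom, regardless of the size of the protein. This is the source of the speed-up, and it also shows why the protein must stay fixed: moving any receptor atom would invalidate the precomputed grids.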
2.2 AutoDock docking program
The AutoDock program[32, 33, 70] served as a source code base and a
reference point for ISE-dock performance. The source code of AutoDock
(version 3.0.5) was obtained from the authors. AutoDock performs flexible
ligand, rigid protein docking using one of the following algorithms:
Simulated Annealing (SA), Genetic Algorithm (GA) and Lamarckian Genetic
Algorithm (LGA). LGA is a hybrid optimization algorithm that derives from
GA and has been shown[33] to give the best performance out of the three
available ones. AutoDock was the most cited docking program in the
scientific literature during the years 2001-2005[95].
2.2.1 Lamarckian Genetic Algorithm
Genetic Algorithm (GA) is a general type of optimization algorithms, and it
exists in several variants. The version of GA that is used in AutoDock is
described as Algorithm 2.1.
Algorithm 2.1 Genetic Algorithm used in AutoDock [70]
Require: string representation of a problem (chromosome)
 1: create random population P
 2: repeat
 3:   mate random pairs of individuals (crossover)
 4:   perform random mutations
 5:   for all i ∈ P do
 6:     evaluate i
 7:   end for
 8:   sort P according to the scoring function
 9:   select best individuals to survive to the next iteration
10: until stopping criteria are met
11: return best individual
In order to be optimized by GA, an instance of a problem is encoded into
a flat string (chromosome), which may be subjected to several GA operators
and then scored. The GA operators include crossover of two chromosomes
and point mutations. These operators (lines 3 and 4 in Algorithm 2.1) are
applied randomly with user-defined probability. The optimization terminates
if no improvement in the scoring function is achieved over a number of
generations or after a specified number of generations.
The LGA differs from the canonical GA by an additional step of local
optimization. The addition of local optimization ensures that an acquired
adaptation of an individual promotes changes in its chromosome that in
turn pass to the next generations.
The LGA is described using pseudocode as Algorithm 2.2.
Algorithm 2.2 Lamarckian Genetic Algorithm used in AutoDock [70]
Require: string representation of a problem (chromosome)
 1: create random population P
 2: repeat
 3:   mate random pairs of individuals (crossover)
 4:   perform random mutations
 5:
 6:   select sub population S ⊂ P to undergo local optimization
 7:   for all individual i ∈ S do
 8:     perform local optimization of i
 9:     modify i's chromosome to reflect the optimized state
10:   end for
11:
12:   for all i ∈ P do
13:     evaluate i
14:   end for
15:   sort P according to the scoring function
16:   select best individuals to survive to the next iteration
17: until stopping criteria are met
18: return best individual
Here, an instance is encoded in a chromosome (genotype), which in turn is
translated to a phenotype. At the initial stage, the phenotype is identical to
the genotype. As in the basic GA, several operators are applied randomly to
the population with predefined probabilities (Algorithm 2.2, lines 3 and
4). At the next stage, several individuals are randomly selected (Algorithm
2.2, line 6). These individuals undergo local optimization of the phenotype.
The optimized phenotype is translated back to a new genotype, which then
propagates to the next generations.
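The Lamarckian write-back can be illustrated with a minimal Python sketch. It is not the AutoDock implementation: the function name, the real-valued chromosome, the one-point crossover, the Gaussian mutation and all default parameters are assumptions chosen for illustration, and `local_opt` stands for any local optimizer that returns an improved chromosome.

```python
import random

def lamarckian_ga(score, local_opt, n_genes, pop_size=20, generations=50,
                  p_mut=0.1, p_local=0.06, seed=0):
    """Minimise `score` over real-valued chromosomes (lists of floats).

    All names and defaults here are hypothetical illustration values.
    """
    rng = random.Random(seed)
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(pop, 2)               # mate a random pair
            cut = rng.randrange(n_genes + 1)        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g + rng.gauss(0.0, 0.5) if rng.random() < p_mut else g
                     for g in child]                # point mutations
            if rng.random() < p_local:
                # Lamarckian step: the locally optimized state is written
                # back into the chromosome and can therefore be inherited.
                child = local_opt(child)
            children.append(child)
        # the best individuals among parents and children survive
        pop = sorted(pop + children, key=score)[:pop_size]
    return min(pop, key=score)
```

Because the survivors are chosen from parents and children together, the best score in this sketch never gets worse from one generation to the next.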
AutoDock uses Solis-Wets [93] local optimization. The Solis-Wets (SW) local
optimization algorithm is a greedy local search heuristic. During
the SW search, random moves along all the axes of the solution space are
performed until an improvement is found. The variance of the random moves
is influenced by the frequency with which improving moves are found. The
pseudocode of SW is provided as Algorithm 2.3.
Algorithm 2.3 Pseudocode of Solis-Wets local search algorithm
initialize variance
numberOfSuccesses = numberOfFailures = 0
repeat
  perform random move using variance
  calculate currentEnergy
  if currentEnergy < previousEnergy then
    numberOfFailures = 0
    increase numberOfSuccesses
    if numberOfSuccesses > threshold then
      numberOfSuccesses = 0
      expand variance
    end if
  else
    numberOfSuccesses = 0
    increase numberOfFailures
    if numberOfFailures > threshold then
      numberOfFailures = 0
      contract variance
    end if
  end if
until stopping criteria are met
There are no robust stopping rules for the SW algorithm, since convergence
is not guaranteed. In AutoDock, the SW search step stops when the
variance of the random moves drops below a threshold or when a specified
number of optimization steps is reached. The original SW algorithm uses
random steps with equal variances for each degree of freedom. In an attempt
to improve the optimization results, AutoDock allows a separate variance
of the random moves to be kept for each degree of freedom. This variant
of the SW method is referred to as Pseudo Solis-Wets (PSW) local search.
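The adaptive-variance search of Algorithm 2.3 can be sketched in Python as follows. This is a simplified illustration, not AutoDock's code: the function name, the Gaussian moves, the expansion and contraction factors and the default thresholds are all assumptions, and a single variance is shared by all degrees of freedom (the original SW variant, not PSW).

```python
import random

def solis_wets(energy, x, variance=1.0, max_steps=300, min_variance=1e-3,
               threshold=4, expand=2.0, contract=0.5, seed=0):
    """Greedy random local search with an adaptive step variance.

    Parameter names and defaults are hypothetical illustration values.
    """
    rng = random.Random(seed)
    best = list(x)
    best_e = energy(best)
    successes = failures = 0
    for _ in range(max_steps):            # stop after a fixed number of steps
        if variance < min_variance:       # or when the variance has collapsed
            break
        cand = [xi + rng.gauss(0.0, variance ** 0.5) for xi in best]
        e = energy(cand)
        if e < best_e:                    # improving move: accept it
            best, best_e = cand, e
            failures = 0
            successes += 1
            if successes > threshold:
                successes = 0
                variance *= expand
        else:                             # non-improving move: reject it
            successes = 0
            failures += 1
            if failures > threshold:
                failures = 0
                variance *= contract
    return best, best_e
```

Keeping one variance per degree of freedom, as PSW does, would amount to replacing the scalar `variance` with a list of the same length as `x`.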
2.2.2 Problem representation
In AutoDock, the ligand's pose relative to the protein and its internal
conformation are encoded using a vector of real values. The first three values
in the vector define the translation of the ligand. The rotation is encoded
using quaternion notation. This notation represents rigid body rotation using
a unit vector (represented by three numbers) and an angle of rotation around
this vector. Thus, the three degrees of freedom of rigid body rotation are
represented by four values in AutoDock. The rest of the
values in the chromosome vector represent the dihedral angles of the ligand's
rotatable bonds.
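The layout of such an encoding vector can be made concrete with a small Python sketch. The function name and the dictionary keys are hypothetical; only the ordering (three translation values, four rotation values, then one value per rotatable bond) follows the description above.

```python
import math

def decode_chromosome(vec):
    """Split a flat chromosome into translation, rotation and torsions."""
    if len(vec) < 7:
        raise ValueError("need 3 translation + 4 rotation values")
    tx, ty, tz, ax, ay, az, angle = vec[:7]
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    if norm == 0.0:
        raise ValueError("rotation axis must be non-zero")
    return {
        "translation": (tx, ty, tz),
        # the rotation axis is stored as a unit vector
        "axis": (ax / norm, ay / norm, az / norm),
        "angle": angle,
        # one dihedral angle per rotatable bond of the ligand
        "torsions": list(vec[7:]),
    }
```

A ligand with two rotatable bonds would thus be described by a vector of nine real values.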
Since AutoDock uses a grid-based energy function, the receptor is
represented by a set of pre-calculated grid maps.
During the docking process, the encoding vector (genotype) is generated
and translated to molecule coordinates (phenotype). A local search
is applied randomly with a user-defined probability. The applied local search
changes the coordinates. This change is translated to the respective changes
of the genotype.
All the algorithms used by AutoDock result in a single optimized
population. Frequently, more than one solution is desired (as explained in
Chapter 1). In this case, the program may be configured to perform the docking
procedure several times. The total time needed to produce multiple docking
solutions is proportional to the number of desired structures.
2.3 ISE-dock program
The ISE-dock program was implemented as a set of added and modified
classes in the AutoDock source code, and uses its energy function. As in the
original AutoDock application, molecular flexibility is treated by allowing
changes in dihedral angles. Our representation of rigid body rotation
differs slightly from the one that is implemented in the original program:
while AutoDock encodes rotations using quaternions, ISE-dock uses
subsequent rotations around the X, Y and Z axes, as this option provided better
results in preliminary experiments with ISE. ISE simulations produce large
populations of docking poses, which is one of the standard results of applying
this algorithm to any problem. The number of energy evaluations performed
by the program is not affected by the size of the docking population. Since
energy evaluations account for more than 85% of the CPU time, the time
needed to complete the docking is practically independent of the number of
docking solutions that the program produces. The number of stored poses
was limited to 4096 ($2^{12}$), a limit dictated mainly by the available
space on our hard disks.
2.3.1 Iterative Stochastic Elimination algorithm
The Iterative Stochastic Elimination algorithm (ISE) is described as
pseudocode in Algorithm 2.4.
This is a general optimization algorithm that can be applied to any problem
described by independent variables and a set of discrete values for each
variable. In the case of ISE-dock, the variables are: translation (3
variables), rotation (3 variables) and bond torsion angles (one for each
rotatable bond). The algorithm begins by constructing a matrix that contains,
for each degree of freedom, a set of all possible values. This matrix is
referred to as the possibilities pool. Two terms are required with respect to
the possibilities pool: problem size (PS) and pool depth (PD). Problem size is
defined as the total number of all possible combinations that can be generated
from the pool. Pool depth is the maximum number of remaining values among all
the variables that define the problem.
$$PS = \prod_{i=1}^{N_{\mathrm{variables}}} n_i \qquad (2.2)$$
$$PD = \max_{1 \le i \le N_{\mathrm{variables}}} n_i \qquad (2.3)$$
where $N_{\mathrm{variables}}$ is the number of variables and $n_i$ is the
number of possible values for the $i$-th variable.
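As a small worked illustration of the two quantities, the following Python sketch computes PS and PD for a toy possibilities pool. The dictionary representation of the pool and the variable names are assumptions made for this example only.

```python
import math

def problem_size(pool):
    """PS: number of combinations that can be generated from the pool."""
    return math.prod(len(values) for values in pool.values())

def pool_depth(pool):
    """PD: largest number of remaining values among all variables."""
    return max(len(values) for values in pool.values())

# Hypothetical toy pool: one translation, one rotation and one torsion variable.
toy_pool = {
    "tx": [0.0, 0.5, 1.0],
    "rz": [0.0, 90.0],
    "torsion_1": [0.0, 60.0, 120.0, 180.0],
}
```

For `toy_pool`, the problem size is 3 × 2 × 4 = 24 and the pool depth is 4.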
During the first phase (referred to as the elimination phase), a large number
of conformations is generated. Each conformation is generated by randomly
picking, for every variable, a single value from the pool and assigning it to
that variable.
Algorithm 2.4 Iterative Stochastic Elimination Algorithm
Require: problem represented as a set of variables and possible discrete values
1: generate pool
2: initialize population
3: while size(pool) > threshold do
4:   generate sample S of s random configurations
5:   for all i ∈ S do
6:     perform local optimization with probability P
7:     score_i = evaluate(i)
8:     if (size(population) < outputSize) or (score_i < score_max) then
9:       add i to population
10:      truncate population to outputSize
11:    end if
12:  end for
13:  sort S
14:  L = low energy part of S
15:  H = high energy part of S
16:  for all variable var ∈ pool do
17:    for all value val ∈ pool_var do
18:      observedLow = number of occurrences of (var, val) in L
19:      ratio = expectedLow/observedLow
20:      if ratio > threshold then
21:        rank = ratio/threshold
22:        mark pool_(var,val) for elimination with rank
23:      end if
24:      observedHigh = number of occurrences of (var, val) in H
25:      ratio = observedHigh/expectedHigh
26:      if ratio > threshold then
27:        rank = threshold/ratio
28:        mark pair (var, val) for elimination with rank
29:      end if
30:    end for
31:    eliminate up to e% values with highest rank from pool_var
32:  end for
33: end while
34:
35: perform exhaustive search of pool, add best scored configurations to population
36: return population
The randomly generated conformations have a certain probability (0.06
by default) to undergo local optimization. The main purpose of the local
optimization step is to resolve clashes and unfavorable conformations that are
caused by the discrete nature of the variable values (translation, rotation
and torsions). Unlike local optimization by the Lamarckian Genetic Algorithm
in AutoDock, local optimization in ISE-dock does not affect the variables in
the possibilities matrix, only the energy values that are associated with them.
The sample is evaluated and sorted. The sorted sample is divided into three
uneven parts: subsets of lowest, highest and intermediate energy
conformations. The intermediate subset is not used in the analysis. A
particular value of a variable may be discarded from the pool of values if one
of the two following criteria is met. The first criterion is the occurrence of
a value in the higher energy subset with a significantly higher frequency than
is expected under the random distribution assumption. Alternatively, a value
may be eliminated if it appears in the lower energy subset with lower than
random frequency. No more than a user-specified portion of the values may
be discarded at each iteration (the default value is 10%).
The elimination process is performed iteratively until the number of possible
conformations enables an exhaustive search in a feasible time. During the
exhaustive phase, the solution candidates have a probability of P = 0.6
(default value) to undergo local optimization. Note that the local
optimization probability in this exhaustive phase is ten times larger than the
probability of local optimization during the elimination phase. During the
whole process, a list of the best seen conformations is kept and updated.
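A single elimination pass can be sketched in Python as shown below. This is an illustrative simplification and not the ISE-dock code: local optimization is omitted, the low- and high-energy subsets have equal size, the ranking rule (the larger of the two frequency ratios), the smoothing of zero counts and all default values except the 10% elimination limit are assumptions.

```python
import random
from collections import Counter

def elimination_step(pool, energy, sample_size=2000, subset_frac=0.1,
                     threshold=1.5, max_elim=0.10, rng=None):
    """One elimination pass over `pool` (dict: variable -> list of values)."""
    rng = rng or random.Random(0)
    names = list(pool)
    # random configurations: one value per variable, picked from the pool
    sample = [{v: rng.choice(pool[v]) for v in names}
              for _ in range(sample_size)]
    sample.sort(key=energy)
    k = max(1, int(subset_frac * sample_size))
    low, high = sample[:k], sample[-k:]     # intermediate part is ignored
    new_pool = {}
    for v in names:
        expected = k / len(pool[v])         # count expected by chance
        low_counts = Counter(c[v] for c in low)
        high_counts = Counter(c[v] for c in high)
        ranks = {}
        for val in pool[v]:
            over = high_counts[val] / expected                # too frequent in H
            under = expected / max(low_counts[val], 0.5)      # too rare in L
            rank = max(over, under)         # assumed ranking rule
            if rank > threshold:
                ranks[val] = rank
        # discard at most `max_elim` of the values, never the whole variable
        n_max = min(int(max_elim * len(pool[v])), len(pool[v]) - 1)
        doomed = set(sorted(ranks, key=ranks.get, reverse=True)[:n_max])
        new_pool[v] = [val for val in pool[v] if val not in doomed]
    return new_pool
```

Calling this function repeatedly until the problem size of the returned pool is small enough, and then enumerating the remaining combinations, reproduces the overall two-phase structure described above.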
The local optimization steps, the limit on discarded values per iteration
and the fact that the best seen conformations are collected during the
elimination phase are new to this implementation of ISE and were not present in
previously published ones [30, 31].
The sample size and the sizes of the lower- and higher-energy subsets depend
on the current pool depth (eq. 2.3) and are user configurable, as is the
required ratio between the expected and the observed occurrences of a value
(Algorithm 2.4, lines 20 and 26). The maximal fraction of eliminated values
for each variable and the probability of local search during the elimination
and exhaustive phases are also determined by the user.
2.3.2 Problem representation
As in AutoDock, the ligand's configuration is encoded by real values that
define its position in space (translation), its orientation (rotations about
axes), and the internal rotations around single bonds. Unlike in AutoDock,
we have decided to use three degrees of freedom to describe the rotations of
the ligand around the principal axes. In our implementation, the rotation is
defined by sequential rotations of the molecule around the X, Y and Z axes
(in this order).
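A minimal Python sketch of such a three-angle rotation is given below. It assumes rotations about the fixed coordinate axes passing through the origin, with angles in radians; the function name is hypothetical and the actual conventions of ISE-dock (rotation center, units, sign) may differ.

```python
import math

def rotate_xyz(point, rx, ry, rz):
    """Rotate `point` about the fixed X, then Y, then Z axis (radians)."""
    x, y, z = point
    # rotation about X
    y, z = (y * math.cos(rx) - z * math.sin(rx),
            y * math.sin(rx) + z * math.cos(rx))
    # rotation about Y
    x, z = (x * math.cos(ry) + z * math.sin(ry),
            -x * math.sin(ry) + z * math.cos(ry))
    # rotation about Z
    x, y = (x * math.cos(rz) - y * math.sin(rz),
            x * math.sin(rz) + y * math.cos(rz))
    return (x, y, z)
```

Because rotations do not commute, the order matters: rotating the point (0, 1, 0) by 90° about X and then by 90° about Y gives (1, 0, 0), whereas the reverse order leaves it at (0, 0, 1).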
2.3.3 Protein flexibility
Accounting for protein flexibility is a very important task, which, until
recently, was ignored by the majority of protein-ligand docking programs [34,
97]. Proper inclusion of flexibility (as a set of rotations around side-chain
and main-chain bonds) requires extensive changes to the current source code
of ISE-dock and is thus beyond the scope of this work. Nevertheless, before
further work is done, it is important to assess the ability of ISE to cope with
this problem. The docking experiments accounting for protein flexibility that
are presented in this work serve as a demonstration of ISE capabilities.
Figure 2.1: "Tearing off" atoms to represent side chain flexibility using
phenylalanine as an example. Dummy atoms are marked by the letter D in their
names. The N, Cα and Cβ atoms on the receptor molecule overlap with their
respective dummy counterparts.
Side chain flexibility in ISE-dock
As was previously described in Section 2.1 (page 35), the grid-based energy
function implies that the entire protein remains frozen during the docking
simulation. To overcome this limitation, I have decided to transfer selected
atoms from the protein to the ligand, as the ligand may be treated as
flexible. This is a technical choice that works around the limitation of the
original program. Figure 2.1 describes the process. First, a set of flexible
residues is identified using previous knowledge. Then, for each flexible
residue, all the side-chain atoms, except for Cα, Cβ and the adjacent hydrogen
atoms, are deleted. The resulting structure (the constant part of the protein)
is used in all further calculations, mainly for calculating the interactions
on the grids. In order to include the flexible part of the receptor in the
docking calculations, the original coordinates of the side chains of the
flexible residues are copied to the ligand molecule. In addition, the backbone
nitrogen atom is also copied. A dummy bond connects the residue's N atom and
an atom from the ligand. We now have three atoms that are common to the
ligand and the protein. These overlapping atoms serve as reference points
in defining the side chain torsions: the atoms N, Cα, Cβ and Cγ define χ1;
Cα, Cβ, Cγ and Cδ define χ2, and so on. In order to prevent a clash penalty
due to the overlap, the common atoms on the ligand's side are marked
as dummy atoms. Dummy atoms are ignored during energy calculations.
All the atoms that originally belonged to the receptor molecule are excluded
from the operations of translation and rotation; thus, only their dihedral
angles change during the ISE search.
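For reference, a side chain torsion such as χ1 is the dihedral angle defined by four consecutive atoms (N, Cα, Cβ and Cγ in the standard convention). The following Python sketch computes a signed dihedral angle from four points; the function name is hypothetical and the code is a generic geometric routine, not part of ISE-dock.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points in space."""
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1 = cross(b0, b1)              # normal of the plane p0-p1-p2
    n2 = cross(b1, b2)              # normal of the plane p1-p2-p3
    y = dot(b1, cross(n1, n2))
    x = math.sqrt(dot(b1, b1)) * dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

With this routine, χ1 of a residue would be `dihedral(N, CA, CB, CG)`, where the four arguments are the Cartesian coordinates of the respective atoms.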
The transfer of atoms from the receptor to the ligand breaks a covalent
bond between Cβ and Cγ. After the transfer, Cγ is considered a part
of another molecule. This means that Cβ-Cγ interactions are interpreted
as intermolecular ones. Nevertheless, the distance between the two atoms
remains the distance of a covalent C-C bond. To prevent the large energy
penalty that would have been caused by this misinterpretation, the Cγ atom is
also marked as dummy. This measure means that the Cγ atom is not included in
any energy calculation. The transfer of atoms and the exclusion of the Cγ
atoms from energy calculations inevitably lead to some loss of accuracy. To
test the validity of the "tearing off" approach, I have docked only the side
chains, with the ligand molecule fixed in its crystallographic position. In
these experiments (data not shown), the RMSD of the side chain atoms with
respect to their observed position was below 0.3 Å.
Backbone flexibility in ISE-dock
The "tearing off" approach that was undertaken to include side chain
flexibility of proteins is not suitable for flexibility of the backbone, due to
various technical limitations that are posed by the original code of AutoDock.
In this work, multiple protein conformations were used as a "target bank" for
the docking process. The multiple conformations of the protein were
generated using the Iterative Stochastic Elimination algorithm [76, 86]. The
ligand is docked separately to each of the generated protein conformations,
each of which is kept frozen as usual. The results are combined according to
the energy values.
2.4 Rigid protein docking
2.4.1 LGA docking
LGA docking has been proposed to be superior to other methods in
AutoDock [70]. We have used the original (unmodified) AutoDock program to
obtain the results for LGA. As already mentioned, we substituted the default
Pseudo Solis-Wets local optimization by the original Solis-Wets algorithm.
We have also changed the default solution size from 10 to 35 in order to allow
AutoDock to perform as many energy evaluations ($\approx 8.8 \times 10^6$) as
were performed on average by ISE ($\approx 8.6 \times 10^6$).
2.4.2 The data set
We used the public portion of the test set used by Perola et al. [84] in their
comparison of docking algorithms. The original test set consisted of 150
pharmaceutically relevant protein-ligand structures, of which 100 are
publicly available. The preparation of these structures was performed by
the Perola group [84]. We converted these files to mol2 format. Protein
structures were kept in their bound conformation and were assigned charges from
the Kollman (United Atoms) force field [111, 112]. In this force field, heavy
atoms and the non-polar hydrogen atoms adjacent to them are treated as
single (united) spheres, and the only hydrogen atoms that are accounted for
individually are the polar ones. Ligands, co-factors and metal ions were
assigned charges using the Gasteiger-Hückel method [28], which, unlike the
former, treats all the atoms separately. Charge assignments were performed
using Sybyl® 7.1. Ligand rotatable bonds were marked by visual examination.
After the preparation, any existing co-factors were merged with the
protein and treated as part of the appropriate protein model. Atom types
were assigned automatically by the appropriate utilities in the AutoDock suite.
Of the 100 complexes, 19 were excluded for the following reasons:
- 1 complex (PDB code: 830c) containing both zinc and calcium
- 6 complexes with a co-factor that contains phosphorus atoms (due to
  lack of validated parameters): 1aoe, 1dib, 1dlr, 1frb, 1syn, 7dfr
- 8 complexes with ligands that contain more than 8 atom types (this
  limitation is imposed by AutoDock): 1qwx, 1fls, 1mq5, 1mq6, 1gl9,
  1ydt, 2csn
- 4 structures with incomplete protein structure in proximity to the
  ligand (cutoff: 10 Å): 1f4f, 1f4g, 1ohr, 1uvs
The remaining 81 complexes are listed in Table 2.1.
Table 2.1: PDB codes of the 81 complexes in the rigid protein test set.
13gs  1cim  1f0r  1h1s  1k1j  1nhu  1ydr  5std
1a42  1d3p  1f0t  1h9u  1k22  1nhv  1yds  5tln
1a4k  1d4p  1f4e  1hdq  1k7e  1o86  2cgr  7est
1a8t  1d6v  1fcx  1hfc  1k7f  1ppc  2pcp  966c
1afq  1efy  1fcz  1hpv  1kv1  1pph  2qwi
1atl  1ela  1fjs  1htf  1kv2  1qbu  3cpa
1azm  1etr  1fkg  1i7z  1l8g  1qhi  3erk
1bnw  1ett  1fm6  1i8z  1lqd  1qpe  3ert
1bqo  1eve  1fm9  1if7  1m48  1r09  3std
1br6  1exa  1g4o  1iy7  1mmb  1thl  3tmn
1cet  1ezq  1h1p  1jsv  1mnc  1uvt  4dfr
2.4.3 Comparisons and their analysis
In this work I compare the performance of ISE-dock to that of AutoDock,
Glide and GOLD. AutoDock was chosen because it allows a direct comparison of
the ISE and LGA search algorithms, without any bias from
the scoring function. The latter two programs showed the best performance
in a previous extensive analysis by Perola et al. [84]. The ISE and LGA results
reported are average values of three independent simulations with different
seeds of the random number generator. Glide and GOLD results
were kindly provided by Dr. E. Perola.
The ISE algorithm is compared to LGA using the same energy function.
ISE differs from GOLD and Glide in both the search strategy and the
scoring. Such differences, in search and in scoring, characterize most
comparisons of docking programs. In all the tested programs, ligand
flexibility (torsion angles only) is accounted for, while the proteins are kept
rigid. Several protocols for comparing docking algorithms have been
proposed [34, 13, 110, 44, 25]. The choice of a particular comparison
procedure frequently depends on the particular problem, the data set and the
programs under investigation. In order to be able to compare our results
to those obtained by Perola et al. with Glide and GOLD, we followed
their criteria [84] and used the RMSD of the top ranking solution versus the
corresponding crystal structure, and the best RMSD within the top 20
solutions. We have also used the best RMSD within the entire docked set of
ISE and LGA as an additional criterion. This latter criterion indicates the
ability of the algorithm to cover the solution space, and is less dependent on
the scoring function. RMSD is calculated using the heavy atoms of the ligands.
To examine the statistical significance of differences in those criteria, we
added the paired t-test (PTT).
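The three RMSD criteria can be expressed compactly in Python. The sketch below is an illustration under simplifying assumptions: poses are given as (energy, coordinates) pairs with heavy atoms listed in the same order as in the reference, no fitting is applied, and symmetry-equivalent atoms are not handled. The function names are hypothetical.

```python
import math

def rmsd(coords, reference):
    """Root mean square deviation between two equally ordered atom lists."""
    total = sum((p - q) ** 2
                for a, b in zip(coords, reference)
                for p, q in zip(a, b))
    return math.sqrt(total / len(reference))

def docking_criteria(poses, reference, top_n=20):
    """Return (RMSD of top pose, best RMSD in top_n, best RMSD overall).

    `poses` is a list of (energy, coords) pairs; lower energy ranks higher.
    """
    ranked = sorted(poses, key=lambda pose: pose[0])
    values = [rmsd(coords, reference) for _, coords in ranked]
    return values[0], min(values[:top_n]), min(values)
```

The third value depends only on whether a near-native pose was sampled at all, which is why it is less sensitive to the scoring function than the first two.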
2.4.4 Paired t-test
The need to apply statistical methods for comparing docking algorithms has
been recently suggested [13]. The RMSD results for each docking
experiment are available for each of the algorithms to be compared (obtained
either by us or by Perola et al. [84]); therefore, we can compare the results
of ISE-dock to those obtained by the others using a paired t-test (PTT).
We compare the paired RMSD differences (for all protein complexes docked
by two algorithms, ISE and another) under the assumption that the paired
differences are independent and identically normally distributed.
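The test statistic of a paired t-test is the mean of the paired differences divided by its standard error. A minimal Python sketch using only the standard library is shown below; the function name is hypothetical, and converting the statistic into a p-value (from the t distribution with n - 1 degrees of freedom) is left to a statistics package or table.

```python
import math
from statistics import mean, stdev

def paired_t(rmsd_a, rmsd_b):
    """Paired t statistic and degrees of freedom for two matched samples."""
    if len(rmsd_a) != len(rmsd_b):
        raise ValueError("samples must be paired")
    diffs = [a - b for a, b in zip(rmsd_a, rmsd_b)]
    n = len(diffs)
    # standard error of the mean difference (sample standard deviation)
    se = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / se, n - 1
```

For example, with the made-up paired values [1.0, 2.0, 3.0, 4.0] and [0.5, 1.0, 2.5, 3.0], the differences are 0.5, 1.0, 0.5 and 1.0, which gives t ≈ 5.196 with 3 degrees of freedom.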
2.4.5 Comparing CPU time
Variable computation times are the result of differences in CPU, in algorithm
implementation and in other program-specific issues. As both LGA and
ISE are parts of the same program, and most (>85%) of the computation time is
spent on energy evaluations, we use the number of energy evaluations as
an independent estimate of time performance. In order to establish a common
basis for comparing performance, we changed the default output size for LGA
from 10 poses to 35. This size was chosen so that the average number of
energy evaluations using LGA ($\approx 8.8 \times 10^6$) would approximately
equal the average number of energy evaluations performed during ISE
optimizations ($\approx 8.6 \times 10^6$).
2.4.6 Energy funnels
The existence of funnels in the energy landscape has been proposed for
protein folding [109, 62, 100, 105] and has been further expanded to
protein-protein [99, 115] and protein-ligand recognition [96, 109]. It has been
suggested that the shape of such funnels correlates with the nature of binding
between the molecules [62]. In the part of this work that deals with flexible
ligand - rigid protein docking, I utilize the ability of ISE to generate large
populations of near-optimal solutions to estimate the energy landscape in the
vicinity of the minimum. For each docked complex, I construct an energy vs.
RMSD plot.
2.5 Flexible protein docking
As stated above, protein flexibility is introduced into this work as a series
of experiments that serve as a proof of concept that ISE can successfully
represent protein flexibility. Therefore, the experimental design is limited
to several typical cases and no statistical analysis of the results is
performed. The test cases were chosen so that the flexible regions in the
proteins are small and in proximity to the bound ligand.
The ability of ISE-dock to represent changes in the protein backbone
was tested using two structures of collagenase with inhibitors. In our
representation of backbone flexibility, we follow other studies that produce
multiple backbone conformations and dock a ligand to each of them, in order
to identify the protein conformation to which the ligand would preferentially
dock. However, ISE has been shown to produce higher quality backbone
conformations that are close to the experimental ones. Docking to a protein
with flexible side chains was tested on two systems: acetylcholinesterase
(single side chain) and trypsin (several side chains). In all cases, the
structures chosen are X-ray crystal structures from the PDB and represent
real modifications of the protein structures. All the selected complexes are
pharmaceutically relevant.
Docking process
Applying ISE-dock to flexible backbone docking requires initial separation
of the ligand from the receptor-ligand complex. In the next stage, our
ISE-based loop prediction program predicts conformations of the flexible
protein fragments. This program was developed in our group and has been
successfully applied [86, 76]. During the search for optimal backbone
conformations, ISE samples flexible fragments by probing dipeptide
conformations. Dipeptide selection is performed according to the given
sequence. The conformations are evaluated by an energy function that combines
penalties for deviations from peptide geometry and interactions between the
fragment and the rest of the protein. This process results in a set of
conformations sorted by the value of the scoring function. In the evaluation
of interactions, side chains are represented as centered on Cα. The side
chains are added to each backbone conformation in a subsequent step using the
program SCAP [114].
Main chain Following the generation of backbone conformations of a loop
or protein fragment, the ligand is docked into each of a selected set of
protein conformations. This set is limited in number due to computational
restrictions, and also by the energy gap from the lowest energy (global
minimum) conformation. It is reasonable to assume that an energy loss of
5 kcal/mol may be compensated by interactions with a ligand. Thus, we
kept only the backbone conformations within 5 kcal/mol of the global
minimum, in order to pick a small set out of the much larger one produced
by ISE. In each docking experiment, 4096 conformations were generated as
a result of applying ISE to the flexible ligand positions with each protein
conformation. The sets for all protein conformations are merged and sorted,
and only the best 4096 conformations remain for final examination.
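The selection and merging steps can be summarized in a short Python sketch. The data layout ((energy, payload) pairs) and the function names are assumptions made for illustration; only the 5 kcal/mol window and the limit of 4096 poses come from the text.

```python
def select_backbones(conformations, window=5.0):
    """Keep backbone conformations within `window` kcal/mol of the minimum.

    `conformations` is a list of (energy, structure) pairs.
    """
    e_min = min(energy for energy, _ in conformations)
    return [(energy, structure) for energy, structure in conformations
            if energy - e_min <= window]

def merge_poses(pose_sets, keep=4096):
    """Merge per-conformation pose lists and keep the `keep` best by energy.

    Each pose is an (energy, pose) pair; lower energy is better.
    """
    merged = [pose for poses in pose_sets for pose in poses]
    merged.sort(key=lambda pose: pose[0])
    return merged[:keep]
```

Note that merging by energy alone implicitly assumes that docking energies obtained with different protein conformations are directly comparable.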
Side chains To perform ligand docking that includes flexible side chains,
an initial decision must be made as to which side chains will be treated as
flexible. Those specific side chains are then combined with the ligand, thus
becoming as flexible as the ligand is. Preparation of structures for
computations follows the procedure described in Section 2.4.2.
2.5.1 Protein Backbone Flexibility
Test Case of Collagenase
General
The protein family of Matrix Metalloproteinases (MMPs) is responsible for
metabolizing the macromolecular components of the extracellular matrix. The
enzymes of the collagenase subfamily (MMP-1, -8 and -13) are responsible for
cleaving fibrillar collagen. This cleavage is a key process in rheumatoid
arthritis and osteoarthritis [59]. The crystal structure of collagenase-3
(MMP-13) with RS-130830 (a diphenyl-ether sulphone based hydroxamic acid) has
been solved at a resolution of 2.4 Å (PDB code: 456c) [59]. Fibroblast
collagenase-1 (MMP-1) in complex with RS-104966
(N-hydroxy-2-[4-(4-phenoxy-benzenesulfonyl)-tetrahydropyran-4-yl]-acetamide)
has been solved at 1.9 Å resolution (PDB code: 966c) [59]. The ligands
RS-130830 and RS-104966 are chemically similar, the only difference being an
additional substitution on one of the two phenyl rings. The molecules have
different specificity profiles towards MMP-1 and MMP-13 (Table 2.2).
Table 2.2: Affinities to collagenase
                      RS-130830   RS-104966
PDB complex           456c        966c
$K_i$ (nM), MMP-1     590         23
$K_i$ (nM), MMP-13    0.52        0.13
The two proteins share 59% sequence identity, and have very similar 3D
structures. The major difference between the structures of these two proteins
is in a few characteristics (orientation, amino acid content and length) of a
single loop: residues 243-255 (13 amino acids) in PDB structure 456c and
residues 239-249 (11 amino acids) in structure 966c. This loop, together with
the residue at position 218 (according to the SWISS-PROT numbering of 456c),
forms the specificity pocket: the sub-site that is responsible for collagenase
specificity as well as for the specificity of quite a few other MMPs. Figure
2.2 presents a structural alignment of the two collagenases. As one may see,
the backbone traces of the two differ mainly in the fragments Gly 248 - Met
253 in 456c (6 amino acids) and Ser 244 - Leu 247 in 966c (4 amino acids).
These fragments belong to the S1′ specificity pocket.
Figure 2.2: Structural alignment of 456c and 966c. Backbone traces of the proteins are
color coded according to the distance (in Å) between the aligned backbone atoms.
RS-130830 (red) and RS-104966 (green) are shown as stick models.
Comparisons and their analysis
Our flexible backbone docking involves initial prediction of loop positions,
rigid docking of the ligand to these multiple loops and then combining the
results into a single set. The computational effort that is involved in this
multistep methodology is much greater than the computational cost of
rigid-protein docking. Due to the need to apply a few programs in order to
obtain a set of final results, the effect of the additional investment of CPU
time can be neither assessed nor isolated. Therefore, we do not compare
flexible-backbone docking to rigid-protein docking.
Protein backbone conformations of fragments or loops are produced by
applying ISE to the structure of the protein in the protein-ligand complex,
without the presence of the ligand. To evaluate the results, we compare the
fragment conformations to the original loop/fragment conformation in the
complex by measuring the deviations of backbone atoms (using RMSD).
For the ligand, its predicted position is compared to the one observed
crystallographically using the RMSD of heavy atoms. The ligand RMSD of the top
scored conformation, the best RMSD in the top 20 and the best RMSD in all
available solutions are reported. Ideally, the RMSD of all movable atoms
(protein backbone, side chain atoms and the ligand) needs to be calculated. To
calculate the RMSD over this set of atoms, one needs to take into account the
numerous local axes of symmetry present in any protein-ligand complex. Phenyl
rings, carboxylate and guanidine groups are examples of substructures that
contain such axes. Correct accounting for symmetry axes is a complex
combinatorial problem with exponential complexity. Due to the preliminary
nature of the flexible protein docking experiments, and in order to simplify
the process of evaluation, I decided to use two values simultaneously: the
RMSD of the ligand and the RMSD of the protein backbone atoms.
2.5.2 Flexibility of a single side chain
Test case of acetylcholinesterase
General
Acetylcholinesterase (AChE) plays an important role in regulating the
functions of the central and peripheral nervous systems. This enzyme cleaves
acetylcholine, which is secreted by neuron vesicles into the synapse that
separates the vesicle and the membrane of the next cell in line. Acetylcholine
encounters receptors on that membrane and activates the continuation of the
neuronal transmission. AChE cleaves acetylcholine in a two-step reaction into
choline and acetate, thus terminating the signal. The catalysis occurs in a
very deep, electron-rich binding pocket, which is also called "the gorge" (see
Figure 2.3). The structures of AChE complexed with Huperzine
A (PDB code: 1vot) and with Aricept (PDB code: 1eve) differ mainly in
the position of the side chain of one residue, Phe 330 (Figure 2.4) [97]. When
AChE is complexed with Huperzine A (1vot), Phe 330 adopts a conformation
that keeps the binding gorge closed. When, on the other hand, the
bulkier Aricept molecule is present in the complex (1eve), Phe 330 adopts
a conformation that allows the entry of this bigger ligand to the binding
pocket. The difference between the two conformations is in the χ1 angle
(1eve: 105.3°; 1vot: 58.9°).
Figure 2.3: Cross section of AChE complexed with acetylcholine (PDB code: 2ace), colored
by (A) the partial charge of the atoms and (B) the residue type (colored by PyMol):
hydrophobic (GILMPV), white; aromatic (FWY), magenta; semipolar (C), yellow;
polar (HNQST), cyan; positive (KR), blue; negative (DE), red. Acetylcholine is
colored blue in both panes.
Comparisons and their analysis
To assess the performance of ISE-dock, the results of rigid-protein docking and
cross-docking to AChE (1eve and 1vot) are compared to those obtained by
flexible docking. A total of 4 cross-docking experiments are performed with
each method. The comparison is done using the RMSD of the ligand only (heavy
atoms), due to the very strong similarity between the backbones of 1eve and
1vot, which differ by an RMSD of only 0.2 Å. In addition, the RMSD of all
movable heavy atoms is calculated (including side chains). This allows an
evaluation of our docking by the commonly accepted RMSD criteria, but does not
allow a comparison of rigid and flexible docking.
As with the rigid docking, we use the three criteria of (1) top ranked
solution, (2) best out of the top 20 poses, and (3) best available pose to
compare to the crystallographic structure.
Figure 2.4: AChE complexed with Huperzine A (PDB code: 1vot, light gray) and with
Aricept (PDB code: 1eve, dark gray). The ligands and Phe 330 side chains from both
complexes are highlighted using sticks.
2.5.3 Flexibility of several side chains
Test case of trypsin
General
Trypsin is a serine protease in the gastrointestinal tract, where it is
responsible for protein hydrolysis. It is a very well studied protein with numerous
available 3D structures in the PDB. Due to its role in the digestive system,
trypsin is not very selective, as it is supposed to bind and cleave a very broad
range of proteins and peptides. Due to this nonspecific binding, many
structurally diverse small molecules bind to trypsin. A set of 10 protein-ligand
structures was chosen as a data set for this study. Their PDB codes are:
1ppc, 1pph, 1tng, 1tnh, 1tni, 1tnj, 1tnk, 1tnl, 1tpp and 3ptb. This data set
is similar (but not identical) to that used by Kramer et al. in their
evaluation of the FlexX program[49]. In the data set used here, three residues
(Leu 99, Gln 192, and Gln 221) demonstrate conformational changes of their
side chains. These residues were identified by visual
examination of the binding pockets of all the proteins in the set. Figure 2.5
summarizes the trypsin set.
Figure 2.5: Trypsin data set. 10 superimposed trypsin structures: 1ppc, 1pph, 1tng, 1tnh,
1tni, 1tnj, 1tnk, 1tnl, 1tpp and 3ptb. The ligand molecules and the residues that are
treated as flexible are shown as sticks. The remaining parts of the proteins are shown as
backbone trace.
On average, the trypsin data set contains 4.1 rotatable bonds per
complex; the number varies because of the different ligands in these complexes.
The addition of three flexible side chains results in more than a threefold rise
in the number of rotatable bonds (average of 12.4 bonds per complex). For each
rotatable bond, ISE-dock has to consider 60 possible angles, one for each 6°. This
leads to a dramatic exponential increase in the problem size ISE-dock has
to consider: 10^16 combinations in rigid-protein docking vs 10^31 combinations
after the inclusion of protein flexibility. Due to this increase in problem
size, a fair comparison of the results of flexible docking to those obtained by
rigid docking is problematic. Thus, in this proof of concept examination, the
results are not compared to those obtained by rigid docking, but evaluated
as described below.
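The size of this increase can be checked with a short calculation. The sketch
below counts only the torsional variables (60 values per rotatable bond); it
assumes that the remaining variables (ligand position and orientation) are the
same in both setups and therefore cancel out in the ratio. The function name is
illustrative.

```python
import math

VALUES_PER_BOND = 60  # one value per 6 degrees, as stated in the text


def log10_torsional_combinations(n_bonds, values_per_bond=VALUES_PER_BOND):
    """log10 of the number of torsion-angle combinations for n_bonds bonds."""
    return n_bonds * math.log10(values_per_bond)


rigid = log10_torsional_combinations(4.1)      # average for ligands only
flexible = log10_torsional_combinations(12.4)  # ligands plus three side chains

# Roughly 15 additional orders of magnitude, which is in line with the
# quoted change from 10^16 to 10^31 combinations.
added_orders_of_magnitude = flexible - rigid
```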
2.5.4 Comparisons and their analysis
Having 10 trypsin-ligand complexes, it is possible to construct a 10x10 cross
docking matrix. The values of RMSD (as calculated over all the movable
heavy atoms) of the top scored solution, the best RMSD over the top 20
poses and the best available RMSDs are reported and analyzed.
Chapter 3
Flexible ligand - rigid protein docking
Flexible ligand - rigid protein docking was compared between ISE and other
algorithms by assigning the results to different RMSD threshold bins. Docking
results are summarized in Table 3.1. In this table, three criteria for
comparing the methods are presented: RMSD of top scoring poses, best
RMSD in top 20 poses, and best RMSD in all available poses. The first criterion
assumes that the scoring function, which is related to energies, is an exact
measure of stability, and therefore concentrates on the best scored results. The
second criterion assumes that the scoring function may not be able to
distinguish between the pose with the best RMSD and some other poses, and limits
those to the best 20 by energy. The third criterion extends the second to a much
larger number of poses. The table presents the minimal RMSD for the set of
81 proteins, the maximal RMSD, its mean, median and standard deviation
and, finally, the t-test probability for ISE with respect to each of the other
algorithms.
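The rows of such a summary can be reproduced with the Python standard library.
The sketch below is an illustration under two assumptions that the text does
not confirm: that the standard deviation is the sample standard deviation, and
that PTT denotes a paired t-test over the same set of complexes. Only the t
statistic and the degrees of freedom are computed; converting them to a
probability requires the t distribution, which is left out here. The RMSD
values are hypothetical.

```python
import math
import statistics


def summarize(rmsds):
    """Minimum, maximum, mean, median and sample SD of a list of RMSD values."""
    return {
        "min": min(rmsds),
        "max": max(rmsds),
        "mean": statistics.mean(rmsds),
        "median": statistics.median(rmsds),
        "sd": statistics.stdev(rmsds),
    }


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two paired samples with at least two values")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-complex RMSD values for two methods on the same complexes.
method_a = [1.0, 2.0, 3.0, 4.0]
method_b = [1.5, 2.0, 3.5, 5.0]
summary_a = summarize(method_a)
t_value, dof = paired_t_statistic(method_a, method_b)
```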
Table 3.1: Summary of docking results by ISE, LGA, Glide and GOLD.
          RMSD of top pose             Best RMSD in top 20 poses    Best RMSD in all
                                                                    available poses
          ISE    LGA    Glide  GOLD    ISE    LGA    Glide  GOLD    ISE    LGA
Minimum   0.52   0.39   0.3    0.41    0.25   0.31   0.3    0.34    0.2    0.31
Maximum   5.99   5.95   10.63  10.19   2.46   3.65   10.36  6.35    1.64   2.74
Mean      1.73   1.9    2.57   3       0.98   0.99   1.49   1.56    0.73   0.89
Median    1.33   1.55   1.63   2.17    0.84   0.81   1.11   1.1     0.69   0.72
SD        1.14   1.39   2.58   2.44    0.51   0.62   1.44   1.26    0.37   0.5
P(PTT)           0.09   0      0              0.46   0      0       0.01   0.006
ISE-dock 4096 poses, AutoDock 35 poses.
The detailed results for each of the complexes are presented as Table C.1 in
Appendix C. In the analysis of Table 3.1 and the additional figures presented
below, I demonstrate that the performance of ISE-dock is in many aspects
better than the performance of several well established docking programs.
One should note that the results for Glide and GOLD, which were obtained
by Perola et al.[84] and are reported here, differ slightly from those
published in the original work, because the original results were obtained using
100 publicly available and 50 internal company structures[84], as opposed to
the subset of 81 publicly available structures in this report.
3.1 Top scoring poses
Figure 3.1 presents the fraction of top scored poses in the full set of docking
experiments that fall within a given RMSD threshold from the crystal structures.
It may be seen that ISE-dock achieves better results than the other three
programs when considering 50% of the complexes or more. ISE did not
dock any complex with a top scored solution below 0.5 Å. At the remaining
threshold values, ISE and LGA outperform (to various degrees) Glide and
GOLD with respect to the number of structures with top scored solutions
below the corresponding threshold. For thresholds above 1.0 Å, there is a
slight advantage of ISE over LGA, which increases for larger threshold values.
About 70% (65% for LGA) of the top scoring structures are found by ISE-dock
to be under 2.0 Å RMSD from experiment and nearly 85% (76% for
LGA) are found under 3.0 Å.
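The percentages quoted above are cumulative fractions of complexes below an
RMSD threshold. A minimal sketch of this binning is given here; the threshold
values are those discussed in the text, while the function name and the RMSD
values are hypothetical.

```python
def fraction_below(rmsds, thresholds=(0.5, 1.0, 2.0, 3.0)):
    """Fraction of complexes whose RMSD is strictly below each threshold."""
    n = len(rmsds)
    return {t: sum(r < t for r in rmsds) / n for t in thresholds}


# Hypothetical top-pose RMSD values (in Angstrom) for five complexes.
fractions = fraction_below([0.4, 0.9, 1.5, 2.5, 3.5])
```

Passing the best RMSD among the top 20 poses, or among all available poses,
instead of the top-pose RMSD gives the corresponding curves for the other two
criteria.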
The mean and median RMSD values for the top scored poses, as well as
the standard deviations, are better with ISE than with LGA, Glide or GOLD.
The PTT probabilities for ISE results vs the others are: LGA: P=0.09, Glide:
P=0.002 and GOLD: P<0.001. P is the probability that the difference between the
algorithms is random, as calculated by PTT.
Top scoring poses are complexes of best interaction energy, and are
expected to show the lowest RMSD from the experimental structure. However, they
are frequently found to have larger RMSD values due to (1) limited inclusion
of flexibility and (2) limitations of the scoring functions, which compromise
between speed and quality. Still, these scoring functions are expected to be
good enough to identify the best answers among the top results for a docking
experiment, and the number 20 was chosen[84] to probe for such best RMSD
results.
Figure 3.1: Top single docking poses at different RMSD bins with respect to crystal
structures, 4 different programs. Results for Glide and GOLD were obtained by Perola et
al.[84].
3.2 Top 20 poses
Comparison of top 20 poses demonstrates that ISE-dock outperforms both
Glide and GOLD and shows better or similar performance, compared to
AutoDock's LGA. The mean and the median RMSD values of the best out
of the top 20 poses obtained by ISE are similar to those obtained by LGA
and are better than those obtained by the other two algorithms. Pairwise
comparison shows that the performances of ISE and LGA on the top 20 poses
are statistically indistinguishable (P=0.46). Examination of the best 20 docking
poses shows that ISE is clearly better than Glide and GOLD, with a probability
P<0.001 with respect to either of these two (see Table 3.1). Figure 3.2
demonstrates that LGA and ISE have an advantage over Glide and GOLD for the
top 20 poses in all RMSD ranges. ISE results for the 0.5 Å, 2.0 Å and 3.0 Å
thresholds are better than those of LGA. ISE alone produced at least one 3.0 Å
or better solution among the top 20 poses for the entire test set (100.0%
compared to 97.5%, 90.1% and 87.6% for LGA, Glide and GOLD, respectively). In 98%
of the examined molecules, ISE produced solutions that are closer than 2.0 Å
to the experimental structure. Examination of the top 20 poses is most
meaningful for comparing between the programs, as it appears to indicate that
the sampling conducted by ISE-dock is indeed more thorough than the sampling
of the other programs.
Figure 3.2: Top 20 docking poses, RMSD to corresponding crystal structures. Results for
Glide and GOLD were obtained by Perola et al.[84].
3.3 Solution space coverage
ISE's ability to generate very large populations of near-optimal solutions
results in much better coverage of solution space near the (global) minimum.
This is borne out by comparing the best RMSD in the full set of solutions by
ISE and LGA in similar CPU time (4096 and 35 solutions, respectively). The
population obtained in standard runs of ISE is larger than that obtained by
LGA by more than 100-fold. This significantly increases the chance of
finding docking poses with lower RMSD values. It is reasonable to compare
populations that differ that much in size, as we show in the discussion of
alternative binding modes in the results section. I could not compare
extended docking populations for Glide and GOLD, as no such data were
reported. It should be emphasized that ISE's 4096 solutions in this case, and
any number of solutions in other cases, are not merely poses encountered
during the random search, but are the best ones following the probing of the
whole space. The PTT probability value for comparison of the two docking
sets is P=0.006. ISE results are better with respect to all the terms in the
five-number summary (minimum, maximum, average, median and standard
deviation) of the best RMSD in the entire solution set (Table 3.1). When
examining the percentage of complexes with at least one solution below a
certain threshold, as depicted in Figure 3.3, the most prominent difference
between the algorithms is at 0.5 Å: 32.0% vs 17.3% in favor of ISE. This
difference drops to 3.7% in favor of ISE at 2.0 Å. All the 81 complexes
were docked by ISE with at least one solution below 2.0 Å. LGA succeeded
in docking all the complexes with at least one solution below 3.0 Å. These
findings suggest that populations docked by ISE, combined with a more accurate
scoring technique, may lead to better detection and identification of relevant
docking results.
The ISE docking population (comparing by CPU time, 4096 top solutions
of ISE vs 35 of LGA) is much more diverse in its poses than that
produced by LGA. We clustered the poses using the Sequential Leader
Clustering algorithm[36], with a default distance criterion of 1.0 Å. The average
number of clusters for the 81 molecules is 1870 for ISE and 14 for LGA.
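For readers unfamiliar with leader clustering, a minimal sketch of the general
scheme follows. It assumes the common first-fit variant: poses are visited in
order, each pose joins the first cluster whose leader lies within the distance
criterion, and otherwise it becomes the leader of a new cluster. The
implementation used in this work[36] may differ in details such as the
treatment of ties, and the names pose_rmsd and leader_clusters are
illustrative.

```python
import math


def pose_rmsd(pose_a, pose_b):
    """RMSD between two poses given as equally ordered lists of (x, y, z)."""
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(pose_a, pose_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(pose_a))


def leader_clusters(poses, threshold=1.0, distance=pose_rmsd):
    """Return clusters as lists of pose indices; the first index is the leader."""
    clusters = []
    for index, pose in enumerate(poses):
        for members in clusters:
            leader = poses[members[0]]
            if distance(leader, pose) <= threshold:
                members.append(index)
                break
        else:
            # No existing leader is close enough: this pose starts a new cluster.
            clusters.append([index])
    return clusters


# Hypothetical one-atom "poses" placed along the x axis (in Angstrom).
toy_poses = [[(0.0, 0, 0)], [(0.5, 0, 0)], [(3.0, 0, 0)], [(3.4, 0, 0)], [(0.9, 0, 0)]]
toy_clusters = leader_clusters(toy_poses, threshold=1.0)
```

The result depends on the order in which poses are visited, so sorting the
poses by score first makes the best-scored pose of each cluster its leader.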
Figure 3.3: Top available docking poses produced in equal CPU times, RMSD to
corresponding crystal structures. The numbers of poses are 4096 (ISE) and 35 (LGA).
3.4 Time performance
We used the time performance of ISE and LGA in order to choose
approximately equal processing times and analyze the number of solutions obtained
in that span of time. The average time needed to obtain 4096 docking
solutions on an Intel® Xeon™ 3 GHz computer, using ISE with the current
settings, was about 7.5 minutes. The average time needed to obtain 35
solutions using LGA was about 8.3 minutes. As mentioned above, the time
required by LGA is linear with the number of solutions. Thus, it is expected
that more than 16 hours are required to obtain 4096 docking solutions with
LGA. For AutoDock, it has been recently suggested to increase the
reliability of results by obtaining more solutions and by increasing the number
of evaluations[66]. Such an increase has a substantial toll in computer time,
which is absent in ISE. We could not compare the time performance of ISE-dock
to those of Glide and GOLD. Results for the quality of the solutions
with these programs are reported here as they appear in Perola et al.[84].
Figure 3.4: Number of iterations before switching to exhaustive search as a function of
initial combinatorial size (number of initial combinations).
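The estimate of more than 16 hours for LGA follows from a simple linear
extrapolation of the timings given above. The short sketch below reproduces
the arithmetic; the function name is illustrative, and the linear scaling is
the assumption stated in the text.

```python
LGA_MINUTES = 8.3      # average time for 35 LGA solutions
LGA_SOLUTIONS = 35
TARGET_SOLUTIONS = 4096


def extrapolate_linear(minutes, n_solutions, target_solutions):
    """Time for target_solutions, assuming time grows linearly with solutions."""
    return minutes / n_solutions * target_solutions


lga_hours = extrapolate_linear(LGA_MINUTES, LGA_SOLUTIONS, TARGET_SOLUTIONS) / 60
# lga_hours is a little above 16, compared with about 7.5 minutes for ISE.
```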
The initial number of total possible combinations for ISE docking ranged
from 10^12 to 10^34 depending on the number of ligand rotatable bonds,
which ranged between 2 and 14. The number of iterations (between 50 and 76
for different molecules) needed to reduce the size of the problem below the
threshold (10^5 combinations for switching to exhaustive computations) is
approximately linear with respect to the logarithm of the initial problem size.
The graph that describes this relationship is shown in Figure 3.4. Based on
that linearity, it should be possible to extend the number of variables and
values to include protein side chains, main chain angles as well as additional
degrees of freedom.
3.5 Multiple binding modes
A growing body of data supports the existence of multiple binding modes of
ligands to receptors[18, 9, 44, 91, 57, 27, 35, 81]. In order to learn about
multiple binding modes from ISE-dock, the shape of energy landscapes around
minima in energy vs RMSD graphs of ISE results is examined. These plots
may be roughly divided by visual examination into three groups: those with
one distinct funnel, those with multiple funnels and those with no distinct
funnel. It has been suggested[62] that the existence of a single canyon at the
bottom of the energy landscape corresponds to a stable structure, that multiple
minima might indicate the existence of multiple binding modes, and that rugged
and unshaped energy vs RMSD plots may be the result of looser or nonspecific
binding, induced fit phenomena or domain swapping.
Figure 3.5A shows a representative of a few complexes that appear on
energy vs RMSD plots with a single funnel-like region (PDB code 1yds).
As expected, in this case, the docking solutions are structurally close to
the crystallographic pose and to one another (Figure 3.5B). Figure 3.6A
demonstrates an energy vs RMSD plot with two funnels (PDB code 1bqo),
while Figure 3.7A shows such a plot with no distinct funnel (docking results
of 1hpv). As one may see from Figure 3.7B, there are at least two predicted
binding modes for this complex, which is in agreement with our previous
suggestion. In Figure 3.7B, the ligand positions are spread over a large
conformational variation. Energy vs RMSD plots of the entire data set of
81 complexes after a single docking run are presented in Figures C.1-C.7
(Appendix C.2).
Figure 3.5: A: Energy vs RMSD plot for docking populations of the complex 1yds obtained
with ISE, showing a single distinct funnel. B: the same plot for 35 solutions obtained by
LGA. The plots are shown using the same scale. C: The first 35 solutions (dark lines)
docked by ISE vs the ligand in the crystal (gray sticks). Receptor residues with at least
one atom within 5.5 Å of the ligand are shown as light gray cartoon. All structures in this
work were visualized using PyMol[15].
Figure 3.6: A: Energy vs RMSD plot for docking populations of the complex 1bqo obtained
with ISE, showing two distinct funnels. B: the same plot for 35 solutions obtained by LGA.
The plots are shown using the same scale. C: The crystal structure of the ligand (gray
sticks) and the first 35 solutions (dark lines) docked by ISE.
In 27 cases (34%), the span of energy for 4096 solutions between the
global minimum (GM) and the docking solution of highest energy is less than 5
kcal/mol. In 50 cases (61%), all 4096 solutions are within 5-15 kcal/mol
from the GM, and in only 4 cases (5%), the energy spread is larger than 15
kcal/mol. Figure 3.8 shows the cumulative percentage of solutions (for 81
complexes, each with 4096 poses) with increasing energy gaps from the GM,
thus clarifying that most conformations are close to the GM. The 4 plots
with high energy minima (1fm9, 1hpv, 1qbu, 3std) have, as in Figure 3.7A, no
distinct funnel. The docking poses of these 4 complexes have no single binding
mode, but are dispersed. The main feature of these complexes is that the ligands
are deeply buried in the binding pockets (data shown for 1hpv, Figure 3.7).
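The grouping of complexes by energy span can be expressed in a few lines. The
sketch below is illustrative: the bin limits of 5 and 15 kcal/mol are taken
from the text, whereas the function names, the handling of values that fall
exactly on a limit, and the energies are assumptions.

```python
def energy_span(energies):
    """Gap between the global minimum and the highest-energy solution."""
    return max(energies) - min(energies)


def span_category(energies, low=5.0, high=15.0):
    """Assign a docking population to one of the three energy-span groups."""
    span = energy_span(energies)
    if span < low:
        return "<5"
    if span <= high:
        return "5-15"
    return ">15"


# Hypothetical docking energies (kcal/mol) for three small populations.
categories = [
    span_category([-10.0, -9.0, -8.0]),
    span_category([-10.0, -6.0, -2.0]),
    span_category([-20.0, -10.0, 0.0]),
]
```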
Figure 3.7: A: Energy vs RMSD plot for docking populations of the complex 1hpv obtained
with ISE, showing a scatter of the results. B: the same plot for 35 solutions obtained by
LGA. The plots are shown using the same scale. C: The crystal structure of the ligand
and the first 35 solutions docked by ISE.
Figure 3.8: Cumulative fractions (Y-axis) of 81 ISE docking complexes with an energy
span between the global minimum of each (pose number 1) and the other 4095 poses,
below the given threshold (X-axis).
3.6 PDB data supports distinct funnels
Twenty four plots with multiple distinct funnels are found in our test set
(1azm, 1bqo, 1cim, 1eve, 1f4e, 1fm6, 1h1p, 1h9u, 1hdq, 1if7, 1iy7, 1jsv, 1k7e,
1kv1, 1qhi, 1qpe, 1r09, 1uvt, 1ydr, 3cpa, 3std, 4dfr and 5std). Ligands of
two of the twenty four complexes are present in the PDB in complexes with
other proteins (5-acetamido-1,3,4-thiadiazole-2-sulfonamide from 1azm in 9
complexes; 6-O-cyclohexylmethyl guanine from 1h1p in 2 complexes) but
display similar binding modes in all of them. One complex (3cpa) contains
glycyl-tyrosine as a ligand, which is not searchable in the PDB as it is not
recognized as a hetero compound. Two complexes have related structures in the
PDB: same or similar proteins with different ligands (1f4e, 1kv1). Of these two,
I would like to concentrate on p38 MAP kinase that was crystallized with an
inhibitor (PDB code: 1kv1; ligand HET ID: BMU)[80]. Another structure
of the same protein exists in the PDB bound to a structurally different
ligand (PDB code: 1kv2, ligand HET ID: B96)[80]. Figure 3.9 demonstrates
that those ligands bind in two different modes. The ligand in 1kv2 is much
larger (527 g/mol) than the ligand in 1kv1 (306 g/mol). An additional
noticeable difference between the two ligands is that the toluyl group of 1kv2
is positioned in the place of the CH2 pyrrole group of the ligand in 1kv1.
Figure 3.9: Complexes 1kv1 (light gray) and 1kv2 (dark gray) superimposed using
backbone atoms. The ligands are shown as sticks and the backbone of the residues
closest to the ligand (within 5.5 Å) is shown as PyMol cartoon.
The energy vs RMSD plot for the 1kv1 complex (Figure 3.10) displays
three distinct funnels with solutions ranked 1, 222 and 270 at their bottom
(marked d1, d222 and d270). These three poses are summarized in Table 3.2.
As may be seen in Figure 3.11, the top scored pose is close to the crystal
structure position (RMSD of 1.37 Å). In the d222 solution (Figure 3.12) the
ligand is positioned in reverse to d1, while in d270 it is positioned so that
chlorophenyl is in the position of toluyl in 1kv2 (Figure 3.13). Generally,
LGA is capable of producing cluster-like structures when plotting the
calculated solution energy vs RMSD from a single structure even when configured
to predict a relatively small number of docking solutions (see for example
docking solutions for complexes 2cgr, 3cpa or 4dfr in section C.2 of the Appendix).
Nevertheless, in the case of 1kv1, the points in Figure 3.10B, representing
35 LGA solutions, are all clustered around a small well defined region in the
E vs RMSD plot and do not suggest any alternative binding modes.
Figure 3.10: Energy vs RMSD plot for docking populations obtained by ISE (A) and LGA
(B) of the complex 1kv1. The plots are shown using the same scale. The best single ISE
solutions at each of the three funnels have ranks 1, 222 and 270 and are marked with
arrows.
Figure 3.11: The best ISE-dock solution for 1kv1 (sticks). The crystal structures of 1kv1
and 1kv2 ligands are shown for comparison (lines). 1kv1 is colored according to: C cyan,
N blue, Cl green. 1kv2 is colored according to: C yellow, N blue, O red.
Figure 3.12: ISE-dock solution for 1kv1, ranked 222 (sticks). The crystal structures of
1kv1 and 1kv2 ligands are shown for comparison (lines). The coloring scheme is identical
to that of Figure 3.11.
Table 3.2: Binding modes of 1-(5-tert-butyl-2-methyl-2h-pyrazol-3-yl)-3-(4-chloro-phenyl)-urea
(from 1kv1)
Pose   E (kcal/mol)   RMSD (Å)
1      -10.51         1.37
222    -9.13          3.95
270    -8.84          4.69
It has been proposed that thyroxine binds to transthyretin in two
antiparallel modes[27, 35, 81]. ISE-dock and AutoDock's LGA were
applied to re-dock the thyroxine ligand from its crystal structure complex with
human transthyretin (PDB: 2rox). The energy vs RMSD plot for the
population obtained by ISE shows at least two distinct funnels with docking
solutions ranked 1st and 2nd (marked d1 and d2, respectively) at the
bottom of the energy funnels. Figure 3.14B shows that the two solutions are
indeed antiparallel. The solutions by LGA do not suggest an alternative
binding mode. Figure 3.14A shows the energy vs RMSD plots for ISE and
LGA docking solutions of 2rox.
Figure 3.13: ISE-dock solution for 1kv1, ranked 270 (sticks). The crystal structures
of 1kv1 and 1kv2 ligands are shown for comparison (lines). The coloring scheme is identical
to that of Figure 3.11.
In AutoDock, the lower number of solutions supplied by LGA compared
to ISE in similar CPU time provides fewer suggestions for ligand binding
modes. This is further emphasized by the smaller number of clusters of LGA
docking compared to ISE-dock, which covers solution space better than
LGA in a similar CPU time. Large ISE populations may thus compensate for
the imperfections in the energy functions.
Figure 3.14: Energy vs. RMSD plot for docking populations of the complex 2rox, obtained
by ISE (A) and LGA (B). The best single ISE solutions at each of the two funnels have
ranks 1 and 2 and are marked with arrows. C: Antiparallel docking solutions ranked 1 and
2 for 2rox (green and magenta sticks respectively). The carbons in the crystal structure
of thyroxine are shown as thin sticks colored cyan. The backbone of the residues closest
to the ligand (within 5.5 Å) is shown in PyMol cartoon representation colored cyan.
Chapter 4
Flexible Ligand - Flexible Protein Docking
4.1 Protein backbone flexibility - test case of collagenase
The coordinates of MMP13 and MMP1 (456c and 966c) were obtained
from the PDB. All the water molecules and metal ions, except for the
catalytic zinc, were removed. The ligands were separated from the protein and
saved in a separate file. As the 456c structure contains two identical chains,
only one of them (A) was used. Alternate positions of the conformationally
flexible loops (residues 248-253 for 456c and residues 244-247 for
966c) were produced by ISE. As any ISE implementation produces multiple
near-optimal solutions, only the conformations that differ from the best
scored one (global minimum) by not more than 5 kcal/mol were chosen for
the next step. For 966c, there were 31 such solutions and the RMSD of backbone
atoms with respect to the crystallographic structure (over the flexible region
only) ranged between 0.09 Å and 0.33 Å. In the case of 456c, only 5 solutions
with energy values within 5 kcal/mol of the global minimum were generated.
Their RMSD values were slightly higher than those of 966c and ranged
between 0.59 Å and 0.61 Å. These RMSD values describe only the backbone of
the flexible protein fragment (loop).
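The selection of loop conformations within an energy window of the global
minimum amounts to a simple filter. A minimal sketch follows; the 5 kcal/mol
window is taken from the text, while the function name and the energies are
hypothetical.

```python
def within_window(energies, window=5.0):
    """Indices of solutions whose energy is within `window` of the minimum."""
    global_minimum = min(energies)
    return [i for i, e in enumerate(energies) if e - global_minimum <= window]


# Hypothetical loop energies (kcal/mol); index 0 is the global minimum.
selected = within_window([-30.0, -27.5, -24.9, -25.0, -10.0])
```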
Loop generation is conducted without the presence of the ligand, and side
chain conformations are generated by SCAP[114] using the optimized
backbone conformations (section 2.5.1, page 58). Although the geometry of the
backbone in both proteins is very close to that observed in the PDB
structures, the prediction of side chain positions has also been performed with
no ligand present. This might be the reason for the relatively high RMSD
values observed in this data set for the positions of the ligands (Table 4.1).
In the top 20 docking solutions the best RMSD values ranged between 1.59 Å
(456c-456c)^a and 2.20 Å (966c-456c). The top scored solutions had RMSDs
between 2.25 Å and 3.49 Å. Nevertheless, the docking results indicate that
ISE-dock has successfully included good docking poses in the final docking
sets of all four docking experiments. This conclusion follows from the best
RMSD values in the entire docking populations of 4096 structures, which
are below 2 Å in all cases. If the ligand and the protein originate from
the same complex, the predictions of ligand poses are even better: 1.33 Å for
456c and 1.18 Å for 966c. The fact that no solution with RMSD <1 Å was
found among the top 20 docking solutions is easily explained by two
factors: (1) the scoring function used during the docking process is not capable
of distinguishing between changes in the protein 3D structure and (2) the loop
structure was optimized with no ligand present in the binding site, while
the subsequent docking process did not allow protein accommodation to the
presence of the ligand.
^a In this work, the names of cross docking experiments follow the [ligand name]-[receptor
name] template.
Table 4.1: Collagenase data set, best ligand RMSD (Å) in top 1, top 20 and all available
(4096) solutions. RMSD of the backbone from the crystal position of the corresponding
solution is also reported.
Ligand  Receptor  Top 1             Top 20            Top 4096
                  Ligand  Backbone  Ligand  Backbone  Ligand  Backbone
456c    456c      2.25    0.61      1.59    0.61      1.33    0.61
966c    966c      2.92    0.28      2.09    0.20      1.18    0.21
456c    966c      3.49    0.13      2.14    0.20      1.61    0.27
966c    456c      2.75    0.61      2.20    0.61      1.76    0.61
Loop conformations were successfully predicted by the ISE algorithm, in
terms of the backbone structure. Nevertheless, due to the small ranges in
backbone RMSD values, no conclusion could be drawn about the ability of
ISE-dock on its own to discriminate between backbone positions.
4.2 Flexibility of a single side chain - test case of acetylcholinesterase
The two AChE structures in this study differ in the side chain position of
the residue Phe 330 (Figure 2.4)[97]. The values of the χ1 angles of 1eve and
1vot are 105.3° and 58.9°, respectively. The results of docking experiments
with the AChE test set are presented in Table 4.2.
Rigid protein docking
Rigid bound docking resulted in good accuracy:
RMSD values of top scoring solutions were 1.85 Å for 1eve and 0.86 Å for 1vot.
The best RMSD values among the top 20 solutions were 0.63 Å and 0.81 Å
for 1eve and 1vot, respectively. When no protein flexibility was allowed,
cross docking experiments, as expected, gave worse results than the native
(bound) docking. A decrease in the quality of the results was observed when
Aricept (1eve), the larger of the two ligands, was cross-docked into the protein
structure that was solved in complex with Huperzine A (1vot). The RMSD
value for the top ranked solution in that case was 2.91 Å. However, the ligand
pose closest to the experimental structure (pose #813 out of 4096 poses) had
an RMSD value of 1.43 Å.
Flexible protein docking - Cross docking
When protein side chain (Phe 330) flexibility was allowed, cross docking of
Aricept resulted in minor improvements of RMSD values in the three tested
parameters. On the other hand, in the cross docking of Huperzine A, the top 1
and the top 20 solutions had worse RMSD values, compared to those obtained by
rigid cross docking. However, a ligand pose much closer to the experimental one
was found for Huperzine A in the entire docking population, with an RMSD of
0.45 Å, compared to 0.65 Å that was obtained without protein flexibility.
Examining the predicted positions of all the movable atoms, one may find that
high quality results were included in the final docking sets of all four
protein-ligand combinations. This conclusion emerges from the RMSD values of the
closest solution out of the 4096 available ones: 0.85 Å for 1eve-1vot and 0.52 Å
for 1vot-1eve cross-docking. On the other hand, the top solution and the top
20 solutions in the cross-docking cases have relatively high RMSD values.
Figure 4.1 demonstrates the results of unbound docking for the AChE data set.
Table 4.2: Results of acetylcholinesterase cross docking experiments (RMSD [Å]). The
results are reported for the best scored solution (Top 1) and the best RMSD values out
of the top 20 and out of all the available solutions (Top 4096). The ligand structures are
listed in rows and the protein structures are listed in columns.
                    Rigid docking     Flexible docking
                                      Ligand position   All movable atoms
          Ligand    1eve    1vot      1eve    1vot      1eve    1vot
Top 1     1eve      1.85    2.91      2.17    2.12      1.95    1.85
          1vot      1.09    0.86      2.60    0.72      2.28    0.70
Top 20    1eve      0.63    1.97      1.87    1.59      1.55    1.19
          1vot      1.03    0.81      2.47    0.70      2.14    0.68
Top 4096  1eve      0.39    1.43      1.29    1.40      0.48    0.85
          1vot      0.65    0.54      0.45    0.24      0.52    0.37
Figure 4.1: The best available docking solution for (A) 1eve-1vot and (B) 1vot-1eve in
unbound (cross-) docking experiments. The docking solutions for all the movable atoms
are shown as lines and the crystal structures are shown as sticks. The protein structures
are shown as backbone trace.
Flexible protein docking - Bound docking
When flexibility of Phe 330 was included, the quality of bound docking results
for Aricept (1eve-1eve) was worse, compared to that obtained without protein
flexibility. Ligand RMSD values for the top scored solution, the best out of
top 20 and the best available solution were respectively 2.17 Å, 1.87 Å and
1.29 Å. In the case of Huperzine A bound docking (1vot-1vot), there was a slight
improvement in the prediction of ligand position: 0.72 Å vs 0.86 Å for the best
scored pose, 0.70 Å vs 0.81 Å for the best out of top 20 solutions and 0.24 Å vs
0.54 Å for the best available solution. The decrease in quality of bound docking
results upon the introduction of flexibility (as was observed in the case of
1eve-1eve) can be related to the increase in problem complexity. On the other
hand, Phe 330 flexibility during docking of Huperzine A into a closed pocket
(1vot-1vot) may have resolved minor clashes and, as a result, gave better
results. Figure 4.2 illustrates the results of bound docking for the AChE data
set.
Figure 4.2: The best available docking solution for (A) 1eve-1eve and (B) 1vot-1vot in
bound docking experiments. The docking solutions for all the movable atoms are shown
as lines and the crystal structures are shown as sticks. The protein structures are shown
as backbone trace.
4.3 Flexibility of several side chains - test case of trypsin
The RMSD values of torsional angles of the three residues that were treated
as flexible in this work are listed in Table 4.3. The structural differences
between the proteins along the data set (in terms of torsional RMSD values)
range from 2.7° (1tng-1tnh) to 62.1° (1ppc-3ptb).
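A torsional RMSD of this kind can be sketched as follows. The text does not
state how the periodicity of the angles is handled, so the wrap-around of
differences into the range from -180 to 180 degrees is an assumption of this
sketch, as are the function names. The first example uses the Phe 330 angles of
1eve and 1vot quoted earlier; the second uses hypothetical angles.

```python
import math


def angle_diff(a, b):
    """Smallest signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def torsion_rmsd(angles_a, angles_b):
    """RMSD in degrees between two equally ordered lists of torsion angles."""
    if len(angles_a) != len(angles_b) or not angles_a:
        raise ValueError("angle lists must be non-empty and of equal length")
    diffs = [angle_diff(a, b) for a, b in zip(angles_a, angles_b)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Single-angle case: the two Phe 330 side chain angles of 1eve and 1vot.
phe330 = torsion_rmsd([105.3], [58.9])

# Hypothetical pair of angles where the naive difference (340) would mislead.
wrapped = torsion_rmsd([170.0, 60.0], [-170.0, 60.0])
```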
Cross docking of the 10 PDB structures resulted in 100 different docking
experiments. The detailed results of all the experiments are listed in
Appendix D. RMSD of top scoring poses, the best RMSD in top 20 poses and
the best RMSD of all the available poses are reported and analyzed in
Table 4.4 and Figure 4.3. These results are assigned to RMSD threshold bins.
The bins are identical to the ones that were used in the rigid protein docking
experiments (Section 4.2, page 87).
Table 4.3: Torsion RMSD (in degrees) of flexible residues in the trypsin data set
      1ppc  1pph  1tng  1tnh  1tni  1tnj  1tnk  1tnl  1tpp  3ptb
1ppc   0.0  43.4  42.6  42.1  43.0  40.7  40.3  41.0  61.5  62.1
1pph  43.4   0.0  35.5  34.5  36.7  34.0  33.3  34.3  58.3  61.0
1tng  42.6  35.5   0.0   2.7   4.3   4.8   5.0   4.6  48.1  37.6
1tnh  42.1  34.5   2.7   0.0   3.5   4.5   3.7   3.3  48.6  37.9
1tni  43.0  36.7   4.3   3.5   0.0   6.9   6.1   4.2  48.0  36.9
1tnj  40.7  34.0   4.8   4.5   6.9   0.0   2.8   4.4  50.6  40.9
1tnk  40.3  33.3   5.0   3.7   6.1   2.8   0.0   3.3  48.8  39.6
1tnl  41.0  34.3   4.6   3.3   4.2   4.4   3.3   0.0  49.2  39.8
1tpp  61.5  58.3  48.1  48.6  48.0  50.6  48.8  49.2   0.0  31.7
3ptb  62.1  61.0  37.6  37.9  36.9  40.9  39.6  39.8  31.7   0.0
Color map: 0 6 12 18 24 30 36 42 48 54 60
Figure 4.3: Top docking poses at different RMSD bins with respect to crystal structures
The overall results of cross docking over the trypsin data set are good.
Contrary to the intuitive expectation, the RMSD values over the diagonals
of Table 4.4 (bound docking) are frequently not the minimum ones. The
ligand from the 1tng complex is docked to all the protein structures with
lower RMSD values, compared to the remaining ligands. On the other hand,
the ligand from 1tpp has the highest RMSD values. The detailed docking
results for the trypsin data set are listed in Table D.1 in the Appendix.
No protein-ligand combination could be docked with a top scored solution
below an RMSD of 0.5 Å. In 5 cases, the entire docking set contained at least
one such pose. In 17 cases, the top scored docking solution had an RMSD
below 2.0 Å; in 74 cases, the 20 top scored solutions contained at least one
pose with RMSD < 2.0 Å. Solutions with RMSD < 3.0 Å were present in all the 100
protein-ligand combinations, and 92 of them contained at least one such
conformation among the top 20 docking solutions.
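Counts of this kind follow from comparing each experiment's RMSD with a threshold. The sketch below is illustrative only: the helper `count_below` and the dictionary layout are assumptions, and the three rows are the ligand-only values of the first three 1ppc experiments copied from Table D.1.

```python
def count_below(results, column, threshold):
    """Count experiments whose RMSD in `column` is below `threshold` (in Å)."""
    return sum(1 for row in results if row[column] < threshold)


# Ligand-only RMSD values (Å) of three experiments, copied from Table D.1.
results = [
    {"pair": "1ppc-1ppc", "top1": 1.72, "top20": 0.87, "all": 0.87},
    {"pair": "1ppc-1pph", "top1": 3.38, "top20": 2.48, "all": 1.42},
    {"pair": "1ppc-1tng", "top1": 2.84, "top20": 1.44, "all": 1.27},
]

for column in ("top1", "top20", "all"):
    print(column, count_below(results, column, 2.0))
```

Applying the same function to all 100 rows with thresholds of 0.5, 2.0 and 3.0 Å would reproduce the binning described in this section.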
Table 4.4: Trypsin data set, RMSD values (Å) of top single docking poses and best
docking poses in top 20 and top 4096 solutions, color-coded
Receptor
Ligand 1ppc 1pph 1tng 1tnh 1tni 1tnj 1tnk 1tnl 1tpp 3ptb
Top 1
1ppc 1.7 3.4 2.8 3.0 2.6 2.0 2.7 3.0 3.4 2.5
1pph 3.9 4.7 4.6 4.3 4.5 3.9 4.3 2.8 4.5 3.6
1tng 1.0 1.1 0.5 0.6 1.0 0.8 0.9 0.6 1.0 1.0
1tnh 3.4 4.4 2.8 3.4 2.6 2.1 3.5 2.1 2.1 2.8
1tni 3.0 3.1 2.8 2.3 2.5 2.3 4.1 3.8 4.1 2.7
1tnj 4.0 3.2 3.8 2.5 3.2 2.3 2.8 2.5 2.2 3.5
1tnk 4.7 4.3 2.9 3.6 2.7 2.6 3.5 4.4 3.7 3.1
1tnl 4.5 2.1 3.0 2.6 2.7 1.8 2.7 3.3 2.1 3.4
1tpp 4.8 5.6 5.2 4.6 4.6 4.5 5.5 5.2 4.2 6.0
3ptb 3.0 3.3 3.2 3.7 3.2 2.7 3.5 3.1 3.1 2.8
Top 20
1ppc 0.9 2.5 1.4 1.9 1.5 1.3 1.1 1.3 1.6 1.7
1pph 2.4 2.1 2.0 2.2 2.0 2.0 2.7 2.0 2.5 2.2
1tng 0.6 1.0 0.4 0.5 0.6 0.5 0.6 0.5 0.6 0.8
1tnh 1.6 1.5 1.4 1.4 1.4 1.3 1.4 1.4 1.5 1.8
1tni 2.0 1.9 1.6 1.7 1.7 1.8 1.6 1.5 1.9 1.9
1tnj 2.2 1.8 1.5 1.5 1.7 1.4 1.4 1.5 1.7 1.7
1tnk 1.9 1.8 1.7 1.6 1.6 1.6 1.4 1.7 1.7 1.6
1tnl 1.9 1.5 1.3 1.3 1.3 1.3 1.4 1.3 1.4 1.4
1tpp 3.1 2.7 4.5 3.4 4.4 3.5 4.1 4.1 2.6 4.0
3ptb 1.8 1.6 2.6 2.3 2.4 2.3 1.9 2.4 1.9 2.2
Top 4096
1ppc 0.9 1.4 1.3 1.6 1.1 1.2 1.0 1.3 1.3 1.4
1pph 2.0 1.7 1.7 1.7 1.6 1.6 1.8 1.4 1.9 1.8
1tng 0.4 1.0 0.3 0.4 0.6 0.4 0.5 0.4 0.5 0.7
1tnh 1.3 1.1 1.1 1.2 1.2 1.1 1.1 1.2 1.2 1.3
1tni 1.3 1.4 1.2 1.4 1.3 1.2 1.2 1.1 1.5 1.2
1tnj 1.3 1.4 1.1 1.2 1.3 1.1 1.1 1.0 1.3 1.2
1tnk 1.6 1.4 1.4 1.2 1.4 1.3 1.2 1.4 1.4 1.4
1tnl 1.1 1.2 1.1 1.2 1.1 1.2 1.1 1.0 1.4 1.2
1tpp 2.1 1.9 2.2 1.8 2.5 2.0 2.0 2.7 1.8 2.8
3ptb 1.0 1.1 1.0 1.2 0.9 1.4 1.2 1.1 0.8 1.2
Color map: 0 0.3 0.6 0.9 1.2 1.5 1.8 2.1 2.4 2.7 3
4.4 Discussion on protein flexibility
Accounting for protein flexibility introduces additional degrees of freedom,
but is a more realistic representation of biological systems. Until recently,
most major docking programs ignored conformational variations of side chains
and backbone of the receptors[97]. Nevertheless, due to the advances in
docking algorithms and in computational power, four out of the five most
cited docking programs for the year 2005[95] allow some extent of protein
flexibility (Table 4.5). Therefore, any newly proposed protein-ligand
docking program is expected to address protein flexibility. Due to time
constraints, handling of protein flexibility by ISE-dock was implemented only
partially, as a preliminary step before further development. In order to
include protein flexibility, the scoring function of AutoDock (and thus of
ISE-dock) was extended and applied to conditions that were not accounted
for during its construction and calibration. This was a trade-off between
accuracy and speed of development in the proof-of-concept phase, and it
has a direct impact on the quality of the results. Although limited to small
regions, protein flexibility handling in ISE-dock is successful and is another
demonstration of the ability of ISE to deal with multiple degrees of freedom
in protein-ligand docking problems. Indeed, docking experiments in all
three test sets succeeded in producing high quality docking poses. The
solutions in the collagenase set contained ligand poses with ligand RMSD
values as low as 1.18 Å (966c-966c), the AChE set contained solutions with
RMSD for all movable atoms of 0.48 Å (1eve-1eve) and the trypsin set
contained docking solutions with even lower RMSD: 0.3 Å (1tng-1tng).
Table 4.5: Current status of protein flexibility handling in ISE-dock and in five
popular docking programs (sorted according to the number of citations in 2005[95])
ISE-dock  Explicit flexibility of several side chains specified by the user.
          Implicit handling of changes in the backbone using pregenerated
          populations.
AutoDock  No protein flexibility in AutoDock ver. 3. The recently released
          ver. 4 allows side chain flexibility of selected residues.
DOCK      Protein flexibility is not implemented.
FlexX     FlexX-Ensemble (formerly known as FlexE), an extension of FlexX.
          The flexibility of the protein is represented by an ensemble of
          structures, combined into a so-called united protein description.
          It is possible to recombine elements from different ensemble
          structures.
GOLD      Partial protein flexibility, including protein side chains and
          backbone flexibility for up to ten user-defined residues.
ICM       Partial protein flexibility, including protein side chains and
          selected loops.
The main pitfall of the flexible ligand - flexible protein docking using
ISE-dock is the scoring function. The original energy function does not
account for changes in the 3D structure of the protein. Implicit protein
flexibility (collagenase data set) involves combining solutions of docking a
ligand into different protein structures. Explicit handling of changes in the
protein 3D structure during the docking process involves transferring atoms
from the protein to the ligand and exclusion of C

atoms of the flexible
residues from the scoring scheme. The quality of top scored solutions is
heavily biased by the scoring function. As one may see from the results
for the collagenase data set (Section 4.1) and, to a greater extent, from the
AChE and trypsin sets (Sections 4.2 and 4.3), poses that are very close to
those of the crystal structures are always sampled in the final docking sets,
but are not scored well. Rescoring the docking solutions, with or without a
post-docking processing step, may improve the predictive ability of
ISE-dock.
Protein flexibility is an important aspect of a protein-ligand docking
program. Other degrees of freedom that were not accounted for in this work,
but that can be introduced into ISE-dock with relative ease, are modelling
of structurally important water molecules, as well as tautomeric and
protonation states.
Chapter 5
Conclusions
Iterative Stochastic Elimination is a generic optimization algorithm that aims
to solve highly complex combinatorial optimization problems in an efficient
and fast manner. We find that it is able to solve the docking problem, like
many others, in polynomial time. Another advantage of ISE is its ability to
produce arbitrarily large numbers of near-optimal solutions without substantial
penalty in terms of CPU time. ISE was first implemented in our lab in 2000
to solve the problem of positioning polar protons in protein structures[30]
and is under constant development. Since then it has been successfully
applied to side chain positioning[31], structure prediction of cyclic
peptides[87], flexible fragments in protein backbone[86, 76] and others.
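The elimination idea can be illustrated with a toy sketch. It is not the ISE-dock implementation: the elimination rule used here (drop a value that occurs among the worst samples but never among the best ones), the function name `toy_ise` and all default values are simplifying assumptions made for illustration only.

```python
import itertools
import random


def toy_ise(domains, score, sample_size=200, subset=20, threshold=64,
            n_out=40, max_iterations=50, seed=0):
    """Toy elimination loop over discrete variables (illustration only).

    domains: list of lists with the allowed values of each variable.
    score:   function mapping a tuple of values to an energy (lower is better).
    Returns the n_out best remaining combinations and the reduced pool.
    """
    rng = random.Random(seed)
    pool = [list(values) for values in domains]

    def combinations_left():
        total = 1
        for values in pool:
            total *= len(values)
        return total

    for _ in range(max_iterations):
        if combinations_left() <= threshold:
            break
        # Stochastic phase: score a random sample of the remaining pool.
        samples = sorted(
            (tuple(rng.choice(values) for values in pool)
             for _ in range(sample_size)),
            key=score,
        )
        best, worst = samples[:subset], samples[-subset:]
        for i, values in enumerate(pool):
            seen_in_best = {s[i] for s in best}
            seen_in_worst = {s[i] for s in worst}
            # Assumed rule: drop a value that occurs among the worst samples
            # but never among the best ones. Values seen in the best samples
            # are always kept, so a variable can never run out of values.
            pool[i] = [v for v in values
                       if v in seen_in_best or v not in seen_in_worst]

    # Exhaustive phase: enumerate whatever is left (keep the toy problem small).
    solutions = sorted(itertools.product(*pool), key=score)
    return solutions[:n_out], pool
```

Because the final step enumerates the whole reduced pool, the sketch returns a ranked population of solutions rather than a single answer, which mirrors the property of ISE emphasized above.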
ISE-dock is a new docking program based on the Iterative Stochastic
Elimination algorithm. The program's performance in flexible ligand -
rigid protein docking was compared to those of AutoDock, Glide and
GOLD on 81 complexes which are part of a set of complexes previously
chosen to compare docking programs. The ability to handle conformational
changes in the backbone and the side chains of the protein was assessed by
three independent data sets: collagenase (backbone flexibility, 2 structures),
acetylcholinesterase (single side chain flexibility, 2 structures) and trypsin
(flexibility of several side chains, 10 structures).
In flexible ligand - rigid protein docking, ISE-dock performs better than
the three docking programs with these complexes. ISE-dock succeeds in
docking all the 81 complexes with at least one solution of RMSD < 3.0 Å
among the top 20 scored poses (LGA of AutoDock finds 97.5%, Glide finds
90.1% and GOLD finds 87.7%), and with at least one solution of RMSD < 2.0 Å
within the entire docking population (LGA finds 96.3%, no information is
available on Glide and GOLD). PTT of top 20 solutions and all the available
solutions, applied to the results of ISE-dock and to the other algorithms,
shows a clear advantage for ISE-dock.
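The quoted percentages can be related to numbers of complexes with a short calculation. The counts below (79, 73, 71 and 78 out of 81) are not stated in the text; they are inferred here as the integers that reproduce the reported percentages, and the helper `success_rate` is a hypothetical name.

```python
def success_rate(successes, total):
    """Percentage of successfully docked complexes, rounded to one decimal."""
    return round(100.0 * successes / total, 1)


TOTAL = 81
# Counts are inferred from the quoted percentages, not taken from the text.
top20_under_3A = [("ISE-dock", 81), ("LGA", 79), ("Glide", 73), ("GOLD", 71)]
for program, count in top20_under_3A:
    print(program, count, success_rate(count, TOTAL))
```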
The more significant results of the flexible ligand - rigid protein docking
experiments are provided by the ability of ISE to achieve large near-optimal
populations of solutions without a significant additional CPU effort. These
populations improve the coverage of solution space and may be used to
estimate the shape of energy landscapes near minima and to suggest multiple
binding modes, as was demonstrated in two cases (p38 MAP kinase, 1kv1,
and Human Transthyretin, 2rox). The ability to analyze energy landscapes
accessible to ligands in a pocket has thus been shown to be useful. However,
the accuracy of that analysis cannot be fully assessed yet due to the lack
of experimental data. Although such an analysis of very large docking
populations is theoretically possible with other docking programs, to the
best of our knowledge, energy (score) vs RMSD plots of docking solutions,
although known previously, have not been used to visualize and estimate the
energy landscape of a protein-ligand complex.
Accounting for protein flexibility introduces additional degrees of freedom,
but gives a more realistic representation of biological systems. Handling
of protein flexibility was introduced into ISE-dock in a partial way. Even in
this preliminary implementation, ISE-dock was shown to successfully dock
flexible ligands into partially flexible protein structures, which include a few
flexible side chains or a flexible backbone segment. In all the cases, the docking
populations obtained by ISE-dock contained good to excellent solutions.
In the collagenase data set (Section 4.1), flexible ligands were successfully
docked into protein structures with partially flexible loops. The accuracy in
predicting the structure of the backbone is very high, with RMSD of backbone
atoms as low as 0.13 Å from the crystal structure. Although the top ranked
solutions for ligand positions were of high RMSD from the experimental
structure (2.25 Å to 2.49 Å), the docking populations contained high quality
solutions (RMSD of 1.18 Å to 1.76 Å).
Docking experiments with side chain flexibility (AChE, Section 4.2, and
trypsin, Section 4.3) were even more accurate: in the AChE case, the docking
populations contained solutions with RMSD values as low as 0.37 Å, and in the
case of trypsin, the best population contained a solution with RMSD = 0.30 Å.
The experiments presented in this work show that ISE is capable of solving
very complex problems. In addition to molecular flexibility, such problems
may target protonation and tautomerization states of both the protein
and the ligand, explicit simulation of water molecules, etc. The latter task is
of great importance, as it is known (see, for example, [85, 104]) that including
water molecules improves the quality of docking results. In order to equip
ISE-dock with all these important features, one has to overcome two major
obstacles: (1) adaptation of the grid based scoring function to correctly treat
conformational changes in the protein and (2) docking several molecules (or
any independent entities) simultaneously.
Appendix A
Results published in a peer
reviewed journal
Following is the letter from the editor of the journal PROTEINS: Structure,
Function, and Bioinformatics confirming that an article based on
this work has been accepted for publication.
Return-path: <onbehalfof@scholarone.com>
Envelope-to: boris@gorelik.net
...
Message-ID:
<439655644.1187888215280.JavaMail.wladmin@mcv3-wl18>
Date: Thu, 23 Aug 2007 12:56:55 -0400 (EDT)
From: PSFBeditor@jhu.edu
To: boris@gorelik.net
Subject: PROTEINS: Manuscript Prot-00274-2007.R1 Accepted
Cc: amiram@vms.huji.ac.il
Errors-To: proteins@jhu.edu, proteinsadmin@wiley.com
PROTEINS: Structure, Function, and Bioinformatics
23-Aug-2007
Dear Mr. Boris Gorelik:
Your manuscript entitled "High quality binding modes in docking
ligands to proteins" has passed all required peer review and has
been recommended to me by the Editorial Board. I am pleased
to accept the paper for publication in the next available issue of
PROTEINS.
You will receive an e-mail immediately following with instructions
for production of your article. I look forward to seeing it in press.
Congratulations on submitting such an excellent study.
Sincerely,
Eaton E. Lattman
Editor-in-Chief
PROTEINS: Structure, Function, and Bioinformatics
The Johns Hopkins University
Department of Biophysics
Baltimore, MD 21218 U.S.A.
Appendix B
ISE-dock and AutoDock
parameters and their values
B.1 AutoDock parameters and their
default values
Following are the default parameters of AutoDock v3.0.5 with short
descriptions. For more details, see the manual published by the AutoDock
authors.
seed time pid # for random number generator
types CANOSH # atom type names
fld [PROTEIN_NAME].maps.fld # grid data file
map [PROTEIN_NAME].C.map # C-atomic affinity map file
map [PROTEIN_NAME].A.map # A-atomic affinity map file
map [PROTEIN_NAME].N.map # N-atomic affinity map file
map [PROTEIN_NAME].O.map # O-atomic affinity map file
map [PROTEIN_NAME].S.map # S-atomic affinity map file
map [PROTEIN_NAME].H.map # H-atomic affinity map file
map [PROTEIN_NAME].e.map # electrostatics map file
move [LIGAND_NAME].pdbq # small molecule file
about [X],[Y],[Z] # small molecule center
# Initial Translation, Quaternion and Torsions
tran0 random # initial coordinates/A or "random"
quat0 random # initial quaternion or "random"
ndihe 10 # number of initial torsions
dihe0 random # initial torsions
torsdof 0 0.3113 # num. non-Hydrogen torsional DOF & coeff.
# Initial Translation, Quaternion and Torsion Step Sizes
# and Reduction Factors
tstep 2.0 # translation step/A
qstep 50.0 # quaternion step/deg
dstep 50.0 # torsion step/deg
trnrf 1. # trans reduction factor/per cycle
quarf 1. # quat reduction factor/per cycle
dihrf 1. # tors reduction factor/per cycle
# Internal Non-Bonded Parameters
intnbp_r_eps 4.00 0.0222750 12 6 #C-C lj
[LENNARD JONES PARAMETERS FOR EACH PAIR OF ATOM TYPES]
intnbp_r_eps 2.00 0.0029700 12 6 #H-H lj
outlev 1 # diagnostic output level
# Docked Conformation Clustering Parameters for
# "analysis" command
rmstol 1.0 # cluster tolerance (Angstroms)
rmsref [LIGAND_NAME].pdbq # reference structure
# file for RMS calc.
write_all # write all conformations in a cluster
extnrg 1000. # external grid energy
e0max 0. 10000 # max. allowable initial energy,
# max. num. retries
# Genetic Algorithm (GA) and Lamarckian
# Genetic Algorithm (LGA) Parameters
ga_pop_size 50 # number of individuals in population
ga_num_evals 250000 # maximum number of
# energy evaluations
ga_num_generations 27000 # maximum number
#of generations
ga_elitism 1 # num. of top individuals that
# automatically survive
ga_mutation_rate 0.02 # rate of gene mutation
ga_crossover_rate 0.80 # rate of crossover
ga_window_size 10 # num. of generations for
# picking worst individual
ga_cauchy_alpha 0 # ~mean of Cauchy distribution
# for gene mutation
ga_cauchy_beta 1 # ~variance of Cauchy distribution
# for gene mutation
set_ga # set the above parameters for GA or LGA
# Local Search (Solis & Wets) Parameters
# (for LS alone and for LGA)
sw_max_its 300 # number of iterations of
# Solis & Wets local search
sw_max_succ 4 # number of consecutive successes
# before changing rho
sw_max_fail 4 # number of consecutive failures before
# changing rho
sw_rho 1.0 # size of local search space to sample
sw_lb_rho 0.01 # lower bound on rho
ls_search_freq 0.06 # probability of performing local
# search on an indiv.
set_psw1 # set the above pseudo-Solis & Wets parameters
# Perform Dockings
ga_run 10 # do this many GA or LGA runs
# Perform Cluster Analysis
analysis # do cluster analysis on results
B.2 ISE-dock parameters and their
default values
Following are the default parameters of ISE-dock. Parameters that are
common to AutoDock are not listed here.
# ISE docking parameters
ise_sample_size -50 # sample size. negative values mean that
# the size will be the product of current pool depth and
# the absolute value of this parameter
ise_conf_in_h_l -2 # number of conformations in the
# highest- and lowest- energy subsets. negative values
# mean that the size will be the product of current pool
# depth and the absolute value of this parameter
ise_output_size 40 # number of solutions in the final
# docking set
ise_z_value 3.84 # statistical value that determines
# the rigidity of the elimination process
ise_elimination_fraction 0.1 # limit the number of values
# that can be eliminated from any given gene
ise_threshold 1e5 # threshold to switch from the
# stochastic to the exhaustive search
ise_method stochastic # one of the following:
# stochastic exhaustive
ise_pool_file <use_dpf> # if file name is specified,
# read the initial pool from it if <use_dpf>, then
# use the *grid parameters listed below to initialize
# the possibilities pool
ise_t_grid 1.5 # translation grid
ise_r_grid 6 # rotation grid
ise_d_grid 6 # dihedral torsions grid
ise_optimize_solution FALSE # perform local
# optimization on the final docking solution
ise_optimize_on_elimination TRUE # perform local
# optimization during the elimination phase. use the
# value of ls_search_freq parameter for probability
# of performing local search
ise_optimize_on_exhaustive_freq 0.6 # probability
#of local search during the exhaustive phase
set_ise # set the above parameters
# Perform ISE docking
ise_run
# Perform Cluster Analysis
analysis # do cluster analysis on results
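The comments on ise_sample_size and ise_conf_in_h_l state that a negative value is interpreted relative to the current pool depth. A minimal sketch of that convention follows; the function name `effective_size` is hypothetical, the pool depth is treated simply as an integer input, and the value 12 in the usage lines is an arbitrary example rather than a value from this work.

```python
def effective_size(parameter, pool_depth):
    """Resolve a size parameter that may be given relative to the pool depth.

    Assumption: a negative value means |value| * pool_depth, as described
    in the parameter comments; a non-negative value is used as-is.
    """
    if parameter < 0:
        return abs(parameter) * pool_depth
    return parameter


# With the defaults listed above and a hypothetical pool depth of 12:
print(effective_size(-50, 12))  # sample size
print(effective_size(-2, 12))   # conformations in the highest/lowest subsets
print(effective_size(40, 12))   # a positive value is taken literally
```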
Appendix C
Detailed Results
C.1 Flexible ligand - rigid protein docking results
Table C.1
Column groups: top scoring pose (ISE, LGA, Glide, GOLD); best RMSD in top 20 (ISE, LGA, Glide, GOLD); best RMSD in all available poses (ISE, LGA)
CODE ISE LGA Glide GOLD ISE LGA Glide GOLD ISE LGA
13gs 1.86 2.30 2.81 1.52 0.46 0.72 2.69 1.09 0.25 0.58
1a42 1.65 3.30 1.47 5.28 0.47 0.97 1.47 2.26 0.47 0.79
1a4k 1.88 1.91 2.29 2.33 1.50 1.54 1.38 1.81 0.76 1.46
1a8t 2.27 3.51 1.11 4.69 0.86 0.80 1.11 2.07 0.85 0.71
1afq 2.07 2.93 1.12 1.35 1.06 1.01 0.53 1.35 1.06 1.01
1atl 3.21 3.04 2.10 1.55 0.95 1.22 1.46 1.55 0.92 1.04
1azm 2.33 2.81 2.04 2.60 1.97 2.17 1.24 0.66 0.54 1.97
1bnw 3.93 4.21 4.36 4.88 1.03 3.02 1.35 4.30 0.61 1.12
1bqo 0.92 0.61 1.60 1.55 0.72 0.51 1.60 1.35 0.72 0.48
1br6 1.85 1.85 3.51 1.82 1.64 1.83 1.69 0.63 0.44 1.82
1cet 2.05 4.21 3.05 8.52 1.71 1.88 2.80 5.30 0.75 1.81
1cim 1.16 1.16 1.54 1.30 0.66 0.65 1.34 1.03 0.23 0.58
1d3p 1.32 3.91 2.40 4.03 1.03 0.86 1.61 1.57 0.91 0.85
1d4p 0.98 1.56 2.35 2.69 0.74 0.86 0.74 0.99 0.50 0.79
1d6v 2.31 2.50 4.06 4.08 1.79 2.36 2.01 1.68 0.97 2.19
1efy 2.53 4.45 1.95 2.88 1.98 2.03 0.38 0.69 0.52 1.95
1ela 1.15 1.55 0.75 1.25 1.14 0.87 0.75 1.06 1.08 0.87
1etr 1.71 0.66 1.49 2.60 1.19 0.66 1.15 2.18 1.01 0.66
1ett 2.55 4.59 0.92 4.37 0.85 0.72 0.65 1.29 0.85 0.72
1eve 1.52 2.58 1.94 2.39 0.58 0.59 1.15 1.03 0.51 0.52
1exa 0.52 0.46 0.43 0.41 0.36 0.44 0.43 0.41 0.23 0.41
1ezq 2.65 2.19 10.63 2.25 1.68 1.06 4.30 1.10 1.58 1.02
1f0r 1.53 1.66 8.72 3.19 0.80 0.62 1.90 1.23 0.80 0.62
1f0t 1.24 4.84 2.26 2.12 0.84 0.89 1.60 2.06 0.84 0.89
1f4e 3.92 3.92 1.23 1.75 2.46 1.73 1 1.55 0.56 1.36
1fcx 0.58 0.58 0.48 0.74 0.50 0.55 0.48 0.49 0.20 0.53
1fcz 0.57 0.59 0.77 0.91 0.45 0.54 0.52 0.50 0.24 0.49
1fjs 1.49 1.59 5.04 2.12 1.31 0.73 3.44 1.44 1.31 0.73
1fkg 1.07 1.20 1.75 4.18 0.93 0.93 1.67 4.05 0.93 0.93
1fm6 2.84 0.40 0.64 0.68 0.69 0.35 0.64 0.65 0.69 0.35
1fm9 1.72 1.60 1.74 3.38 1.21 0.85 1.74 1.49 1.17 0.85
1g4o 3.70 3.99 2.15 4.59 2.21 2.92 1.62 0.81 0.58 2.44
1h1p 4.08 3.72 0.65 1.21 1.35 1.35 0.65 0.52 0.38 1.31
1h1s 0.80 0.62 0.97 1.16 0.61 0.42 0.97 1.16 0.58 0.36
1h9u 0.59 0.53 0.82 1.12 0.33 0.47 0.48 1.03 0.33 0.35
1hdq 1.07 1.88 2.16 3.67 0.55 0.84 0.62 0.84 0.37 0.77
1hfc 1.55 4.47 2.37 2.34 1.40 0.98 1 0.61 1.34 0.98
1hpv 1.11 1.73 1.20 9.47 1.01 0.88 1.19 1.38 1.01 0.88
1htf 2.55 1.64 10.12 10.19 1.53 0.59 1.99 3.13 1.49 0.59
1i7z 0.87 1.02 0.60 0.86 0.45 0.82 0.44 0.82 0.45 0.38
1i8z 0.72 1.92 3.82 3.66 0.55 0.74 2.55 2.69 0.39 0.63
1if7 3.65 4.40 1.43 5.42 1.64 3.65 1.34 1.65 0.87 2.74
1iy7 0.96 1.04 1.16 0.91 0.75 0.99 0.99 0.59 0.75 0.77
1jsv 0.88 1.25 5.45 6.94 0.74 0.71 3.40 5.36 0.69 0.71
1k1j 4.11 1.47 5.88 6.54 1.59 1.23 4.48 3.24 1.57 1.23
1k22 1.69 0.55 0.74 1.03 1.06 0.42 0.74 0.72 1.06 0.41
1k7e 0.88 0.74 0.72 0.96 0.56 0.53 0.68 0.53 0.21 0.31
1k7f 0.79 0.77 2.02 0.84 0.69 0.68 0.51 0.76 0.69 0.66
1kv1 1.21 1.21 0.66 0.81 0.70 1.14 0.59 0.56 0.27 0.66
1kv2 0.73 0.78 1.63 0.80 0.58 0.69 0.91 0.74 0.52 0.63
1l8g 1.33 1.60 2.90 2.17 0.74 1.50 1.57 2.17 0.70 1.16
1lqd 0.89 0.39 1.93 0.65 0.74 0.31 1.93 0.45 0.74 0.31
1m48 1.89 1.12 0.68 1.64 1.10 0.55 0.68 1.12 1.10 0.55
1mmb 2.11 2.12 3.18 6.11 1.79 1.32 1.16 1.37 1.64 1.32
1mnc 3.96 0.69 0.36 1.95 1.53 0.60 0.36 1.38 1.21 0.60
1nhu 3.38 3.51 6.07 5.17 1.02 1.07 3.16 3.75 0.69 1.07
1nhv 3.26 4.68 6.57 8.95 1.35 1.76 5.96 4.45 1.04 1.76
1o86 3.46 1.25 1.06 1.85 1.80 1.25 0.97 0.99 1.54 1.25
1ppc 1.60 1.59 1.69 1.76 1.37 1.20 1.62 1.76 1.30 1.20
1pph 3.39 2.38 5.09 4.95 1.36 1.42 1.09 0.88 1.02 1.42
1qbu 0.97 0.72 10.36 2.59 0.86 0.66 10.36 2.59 0.86 0.66
1qhi 0.66 0.69 0.30 0.66 0.51 0.58 0.30 0.41 0.31 0.55
1qpe 0.63 0.67 1.50 0.52 0.44 0.47 0.52 0.34 0.25 0.45
1r09 5.99 5.95 0.82 1.81 1.85 1.50 0.82 0.53 0.49 1.21
1thl 2.88 2.12 8.54 10.08 1.72 1.15 1.78 2.12 1.11 1.15
1uvt 0.85 0.60 0.44 1.47 0.66 0.49 0.44 0.54 0.66 0.49
1ydr 1.51 0.65 1.56 2.52 0.53 0.62 0.67 2.52 0.32 0.57
1yds 0.69 0.66 0.50 0.55 0.54 0.60 0.50 0.55 0.49 0.60
2cgr 0.79 0.85 0.85 6.54 0.62 0.73 0.67 6.35 0.62 0.66
2pcp 1 0.99 0.64 3.89 0.30 0.96 0.62 1.08 0.30 0.95
2qwi 0.56 0.71 0.70 1.30 0.37 0.60 0.70 0.96 0.37 0.51
3cpa 0.84 0.85 0.79 0.73 0.69 0.62 0.53 0.60 0.69 0.61
3erk 0.59 0.72 0.44 1.42 0.25 0.64 0.44 0.63 0.21 0.64
3ert 1.14 1.44 4.66 4.74 0.88 1.03 2.48 2.39 0.88 0.90
3std 0.60 0.56 2.44 0.85 0.40 0.48 2.44 0.85 0.39 0.35
3tmn 0.66 3.09 8.07 7.59 0.54 0.58 3.18 3.90 0.48 0.58
4dfr 1.10 1.01 1.27 1.20 0.74 0.81 1.10 1.18 0.72 0.81
5std 0.52 0.47 0.73 0.86 0.34 0.42 0.73 0.58 0.28 0.40
5tln 1.73 3.82 9.67 6.52 1.11 0.88 1.20 1.01 1.11 0.88
7est 0.84 0.79 1.02 3.76 0.75 0.63 0.82 0.87 0.75 0.63
966c 1.05 0.70 2.44 2.42 0.81 0.55 2.21 2.34 0.81 0.55
Table C.1: Detailed docking results of the flexible ligand - rigid protein
data set. RMSD [Å]
C.2 Flexible ligand - rigid protein docking energy landscapes
Following are the energy vs RMSD plots for ISE-dock and AutoDock of
all the 81 complexes in the flexible ligand - rigid protein docking set. The
graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.1: Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex. Continued on the following figures.
Figure C.2: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.3: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.4: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.5: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.6: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.7: Continued from the previous figure. Energy vs RMSD plots for ISE-dock
(red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking
set. The graphs are sorted alphabetically according to the PDB code of the complex.
Appendix D
Flexible ligand - flexible protein
docking. Trypsin data set
Table D.1
Ligand only atoms All movable atoms
ligand protein top1 top20 top4096 top1 top20 top4096
1ppc 1ppc 1.72 0.87 0.87 1.84 1.26 1.23
1ppc 1pph 3.38 2.48 1.42 2.8 2.17 1.6
1ppc 1tng 2.84 1.44 1.27 2.59 1.48 1.48
1ppc 1tnh 3.02 1.91 1.59 2.56 1.83 1.6
1ppc 1tni 2.59 1.53 1.08 2.3 1.53 1.31
1ppc 1tnj 1.99 1.3 1.25 2.21 1.73 1.64
1ppc 1tnk 2.73 1.13 1.02 2.48 1.41 1.4
1ppc 1tnl 3.05 1.34 1.34 2.85 1.74 1.65
1ppc 1tpp 3.44 1.6 1.33 3.02 2.01 1.69
1ppc 3ptb 2.49 1.69 1.45 2.4 1.94 1.84
1pph 1ppc 3.86 2.38 2.04 3.86 2.38 2.04
1pph 1pph 4.66 2.14 1.7 4.66 2.14 1.7
1pph 1tng 4.56 1.97 1.69 4.56 1.97 1.69
1pph 1tnh 4.31 2.21 1.74 4.31 2.21 1.74
1pph 1tni 4.5 1.99 1.57 4.5 1.99 1.57
1pph 1tnj 3.88 2.01 1.59 3.88 2.01 1.59
1pph 1tnk 4.27 2.7 1.79 4.27 2.7 1.79
1pph 1tnl 2.77 1.96 1.42 2.77 1.96 1.42
1pph 1tpp 4.46 2.51 1.9 4.46 2.51 1.9
1pph 3ptb 3.57 2.22 1.76 3.57 2.22 1.76
1tng 1ppc 0.97 0.64 0.43 1.85 1.55 1.03
1tng 1pph 1.12 0.99 0.96 1.92 1.75 1.6
1tng 1tng 0.53 0.43 0.28 1.58 1.27 0.89
1tng 1tnh 0.64 0.54 0.4 1.69 1.32 1.09
1tng 1tni 0.99 0.63 0.63 1.9 1.47 1.25
1tng 1tnj 0.77 0.54 0.42 2.07 1.4 1.01
1tng 1tnk 0.9 0.59 0.55 1.91 1.29 1.08
1tng 1tnl 0.62 0.5 0.38 1.28 1.03 0.85
1tng 1tpp 1.04 0.61 0.53 2.34 1.75 1.58
1tng 3ptb 1 0.78 0.66 2.25 2.04 1.86
1tnh 1ppc 3.36 1.56 1.3 3.36 1.56 1.3
1tnh 1pph 4.36 1.5 1.08 4.36 1.5 1.08
1tnh 1tng 2.82 1.36 1.15 2.82 1.36 1.15
1tnh 1tnh 3.39 1.4 1.18 3.39 1.4 1.18
1tnh 1tni 2.56 1.41 1.17 2.56 1.41 1.17
1tnh 1tnj 2.08 1.31 1.11 2.08 1.31 1.11
1tnh 1tnk 3.51 1.43 1.09 3.51 1.43 1.09
1tnh 1tnl 2.07 1.41 1.23 2.07 1.41 1.23
1tnh 1tpp 2.12 1.45 1.21 2.12 1.45 1.21
1tnh 3ptb 2.78 1.82 1.27 2.78 1.82 1.27
1tni 1ppc 2.95 2 1.29 2.95 2 1.29
1tni 1pph 3.09 1.85 1.39 3.09 1.85 1.39
1tni 1tng 2.83 1.6 1.19 2.83 1.6 1.19
1tni 1tnh 2.32 1.68 1.35 2.32 1.68 1.35
1tni 1tni 2.53 1.71 1.3 2.53 1.71 1.3
1tni 1tnj 2.31 1.81 1.2 2.31 1.81 1.2
1tni 1tnk 4.12 1.64 1.2 4.12 1.64 1.2
1tni 1tnl 3.81 1.5 1.14 3.81 1.5 1.14
1tni 1tpp 4.12 1.89 1.5 4.12 1.89 1.5
1tni 3ptb 2.66 1.85 1.22 2.66 1.97 1.22
1tnj 1ppc 3.99 2.17 1.27 3.99 2.17 1.27
1tnj 1pph 3.24 1.84 1.35 3.24 1.84 1.35
1tnj 1tng 3.85 1.48 1.15 3.85 1.49 1.15
1tnj 1tnh 2.5 1.49 1.19 2.5 1.49 1.19
1tnj 1tni 3.25 1.68 1.29 3.25 1.68 1.29
1tnj 1tnj 2.34 1.44 1.14 2.34 1.44 1.14
1tnj 1tnk 2.77 1.43 1.07 2.77 1.43 1.07
1tnj 1tnl 2.46 1.49 1.04 2.46 1.49 1.04
1tnj 1tpp 2.2 1.72 1.3 2.2 1.72 1.3
1tnj 3ptb 3.53 1.67 1.22 3.53 1.67 1.22
1tnk 1ppc 4.74 1.95 1.62 4.74 1.95 1.62
1tnk 1pph 4.28 1.79 1.42 4.28 1.79 1.42
1tnk 1tng 2.9 1.66 1.39 2.9 1.66 1.39
1tnk 1tnh 3.62 1.56 1.17 3.62 1.56 1.17
1tnk 1tni 2.66 1.61 1.42 2.66 1.61 1.42
1tnk 1tnj 2.6 1.59 1.28 2.6 1.59 1.28
1tnk 1tnk 3.52 1.41 1.25 3.52 1.41 1.25
1tnk 1tnl 4.44 1.73 1.43 4.44 1.73 1.43
1tnk 1tpp 3.67 1.75 1.45 3.67 1.75 1.45
1tnk 3ptb 3.09 1.65 1.37 3.09 1.65 1.37
1tnl 1ppc 4.46 1.86 1.12 4.46 1.86 1.12
1tnl 1pph 2.1 1.5 1.21 2.1 1.5 1.21
1tnl 1tng 3.05 1.34 1.14 3.05 1.34 1.14
1tnl 1tnh 2.56 1.33 1.18 2.56 1.33 1.18
1tnl 1tni 2.67 1.34 1.07 2.67 1.34 1.07
1tnl 1tnj 1.78 1.33 1.23 1.78 1.33 1.23
1tnl 1tnk 2.72 1.38 1.1 2.72 1.38 1.1
1tnl 1tnl 3.29 1.3 1.03 3.29 1.3 1.03
1tnl 1tpp 2.06 1.37 1.37 2.06 1.37 1.37
1tnl 3ptb 3.39 1.38 1.21 3.39 1.38 1.21
1tpp 1ppc 4.85 3.09 2.06 4.84 3.09 2.06
1tpp 1pph 5.58 2.73 1.94 5.58 2.73 1.94
1tpp 1tng 5.15 4.5 2.2 5.15 4.5 2.2
1tpp 1tnh 4.56 3.44 1.77 4.56 3.44 1.77
1tpp 1tni 4.61 4.39 2.49 4.61 4.39 2.49
1tpp 1tnj 4.5 3.54 1.96 4.5 3.54 1.96
1tpp 1tnk 5.53 4.06 1.99 5.53 4.06 1.99
1tpp 1tnl 5.19 4.11 2.74 5.19 4.11 2.74
1tpp 1tpp 4.23 2.61 1.78 4.23 2.61 1.78
1tpp 3ptb 5.97 3.99 2.82 5.97 3.99 2.82
3ptb 1ppc 3.04 1.75 0.98 3.04 1.75 0.98
3ptb 1pph 3.28 1.61 1.11 3.28 1.61 1.11
3ptb 1tng 3.16 2.59 0.97 3.16 2.59 0.97
3ptb 1tnh 3.7 2.3 1.18 3.7 2.3 1.18
3ptb 1tni 3.17 2.35 0.93 3.17 2.35 0.93
3ptb 1tnj 2.7 2.32 1.37 2.7 2.32 1.37
3ptb 1tnk 3.49 1.95 1.2 3.49 1.95 1.2
3ptb 1tnl 3.09 2.38 1.09 3.09 2.38 1.09
3ptb 1tpp 3.14 1.94 0.77 3.14 1.94 0.77
3ptb 3ptb 2.8 2.22 1.24 2.8 2.22 1.24
Table D.1: RMSD [Å] of all movable atoms and of ligand atoms only in the trypsin data set, 100 cross-docking experiments
List of Figures
1.1 Schematic diagram of the main methods in the drug discovery process. Arrows designate process flow. Black asterisks mark steps that may involve molecular docking. Abbreviations: SAR, structure-activity relationship; QSAR, quantitative SAR; ADME-Tox, absorption, distribution, elimination, toxicity.
1.2 Typical shapes of electrostatic interaction energy. The energies of two identical (full line) and two opposite (dashed line) charges in vacuum are shown.
1.3 Examples of inter- (left) and intra- (right) molecular H-bonds.
1.4 Van der Waals interaction energy of the argon dimer. Taken from Wikipedia [113] under the GNU Free Documentation License.
1.5 Comparison of Morse (dashed line) and Hooke's harmonic (full line) potentials of bond stretching energy around the minimum. To construct this graph, all the parameters in equations (1.15) and (1.16) were assigned the value of 1.
2.1 Tearing off atoms to represent side chain flexibility, using phenylalanine as an example. Dummy atoms are marked by the letter D in their names. The N, Cα and Cβ atoms on the receptor molecule overlap with their respective dummy counterparts.
2.2 Structural alignment of 456c and 966c. Backbone traces of the proteins are color coded according to the distance (in Å) between the aligned backbone atoms. RS-130830 (red) and RS-104966 (green) are shown as stick models.
2.3 Cross section of AChE complexed with acetylcholine (PDB code: 2ace), colored by (A) partial charge of the atoms and (B) residue type (colored by PyMol): hydrophobic (GILMPV), white; aromatic (FWY), magenta; semipolar (C), yellow; polar (HNQST), cyan; positive (KR), blue; negative (DE), red. Acetylcholine is colored blue in both panels.
2.4 AChE complexed with Huperzine A (PDB code: 1vot, light gray) and with Aricept (PDB code: 1eve, dark gray). The ligands and the Phe 330 side chains from both complexes are highlighted using sticks.
2.5 Trypsin data set. 10 superimposed trypsin structures: 1ppc, 1pph, 1tng, 1tnh, 1tni, 1tnj, 1tnk, 1tnl, 1tpp and 3ptb. The ligand molecules and the residues that are treated as flexible are shown as sticks. The remaining parts of the proteins are shown as backbone trace.
3.1 Top single docking poses at different RMSD bins with respect to crystal structures, 4 different programs. Results for Glide and GOLD were obtained by Perola et al. [84].
3.2 Top 20 docking poses, RMSD to corresponding crystal structures. Results for Glide and GOLD were obtained by Perola et al. [84].
3.3 Top available docking poses produced in equal CPU times, RMSD to corresponding crystal structures. The numbers of poses are 4096 (ISE) and 35 (LGA).
3.4 Number of iterations before switching to exhaustive search as a function of initial combinatorial size (number of initial combinations).
3.5 A: Energy vs RMSD plot for docking populations of the complex 1yds obtained with ISE, showing a single distinct funnel. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The first 35 solutions (dark lines) docked by ISE vs the ligand in the crystal (gray sticks). Receptor residues with at least one atom within 5.5 Å of the ligand are shown as light gray cartoon. All structures in this work were visualized using PyMol [15].
3.6 A: Energy vs RMSD plot for docking populations of the complex 1bqo obtained with ISE, showing two distinct funnels. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The crystal structure of the ligand (gray sticks) and the first 35 solutions (dark lines) docked by ISE.
3.7 A: Energy vs RMSD plot for docking populations of the complex 1hpv obtained with ISE, showing a scatter of the results. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The crystal structure of the ligand and the first 35 solutions docked by ISE.
3.8 Cumulative fractions (Y-axis) of 81 ISE docking complexes with an energy span between the global minimum of each (pose number 1) and the other 4095 poses, below the given threshold (X-axis).
3.9 Complexes 1kv1 (light gray) and 1kv2 (dark gray) superimposed using backbone atoms. The ligands are shown as sticks, and the backbone of the residues closest to the ligand (within 5.5 Å) is shown as PyMol cartoons.
3.10 Energy vs RMSD plot for docking populations obtained by ISE (A) and LGA (B) of the complex 1kv1. The plots are shown using the same scale. The best single ISE solutions at each of the three funnels have ranks 1, 222 and 270 and are marked with arrows.
3.11 The best ISE-dock solution for 1kv1 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). 1kv1 is colored according to: C, cyan; N, blue; Cl, green. 1kv2 is colored according to: C, yellow; N, blue; O, red.
3.12 ISE-dock solution for 1kv1, ranked 222 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). The coloring scheme is identical to that of Figure 3.11.
3.13 ISE-dock solution for 1kv1, ranked 270 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). The coloring scheme is identical to that of Figure 3.11.
3.14 Energy vs RMSD plot for docking populations of the complex 2rox, obtained by ISE (A) and LGA (B). The best single ISE solutions at each of the two funnels have ranks 1 and 2 and are marked with arrows. C: Antiparallel docking solutions ranked 1 and 2 for 2rox (green and magenta sticks, respectively). The carbons in the crystal structure of thyroxine are shown as thin sticks colored cyan. The backbone of the residues closest to the ligand (within 5.5 Å) is shown in PyMol cartoon representation, colored cyan.
4.1 The best available docking solution for (A) 1eve-1vot and (B) 1vot-1eve in unbound (cross-) docking experiments. The docking solutions for all the movable atoms are shown as lines and the crystal structures are shown as sticks. The protein structures are shown as backbone trace.
4.2 The best available docking solution for (A) 1eve-1eve and (B) 1vot-1vot in bound docking experiments. The docking solutions for all the movable atoms are shown as lines and the crystal structures are shown as sticks. The protein structures are shown as backbone trace.
4.3 Top docking poses at different RMSD bins with respect to crystal structures.
C.1 Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex. Continued on the following figures.
C.2 Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
C.3 Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
C.4 Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
C.5 Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
C.6 Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
C.7 Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
List of Tables
2.1 PDB codes of the 81 complexes in the rigid protein test set.
2.2 Affinities to collagenase.
3.1 Summary of docking results by ISE, LGA, Glide and GOLD.
3.2 Binding modes of 1-(5-tert-butyl-2-methyl-2h-pyrazol-3-yl)-3-(4-chloro-phenyl)-urea (from 1kv1).
4.1 Collagenase data set, best ligand RMSD (Å) in top 1, top 20 and all available (4096) solutions. RMSD of the backbone from the crystal position of the corresponding solution is also reported.
4.2 Results of acetylcholinesterase cross docking.
4.3 Torsion RMSD of flexible residues in the trypsin data set.
4.4 Trypsin data set, RMSD values of top single docking poses and best docking poses in top 20 and top 4096 solutions.
4.5 Current status of protein flexibility handling in ISE-dock and in five popular docking programs (sorted according to the number of citations in 2005 [95]).
C.1 Detailed docking results of the flexible ligand - rigid protein data set. RMSD [Å].
D.1 RMSD [Å] of all movable atoms and of ligand atoms only in the trypsin data set, 100 cross-docking experiments.
Acknowledgments
First of all, I thank Prof. Amiram Goldblum, my supervisor, for the unlimited freedom and trust and for his guidance and support.
This research was supported by the Israel Science Foundation (ISF) grant no. 608/02. I thank the Alex Grass Center for Drug Design and Synthesis for further support, Dr. Emmanuele Perola for sending his data as well as for making useful suggestions, Dr. Anwar Rayan for helpful discussions and Mrs. Efrat Noy for her ideas and for helping with the programming. Dr. Garrett M. Morris was instrumental in solving our problems with AutoDock usage and in providing suggestions for improving LGA results.
This work would not have been possible without the support of my wife,
Einat, who released me from all my domestic duties and supported me during
the preparation of this work.
Bibliography
[1] R Abagyan and M Totrov. High-throughput docking for lead generation. Curr Opin Chem Biol, 5(4):375-82, 2001.
[2] R Abagyan, M Totrov, and D Kuznetsov. ICM - a new method for protein modeling and design: applications to docking and structure prediction from the distorted native conformations. J Comp Chem, 15(5):488-506, 1994.
[3] LM Amzel. Calculation of entropy changes in biological processes: folding, binding, and oligomerization. Methods Enzymol, 323:167-177, 2000.
[4] AC Anderson, RH O'Neil, TS Surti, and RM Stroud. Approaches to solving the rigid receptor problem by identifying a minimal set of flexible residues during ligand docking. Chem Biol, 8(5):445-457, May 2001.
[5] FC Bernstein, TF Koetzle, GJB Williams, EF Meyer, MD Brice, JR Rodgers, O Kennard, T Shimanouchi, and M Tasumi. Protein data bank - computer-based archival file for macromolecular structures. Arch Biochem Biophys, 185(2):584-591, 1978.
[6] A Bialonska and Z Ciunik. Hydrophobic lock and key recognition of n-4-nitrobenzoylamino acid by strychnine. Acta Crystallogr B Struct Sci, 62:1061-1070, 2006.
[7] W Cai, X Shao, and B Maigret. Protein-ligand recognition using spherical harmonic molecular surfaces: towards a fast and efficient filter for large virtual throughput screening. J Mol Graph Model, 20(4):313-328, Jan 2002.
[8] CJ Camacho, DW Gatchell, SR Kimura, and S Vajda. Scoring docked conformations generated by rigid-body protein-protein docking. Proteins, 40(3):525-537, Aug 2000.
[9] MD Cameron, B Wen, KE Allen, AG Roberts, JT Schuman, AP Campbell, KL Kunze, and SD Nelson. Cooperative binding of midazolam with testosterone and alpha-naphthoflavone within the CYP3A4 active site: a NMR T1 paramagnetic relaxation study. Biochemistry, 44(43):14143-14151, Nov 2005.
[10] HA Carlson. Protein flexibility and drug design: how to hit a moving target. Curr Opin Chem Biol, 6(4):447-452, Aug 2002.
[11] C Catana and PFW Stouten. Novel, customizable scoring functions, parameterized using n-pls, for structure-based drug discovery. J Chem Inf Model, 47(1):85-91, 2007.
[12] H Claussen, C Buning, M Rarey, and T Lengauer. Flexe: efficient molecular docking considering protein structure variations. J Mol Biol, 308(2):377-395, 2001.
[13] JC Cole, CW Murray, JW Nissink, RD Taylor, and R Taylor. Comparing protein-ligand docking programs is difficult. Proteins, 60(3):325-332, Aug 2005.
[14] WD Cornell, P Cieplak, CI Bayly, IR Gould, KM Merz, DM Ferguson, DC Spellmeyer, T Fox, JW Caldwell, and PA Kollman. Second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J Am Chem Soc, 117:5179-5197, 1995.
[15] WL DeLano. The PyMol molecular graphics system. DeLano Scientific LLC, San Carlos, CA, USA.
[16] KA Dill and HS Chan. From Levinthal to pathways to funnels. Nat Struct Biol, 4(1):10-19, Jan 1997.
[17] OA Donini and PA Kollman. Calculation and prediction of binding free energies for the matrix metalloproteinases. J Med Chem, 43(22):4180-4188, Nov 2000.
[18] M Ekroos and T Sjogren. Structural basis for ligand promiscuity in cytochrome P450 3A4. Proc Natl Acad Sci U S A, 103(37):13682-13687, Sep 2006.
[19] AM Ferrari, BQ Wei, L Costantino, and BK Shoichet. Soft docking and multiple receptor conformations in virtual screening. J Med Chem, 47(21):5076-5084, Oct 2004.
[20] D Fischer, SL Lin, HL Wolfson, and R Nussinov. A geometry-based suite of molecular docking processes. J Mol Biol, 248(2):459-477, Apr 1995.
[21] E Fischer. Einfluss der Configuration auf die Wirkung der Enzyme. Ber Dt Chem Ges, 27:2985-2993, 1894.
[22] E Freire. The propagation of binding interactions to remote sites in proteins: Analysis of the binding of the monoclonal antibody d1.3 to lysozyme. Proc Natl Acad Sci U S A, 96(18):10118-10122, 1999.
[23] RA Friesner, JL Banks, RB Murphy, TA Halgren, JJ Klicic, DT Mainz, MP Repasky, EH Knoll, M Shelley, JK Perry, DE Shaw, P Francis, and PS Shenkin. Glide: a new approach for rapid, accurate docking and scoring. 1. method and assessment of docking accuracy. J Med Chem, 47(7):1739-1749, March 2004.
[24] RA Friesner, RB Murphy, MP Repasky, LL Frye, JR Greenwood, TA Halgren, PC Sanschagrin, and DT Mainz. Extra precision Glide: docking and scoring incorporating a model of hydrophobic enclosure for protein-ligand complexes. J Med Chem, 49(21):6177-6196, Oct 2006.
[25] HA Gabb, RM Jackson, and MJ Sternberg. Modelling protein docking using shape complementarity, electrostatics and biochemical information. J Mol Biol, 272(1):106-120, Sep 1997.
[26] P Gadakar, S Phukan, P Dattatreya, and V Balaji. Pose prediction accuracy in docking studies and enrichment of actives in the active site of gsk-3beta. J Chem Inf Model, Jun 2007.
[27] L Gales, S Macedo-Ribeiro, G Arsequell, G Valencia, MJ Saraiva, and AM Damas. Human transthyretin in complex with iododiflunisal: structural features associated with a potent amyloid inhibitor. Biochem J, 388(2):615-621, Jun 2005.
[28] J Gasteiger and M Marsili. Iterative partial equalization of orbital electronegativity - a rapid access to atomic charges. Tetrahedron, 36(22):3219-3228, 1980.
[29] F Glaser, DM Steinberg, IA Vakser, and N Ben-Tal. Residue frequencies and pairing preferences at protein-protein interfaces. Proteins, 43(2):89-102, May 2001.
[30] M Glick and A Goldblum. A novel energy-based stochastic method for positioning polar protons in protein structures from x-rays. Proteins, 38(3):273-287, Feb 2000.
[31] M Glick, A Rayan, and A Goldblum. A stochastic algorithm for global optimization and for best populations: a test case of side chains in proteins. Proc Natl Acad Sci U S A, 99(2):703-708, Jan 2002.
[32] DS Goodsell, GM Morris, and AJ Olson. Automated docking of flexible ligands: applications of autodock. J Mol Recognit, 9(1):1-5, Jan-Feb 1996.
[33] DS Goodsell and AJ Olson. Automated docking of substrates to proteins by simulated annealing. Proteins, 8(3):195-202, 1990.
[34] I Halperin, BY Ma, H Wolfson, and R Nussinov. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins, 47(4):409-443, 2002.
[35] JA Hamilton and MD Benson. Transthyretin: a review from a structural perspective. Cell Mol Life Sci, 58(10):1491-1521, Sep 2001.
[36] JA Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., 1975.
[37] D Herschlag. The role of induced fit and conformational changes of enzymes in specificity and catalysis. Bioorg Chem, 16(1):62-96, 1988.
[38] TL Hill. Steric effects. I. Van der Waals potential energy curves. J Chem Phys, 16:399, 1948.
[39] X Hu and WH Shelver. Docking studies of matrix metalloproteinase inhibitors: zinc parameter optimization to improve the binding free energy prediction. J Mol Graph Model, 22(2):115-126, Nov 2003.
[40] MN James, A Sielecki, F Salituro, DH Rich, and T Hofmann. Conformational flexibility in the active sites of aspartyl proteinases revealed by a pepstatin fragment binding to penicillopepsin. Proc Natl Acad Sci U S A, 79(20):6137-6141, Oct 1982.
[41] J Janin and C Chothia. The structure of protein-protein recognition sites. J Biol Chem, 265(27):16027-16030, Sep 1990.
[42] G Jones, P Willett, RC Glen, AR Leach, and R Taylor. Development and validation of a genetic algorithm for flexible docking. J Mol Biol, 267(3):727-48, 1997.
[43] A Kahraman, RJ Morris, RA Laskowski, and JM Thornton. Shape variation in protein binding pockets and their ligands. J Mol Biol, 368(1):283-301, Apr 2007.
[44] P Kallblad, RL Mancera, and NP Todorov. Assessment of multiple binding modes in ligand-protein docking. J Med Chem, 47(13):3334-3337, Jun 2004.
[45] CD Kirkpatrick. Optimization by simulated annealing. Science, 220:671-680, 1983.
[46] RM Knegtel, ID Kuntz, and CM Oshiro. Molecular docking to ensembles of protein structures. J Mol Biol, 266(2):424-440, Feb 1997.
[47] RMA Knegtel, DM Bayada, RA Engh, W von der Saal, VJ van Geerestein, and PDJ Grootenhuis. Comparison of two implementations of the incremental construction algorithm in flexible docking of thrombin inhibitors. Angew Chem Int Ed, 13(2):167-183, 1999.
[48] DE Koshland. Application of a theory of enzyme specificity to protein synthesis. Proc Natl Acad Sci U S A, 44(2):98-104, February 1958.
[49] B Kramer, M Rarey, and T Lengauer. Evaluation of the FLEXX incremental construction algorithm for protein-ligand docking. Proteins, 37(2):228-241, Nov 1999.
[50] RT Kroemer, A Vulpetti, JJ McDonald, DC Rohrer, JY Trosset, F Giordanetto, S Cotesta, C McMartin, M Kihlen, and PFW Stouten. Assessment of docking poses: interactions-based accuracy classification (IBAC) versus crystal structure deviations. J Chem Inf Comput Sci, 44(3):871-881, 2004.
[51] M Kumar and MV Hosur. Adaptability and flexibility of HIV-1 protease. Eur J Biochem, 270(6):1231-1239, 2003.
[52] S Kumar, B Ma, CJ Tsai, N Sinha, and R Nussinov. Folding and binding cascades: dynamic landscapes and population shifts. Protein Sci, 9(1):10-19, Jan 2000.
[53] ID Kuntz, JM Blaney, SJ Oatley, R Langridge, and TE Ferrin. A geometric approach to macromolecule-ligand interactions. J Mol Biol, 161(2):269-88, Oct 25 1982.
[54] AR Leach. Molecular Modelling. Principles and Applications, chapter Empirical Force Field Models: Molecular Mechanics, pages 165-252. Prentice Hall, 2001.
[55] BM Lee, J Xu, BK Clarkson, MA Martinez-Yamout, HJ Dyson, DA Case, JM Gottesfeld, and PE Wright. Induced fit and lock and key recognition of 5S RNA by zinc fingers of transcription factor IIIA. J Mol Biol, 357(1):275-291, 2006.
[56] PE Leopold, M Montal, and JN Onuchic. Protein folding funnels: A kinetic approach to the sequence-structure relationship. Proc Natl Acad Sci U S A, 89(18):8721-8725, September 1992.
[57] PJ Lewis, M de Jonge, F Daeyaert, L Koymans, M Vinkers, J Heeres, PAJ Janssen, E Arnold, K Das, AD Clark, SH Hughes, PL Boyer, M Bethune, R Pauwels, K Andries, M Kukla, and D Ludovici. On the detection of multiple-binding modes of ligands to proteins, from biological, structural, and modeling data. J Comput Aided Mol Des, 17(2-4):129-134, 2003.
[58] JH Lii and NL Allinger. Directional hydrogen bonding in the MM3 force field. J Comp Chem, 19(9):1001-1016, 1998.
[59] B Lovejoy, AR Welch, S Carr, C Luong, C Broka, T Hendricks, et al. Crystal structures of MMP-1 and -13 reveal the structural basis for selectivity of collagenase inhibitors. Nat Struct Biol, 6(3):217-221, 1999.
[60] H Lu, J Macosko, D Habel-Rodriguez, RW Keller, JA Brozik, and DJ Keller. Closing of the fingers domain generates motor forces in the hiv reverse transcriptase. J Biol Chem, 279(52):54529-54532, Dec 2004.
[61] BH Luo, TA Springer, and J Takagi. High affinity ligand binding by integrins does not involve head separation. J Biol Chem, 278(19):17185-17189, May 2003.
[62] B Ma, S Kumar, CJ Tsai, and R Nussinov. Folding funnels and binding mechanisms. Protein Eng, 12(9):713-720, Sep 1999.
[63] B Ma, M Shatsky, HJ Wolfson, and R Nussinov. Multiple diverse ligands binding at a single protein site: A matter of pre-existing populations. Prot Sci, 11(2):184-197, 2002.
[64] AD Mackerell. Empirical force fields for biological macromolecules: overview and issues. J Comput Chem, 25(13):1584-1604, Oct 2004.
[65] AD Mackerell, D Bashford, M Bellott, RL Dunbrack, JD Evanseck, MJ Field, S Fischer, J Gao, H Guo, S Ha, D Joseph-Mccarthy, L Kuchnir, K Kuczera, FTK Lau, C Mattos, S Michnick, T Ngo, DT Nguyen, B Prodhom, WE Reiher, B Roux, M Schlenkrich, JC Smith, R Stote, J Straub, M Watanabe, J Wiorkiewicz-Kuczera, D Yin, and M Karplus. All-atom empirical potential for molecular modeling and dynamics studies of proteins. J Phys Chem B, 102(18):3586-3616, April 1998.
[66] TG Marshall, RE Lee, and FE Marshall. Common angiotensin receptor blockers may directly modulate the immune system via VDR, PPAR and CCR2b. Theor Biol Med Model, 3:1, 2006.
[67] BW Matthews. Protein structure initiative: getting into gear. Nat Struct Mol Biol, 14(6):459-460, Jun 2007.
[68] C McMartin and RS Bohacek. QXP: powerful, rapid computer algorithms for structure-based drug design. J Comput Aided Mol Des, 11(4):333-344, Jul 1997.
[69] S Miyazawa and RL Jernigan. A new substitution matrix for protein sequence searches based on contact frequencies in protein structures. Protein Eng, 6(3):267-278, Apr 1993.
[70] GM Morris, DS Goodsell, RS Halliday, R Huey, WE Hart, RK Belew, and AJ Olson. Automated docking using a lamarckian genetic algorithm and an empirical binding free energy function. J Comp Chem, 19(14):1639-1662, 1998.
[71] GM Morris, DS Goodsell, R Huey, and AJ Olson. Distributed automated docking of flexible ligands to proteins: Parallel applications of autodock 2.4. J Comput Aid Mol Des, 10(4):293-304, 1996.
[72] A Murcko and MA Murcko. Computational methods to predict binding free energy in ligand-receptor complexes. J Med Chem, 38(26):4953-4967, Dec 1995.
[73] R Najmanovich, J Kuttner, V Sobolev, and M Edelman. Side-chain flexibility in proteins upon ligand binding. Proteins, 39(3):261-268, May 2000.
[74] R Norel, SL Lin, HJ Wolfson, and R Nussinov. Shape complementarity at protein-protein interfaces. Biopolymers, 34(7):933-940, Jul 1994.
[75] J Norvell and JM Berg. The protein structure initiative, five years later. Scientist, 19(20):30-31, 2005.
[76] E Noy, T Tabakman, and A Goldblum. Constructing ensembles of flexible fragments in native proteins by iterative stochastic elimination is relevant to protein-protein interfaces. Proteins, 68:702-711, 2007.
[77] R Nussinov and HJ Wolfson. Efficient computational algorithms for docking and for generating and matching a library of functional epitopes I. Rigid and flexible hinge-bending docking algorithms. Comb Chem High Throughput Screen, 2(5):249-59, 1999.
[78] R Nussinov and HJ Wolfson. Efficient computational algorithms for docking and for generating and matching a library of functional epitopes II. Computer vision-based techniques for the generation and utilization of functional epitopes. Comb Chem High Throughput Screen, 2(5):261-269, Oct 1999.
[79] VD Ozrin, MV Subbotin, and SM Nikitin. Plass: protein-ligand affinity statistical score - a knowledge-based force-field model of interaction derived from the pdb. J Comput Aided Mol Des, 18(4):261-270, Apr 2004.
[80] C Pargellis, L Tong, L Churchill, PF Cirillo, T Gilmore, AG Graham, PM Grob, ER Hickey, N Moss, S Pav, and J Regan. Inhibition of p38 map kinase by utilizing a novel allosteric binding site. Nat Struct Biol, 9(4):268-272, Apr 2002.
[81] P De La Paz, JM Burridge, SJ Oatley, and CCF Blake. Multiple modes of binding of thyroid hormones and other iodothyronines to human plasma transthyretin, chapter Multiple modes of binding of thyroid hormones and other iodothyronines to human plasma transthyretin, pages 119-172. 1992.
[82] DA Pearlman. Free Energy Calculations in Rational Drug Design, chapter Theory, pages 9-35. Springer, 2001.
[83] E Perola and PS Charifson. Conformational analysis of drug-like molecules bound to proteins: an extensive study of ligand reorganization upon binding. J Med Chem, 47(10):2499-2510, May 2004.
[84] E Perola, WP Walters, and PS Charifson. A detailed comparison of current docking and scoring methods on systems of pharmaceutical relevance. Proteins, 56(2):235-249, Aug 2004.
[85] M Rarey, B Kramer, and T Lengauer. The particle concept: placing discrete water molecules during protein-ligand docking predictions. Proteins, 34(1):17-28, 1999.
[86] A Rayan, E Noy, D Chema, A Levitzki, and A Goldblum. Stochastic algorithm for kinase homology model construction. Curr Med Chem, 11:675-692, 2004.
[87] A Rayan, H Senderowitz, and A Goldblum. Exploring the conformational space of cyclic peptides by a stochastic search method. J Mol Graph Model, 22(5):319-333, May 2004.
[88] TJ Rydel, A Tulinsky, W Bode, and R Huber. Refined structure of the hirudin-thrombin complex. J Mol Biol, 221(2):583-601, Sep 1991.
[89] B Sandak, R Nussinov, and HJ Wolfson. An automated computer vision and robotics-based technique for 3-d flexible biomolecular docking and matching. Comput Appl Biosci, 11(1):87-99, Feb 1995.
[90] B Sandak, R Nussinov, and HJ Wolfson. A method for biomolecular structural recognition and docking allowing conformational flexibility. J Comput Biol, 5(4):631-654, 1998.
[91] DM Schulz, C Ihling, GM Clore, and A Sinz. Mapping the topology and determination of a low-resolution three-dimensional structure of the calmodulin-melittin complex by chemical cross-linking and high-resolution fticrms: direct demonstration of multiple binding modes. Biochemistry, 43(16):4703-4715, Apr 2004.
[92] J Singh, Z Deng, G Narale, and C Chuaqui. Structural interaction fingerprints: a new approach to organizing, mining, analyzing, and designing protein-small molecule complexes. Chem Biol Drug Des, 67(1):5-12, January 2006.
[93] FJ Solis and RJ-B Wets. Minimization by random search techniques. Math Oper Res, 6:19-30, 1981.
[94] CA Sotriffer and I Dramburg. In situ cross-docking to simultaneously address multiple targets. J Med Chem, 48(9):3122-3125, May 2005.
[95] SF Sousa, PA Fernandes, and MJ Ramos. Protein-ligand docking: current status and future challenges. Proteins, 65(1):15-26, Oct 2006.
[96] RD Taylor, PJ Jewsbury, and JW Essex. FDS: flexible ligand and receptor docking with a continuum solvent model and soft-core energy function. J Comput Chem, 24(13):1637-1656, Oct 2003.
[97] SJ Teague. Implications of protein flexibility for drug discovery. Nat Rev Drug Discov, 2(7):527-541, Jul 2003.
[98] GE Terp, IT Christensen, and FS Jørgensen. Structural differences of matrix metalloproteinases. Homology modeling and energy minimization of enzyme-substrate complexes. J Biomol Struct Dyn, 17(6):933-946, Jun 2000.
[99] A Tovchigrechko and IA Vakser. How common is the funnel-like energy landscape in protein-protein interactions? Protein Sci, 10(8):1572-1583, Aug 2001.
[100] CJ Tsai, S Kumar, B Ma, and R Nussinov. Folding funnels, binding funnels, and protein function. Protein Sci, 8(6):1181-1190, Jun 1999.
[101] S Vajda, Z Weng, R Rosenfeld, and C DeLisi. Effect of conformational flexibility and solvation on receptor-ligand binding free energies. Biochemistry, 33(47):13977-13988, Nov 1994.
[102] IA Vakser. Low-resolution docking: prediction of complexes for underdetermined structures. Biopolymers, 39(3):455-464, Sep 1996.
[103] IA Vakser, OG Matar, and CF Lam. A systematic study of low-resolution recognition in protein-protein complexes. Proc Natl Acad Sci U S A, 96(15):8477-8482, Jul 1999.
[104] ADJ van Dijk and AMJ Bonvin. Solvated docking: introducing water into the modelling of biomolecular complexes. Bioinformatics, 22(19):2340-2347, Oct 2006.
[105] GM Verkhivker, PA Rejto, DK Gehlhaar, and ST Freer. Exploring the energy landscapes of molecular recognition by a genetic algorithm: analysis of the requirements for robust docking of hiv-1 protease and fkbp-12 complexes. Proteins, 25(3):342-353, Jul 1996.
[106] DF Wang, O Wiest, P Helquist, HY Lan-Hargest, and NL Wiech. On the function of the 14 Å long internal cavity of histone deacetylase-like protein: implications for the design of histone deacetylase inhibitors. J Med Chem, 47(13):3409-3417, Jun 2004.
[107] J Wang, PA Kollman, and ID Kuntz. Flexible ligand docking: a multistep strategy approach. Proteins, 36(1):1-19, 1999.
[108] J Wang, P Morin, W Wang, and PA Kollman. Use of mm-pbsa in reproducing the binding free energies to hiv-1 rt of tibo derivatives and predicting the binding mode to hiv-1 rt of efavirenz by docking and mm-pbsa. J Am Chem Soc, 123(22):5221-5230, Jun 2001.
[109] R Wang, Y Lu, and S Wang. Comparative evaluation of 11 scoring functions for molecular docking. J Med Chem, 46(12):2287-2303, Jun 2003.
[110] GL Warren, CW Andrews, AM Capelli, B Clarke, J LaLonde, MH Lambert, M Lindvall, N Nevins, SF Semus, S Senger, G Tedesco, ID Wall, JM Woolven, CE Peishoff, and MS Head. A critical assessment of docking programs and scoring functions. J Med Chem, 49(20):5912-5931, Oct 2006.
[111] PK Weiner and PA Kollman. Amber: Assisted model building with energy refinement. A general program for modeling molecules and their interactions. J Comp Chem, 2, 1981.
[112] SJ Weiner, PA Kollman, DA Case, UC Singh, C Ghio, G Alagona, S Profeta, and P Weiner. A new force field for molecular mechanical simulation of nucleic acids and proteins. J Am Chem Soc, 106(3):765-784, 1984.
[113] Wikipedia. Interaction energy of argon dimer.
[114] Z Xiang and B Honig. Extending the accuracy limits of prediction for side-chain conformations. J Mol Biol, 311(2):421-430, Aug 2001.
[115] C Zhang, J Chen, and C DeLisi. Protein-protein recognition: exploring the energy funnels near the binding sites. Proteins, 34(2):255-267, Feb 1999.
[116] L Zídek, MV Novotny, and MJ Stone. Increased protein backbone conformational entropy upon hydrophobic ligand binding. Nat Struct Biol, 6(12):1118-1121, Dec 1999.
Hebrew abstract