
Combinatorial Chemistry & High Throughput Screening, 2006, 9, 213-228


Computational Methods in Developing Quantitative Structure-Activity Relationships (QSAR): A Review


Arkadiusz Z. Dudek*,a Tomasz Arodz,b and Jorge Gálvezc

a University of Minnesota Medical School, Minneapolis, MN 55455, USA

b Institute of Computer Science, AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków, Poland

c Unit of Drug Design and Molecular Connectivity Research, University of Valencia, 46100 Burjassot, Valencia, Spain
Abstract: Virtual filtering and screening of combinatorial libraries have recently gained attention as methods complementing high-throughput screening and combinatorial chemistry. These chemoinformatic techniques rely heavily on quantitative structure-activity relationship (QSAR) analysis, a field with established methodology and a successful history. In this review, we discuss the computational methods for building QSAR models. We start by outlining their usefulness in high-throughput screening and identifying the general scheme of a QSAR model. We then focus on the methodologies for constructing the three main components of a QSAR model, namely the methods for describing the molecular structure of compounds, for selecting informative descriptors and for predicting activity. We present both well-established methods and techniques recently introduced into the QSAR domain.

Keywords: QSAR, molecular descriptors, feature selection, machine learning.

1. INTRODUCTION

High throughput screening (HTS) has been a major recent technological improvement in the drug discovery pipeline. In conjunction with combinatorial chemistry, it allows for synthesis and rapid activity assessment of vast numbers of small-molecule compounds [1, 2]. As the experience with these technologies matured, the focus has shifted from sifting through large, diverse molecule collections to more rationally designed libraries [3]. With this need for knowledge-guided screening of compounds, virtual filtering and screening have been recognized as techniques complementary to high-throughput screening [4, 5]. To a large extent, these techniques rely on quantitative structure-activity relationship (QSAR) analysis, which has been in constant advancement since the work of Hansch [6] in the early 1960s. The QSAR methodology focuses on finding a model that correlates activity with structure within a family of compounds. Such models can be used to increase the effectiveness of HTS in several ways [7, 8]. QSAR studies can reduce the costly failures of drug candidates in clinical trials by filtering the combinatorial libraries. Virtual filtering can eliminate compounds with predicted toxic or poor pharmacokinetic properties [9, 10] early in the pipeline. It also allows for narrowing the library to drug-like or lead-like compounds [11] and eliminating the frequent hitters, i.e., compounds that show unspecific activity in several assays and rarely result in leads [12]. Including such considerations at an early stage results in multidimensional optimization, with high activity as an essential but not the only goal [8].
*Address correspondence to this author at the University of Minnesota, Division of Hematology, Oncology and Transplantation, 420 Delaware St. SE, MMC 480, Minneapolis, MN 55455, USA; Tel: +1 612 624-0123; Fax: +1 612 625-6919; E-mail: dudek002@umn.edu

Considering activity optimization, building target-specific structure-activity models based on identified hits can guide HTS by rapidly screening the library for the most promising candidates. Such focused screening can reduce the number of experiments and allow for the use of more complex, low-throughput assays [7]. Interpretation of the created models gives insight into the chemical space in the proximity of the hit compound. Feedback loops of high-throughput and virtual screening, resulting in the sequential screening approach [13], therefore allow for more rational progress towards high-quality lead compounds. Later in the drug discovery pipeline, accurate QSAR models constructed on the basis of the lead series can assist in optimizing the lead [14]. The importance and difficulty of the above-described tasks facing QSAR models have inspired many chemoinformatics researchers to borrow from recent developments in various fields, including pattern recognition, molecular modeling, machine learning and artificial intelligence. This results in a large family of conceptually different methods being used for creating QSARs. The purpose of this review is to guide the reader through this diversity of techniques and algorithms for developing successful QSAR models.

1.1. General Scheme of a QSAR Study

The chemoinformatic methods used in building QSAR models can be divided into three groups, i.e., extracting descriptors from the molecular structure, choosing those informative in the context of the analyzed activity, and, finally, using the values of the descriptors as independent variables to define a mapping that correlates them with the activity in question. The typical QSAR system realizes these phases as depicted in Fig. 1.

Generation of Molecular Descriptors from Structure

The small-molecule compounds are defined by their structure, encoded as a set of atoms and the covalent bonds between them.


Fig. (1). Main stages of a QSAR study. The molecular structure is encoded using numerical descriptors. The set of descriptors is pruned to select the most informative ones. The activity is derived as a function of the selected descriptors.

However, the structure cannot be directly used for creating structure-activity mappings, for reasons stemming from both chemistry and computer science. First, the chemical structures do not usually contain in an explicit form the information that relates to activity. This information has to be extracted from the structure. Various rationally designed molecular descriptors accentuate different chemical properties implicit in the structure of the molecule, and such properties may correlate with the activity more directly than the raw structure. They range from physicochemical and quantum-chemical to geometrical and topological features. The second, more technical reason, which guides the use and development of molecular descriptors, stems from the paradigm of feature space prevailing in statistical data analysis. Most methods employed to predict the activity require as input numerical feature vectors of uniform length for all molecules. Chemical structures of compounds are diverse in size and nature and as such do not fit into this model directly. To circumvent this obstacle, molecular descriptors convert the structure into well-defined sets of numerical values.

Selection of Relevant Molecular Descriptors

Many applications are capable of generating hundreds or thousands of different molecular descriptors. Typically, only some of them are significantly correlated with the activity. Furthermore, many of the descriptors are intercorrelated. This has negative effects on several aspects of QSAR analysis. Some statistical methods require that the number of compounds be significantly greater than the number of descriptors; using large descriptor sets would thus require large datasets. Other methods, while capable of handling datasets with large descriptor-to-compound ratios, nonetheless suffer from loss of accuracy, especially for compounds unseen during the preparation of the model. A large number of descriptors also affects the interpretability of the final model. To tackle these problems, a wide range of methods for

automated narrowing of the set of descriptors to the most informative ones is used in QSAR analysis.

Mapping the Descriptors to Activity

Once the relevant molecular descriptors are computed and selected, the final task of creating a function between their values and the analyzed activity can be carried out. The value quantifying the activity is expressed as a function of the values of the descriptors. The most accurate mapping function from some wide family of functions is usually fitted based on the information available in the training set, i.e., compounds for which the activity is known. A wide range of mapping function families can be used, including linear and non-linear ones, and many methods for carrying out the training to obtain the optimal function can be employed.

1.2. Organization of the Review

In the following sections of this review, we discuss each of the three groups of methods in more detail. We start with the various types of molecular descriptors in section 2. Next, in section 3 we focus on automatic methods for choosing the most predictive descriptors from a large initial set. Then, in section 4 we describe the linear and non-linear mapping techniques used for expressing the activity or property as a function of the values of the selected descriptors. For each technique, we provide an overview of the method followed by examples of its applications in QSAR analysis.

2. MOLECULAR DESCRIPTORS

Molecular descriptors map the structure of the compound into a set of numerical or binary values representing various molecular properties that are deemed to be important for explaining activity. Two broad families of descriptors can be distinguished, based on their dependence on information about the 3D orientation and conformation of the molecule.

2.1. 2D QSAR Descriptors

The broad family of descriptors used in the 2D-QSAR approach share a common property of being independent of the 3D orientation of the compound. These descriptors


range from simple measures of the entities constituting the molecule, through its topological and geometrical properties, to computed electrostatic and quantum-chemical descriptors and advanced fragment-counting methods.

Constitutional Descriptors

Constitutional descriptors capture properties of the molecule that are related to the elements constituting its structure. These descriptors are fast and easy to compute. Examples of constitutional descriptors include molecular weight, total number of atoms in the molecule and numbers of atoms of different identity. Also, a number of properties relating to bonds are used, including the total numbers of single, double, triple or aromatic bonds, as well as the number of aromatic rings.

Electrostatic and Quantum-Chemical Descriptors

Electrostatic descriptors capture information on the electronic nature of the molecule. These include descriptors containing information on atomic net and partial charges [15]. Descriptors for the highest negative and positive charges are also informative, as is molecular polarizability [16]. Partial negatively or positively charged solvent-accessible atomic surface areas have also been used as informative electrostatic descriptors for modeling intermolecular hydrogen bonding [17]. Energies of the highest occupied and lowest unoccupied molecular orbitals form useful quantum-

chemical descriptors [18], as do derivative quantities such as absolute hardness [19].

Topological Descriptors

The topological descriptors treat the structure of the compound as a graph, with atoms as vertices and covalent bonds as edges. Based on this approach, many indices quantifying molecular connectivity were defined, starting with the Wiener index [20], which counts the total number of bonds in the shortest paths between all pairs of non-hydrogen atoms. Other topological descriptors include the Randić indices χ [21], defined as sums of geometric averages of the edge degrees of atoms within paths of given lengths, Balaban's J index [22] and the Schultz index [23]. Information about valence electrons can be included in topological descriptors, e.g. the Kier and Hall valence connectivity indices χv [24] or the Gálvez topological charge indices [25]. The former use geometric averages of valence connectivities along paths. The latter measure topological valences of atoms and the net charge transfer between pairs of atoms separated by a given number of bonds. To exemplify the derivation of topological indices, in Fig. 2 we show the calculation of the Gálvez indices for an atom distance of a single bond. Descriptors combining connectivity information with other properties are also available, e.g. the BCUT descriptors [26-28], which take the form of eigenvalues of the atom connectivity matrix with atom charge, polarizability or H-bond potential values on the diagonal and additional terms off the diagonal.

Fig. (2). Example of the Gálvez first-order topological charge indices G1 and J1 for isopentane. The matrix product M of the atom adjacency matrix and the topological distance matrix, the latter defined through inverse squares of inter-atom distances, is used to define the charge terms CTij = Mij - Mji. The Gk indices are defined as the algebraic sum of the absolute values of the charge terms for pairs of atoms separated by k bonds. The Jk indices result from normalizing Gk by the number of bonds in the molecule.
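The recipe in the caption can be reproduced with a few lines of linear algebra. The sketch below, a minimal illustration rather than a reference implementation, computes G1 and J1 for the hydrogen-depleted graph of isopentane with NumPy; it assumes, as one common convention, that the diagonal of the inverse-square distance matrix is set to zero.

```python
import numpy as np

# Hydrogen-depleted graph of isopentane (2-methylbutane):
# carbons 0-1-2-3 form a chain, carbon 4 branches from carbon 1.
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (1, 4)]:
    A[i, j] = A[j, i] = 1

# Topological (shortest-path) distances via Floyd-Warshall.
D = np.full((5, 5), np.inf)
np.fill_diagonal(D, 0)
D[A == 1] = 1
for k in range(5):
    D = np.minimum(D, D[:, [k]] + D[[k], :])

# Inverse-square distance matrix; the diagonal is set to zero here
# (an assumption of this sketch).
Dstar = np.zeros_like(D)
off = ~np.eye(5, dtype=bool)
Dstar[off] = 1.0 / D[off] ** 2

M = A @ Dstar                    # matrix product from the caption
CT = M - M.T                     # charge terms CTij = Mij - Mji

k = 1                            # first-order indices: atoms one bond apart
pairs = [(i, j) for i in range(5) for j in range(i + 1, 5) if D[i, j] == k]
G1 = sum(abs(CT[i, j]) for i, j in pairs)
J1 = G1 / int(A.sum() / 2)       # normalize by the number of bonds (4)
print(G1, J1)
```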


Similarly, the Topological Sub-Structural Molecular Design (TOSS-MODE/TOPS-MODE) approach [29, 30] relies on spectral moments of the bond adjacency matrix augmented with information on, e.g., bond polarizability. The atom-type electrotopological state (E-state) indices [31, 32] use electronic and topological organization to define the intrinsic atom state and the perturbations of this state induced by other atoms. This information is gathered individually for a wide range of atom types to form a set of indices.

Geometrical Descriptors

Geometrical descriptors rely on the spatial arrangement of the atoms constituting the molecule. These descriptors include information on the molecular surface obtained from atomic van der Waals areas and their overlap [33]. Molecular volume may be obtained from atomic van der Waals volumes [34]. Principal moments of inertia and gravitational indices [35] also capture information on the spatial arrangement of the atoms in the molecule. Shadow areas, obtained by projection of the molecule onto its two principal axes, are also used [36]. Another geometrical descriptor is the total solvent-accessible surface area [37, 38].

Fragment-Based Descriptors and Molecular Fingerprints

The family of descriptors relying on substructural motifs is often used, especially for rapid screening of very large databases. The BCI fingerprints [39] are derived as bits describing the presence or absence in the molecule of certain fragments, including atoms with their nearest neighborhoods, atom pairs and sequences, or ring-based fragments. A similar approach is present in the basic set of 166 MDL Keys [40]. However, other variants of the MDL Keys are also available, including extended sets of keys or compact sets. The latter are the results of dedicated pruning strategies [41] or elimination methods, e.g. the Fast Random Elimination of Descriptors/Substructure Keys (FRED/SKEYS) [42]. The recently introduced Hologram QSAR (HQSAR) [43] approach is based on counting the number of occurrences of certain substructural paths of functional groups. For each group, a cyclic redundancy code is calculated, which serves as a hashing function [44] for partitioning the substructural motifs into the bins of a hash table. The numbers of elements in the bins form a hologram. The Daylight fingerprints [45] are a natural extension of the fragment-based descriptors, eliminating the reliance on a pre-defined list of substructure motifs. The fingerprint for each molecule is a string of bits. However, a structural motif in the molecule does not correspond to a single bit, but leads, through a hashing function, to a pattern of bits that is added to the fingerprint with a logical "or" operation. The bits in different patterns may overlap, due to the large number of possible patterns and the finite length of the bit string. Thus, the fact that a bit or several bits are set in the fingerprint cannot be interpreted as proof of a pattern's presence. However, if one of the bits corresponding to a given pattern is not set, this guarantees that the pattern is not present in the molecule. This allows for rapid filtering of molecules that do not possess certain structural motifs. The patterns are generated

individually for each molecule, and describe atoms with their neighborhoods and paths of up to 7 bonds. Other approaches than hashed fingerprints have also been proposed to circumvent the problem of a pre-defined substructure library, e.g. an algorithm for optimal discovery of frequent structural fragments relevant to a given activity [46].

2.2. 3D QSAR Descriptors

The 3D-QSAR methodology is much more computationally complex than the 2D-QSAR approach. In general, it involves several steps to obtain numerical descriptors of the compound structure. First, the conformation of the compound has to be determined, either from experimental data or molecular mechanics, and then refined by minimizing the energy [47, 48]. Next, the conformers in the dataset have to be uniformly aligned in space. Finally, the space with the immersed conformer is probed computationally for various descriptors. Some methods independent of the compound alignment have also been developed.

2.2.1. Alignment-Dependent 3D QSAR Descriptors

The group of methods that require molecule alignment prior to the calculation of descriptors is strongly dependent on information about the receptor for the modeled ligand. In cases where such data are available, the alignment can be guided by studying the receptor-ligand complexes. Otherwise, purely computational methods for superimposing the structures in space have to be used [49, 50]. These methods rely, e.g., on atom-atom or substructure-substructure mapping.

Comparative Molecular Field Analysis

The Comparative Molecular Field Analysis (CoMFA) [51] uses electrostatic (Coulombic) and steric (van der Waals) energy fields defined by the inspected compound. The aligned molecule is placed in a 3D grid. In each point of the grid lattice, a probe atom with unit charge is placed and the potentials (Coulomb and Lennard-Jones) of the energy fields are computed. They then serve as descriptors in further analysis, typically using partial least squares regression. This analysis allows for identifying structure regions positively and negatively related to the activity in question.

Comparative Molecular Similarity Indices Analysis

The Comparative Molecular Similarity Indices Analysis (CoMSIA) [52] is similar to CoMFA in the aspect of probing with an atom throughout the regular grid lattice in which the molecules are immersed. The similarity between the probe atom and the analyzed molecule is calculated. Compared to CoMFA, CoMSIA uses a different potential function, namely a Gaussian-type function. Steric, electrostatic, and hydrophobic properties are calculated, and the probe atom accordingly has unit hydrophobicity as an additional property. The use of the Gaussian-type potential function instead of the Lennard-Jones and Coulombic functions allows for accurate information in grid points located within the molecule. In CoMFA, unacceptably large values are obtained in these points due to the nature of the potential functions, and arbitrary cut-offs have to be applied.
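To make the grid-probing idea concrete, the sketch below computes a CoMFA-style electrostatic and steric field vector for a toy aligned molecule. The coordinates, partial charges and the shared Lennard-Jones parameters are invented for illustration only; real implementations use force-field parameters per atom type and more careful cut-off handling.

```python
import numpy as np

# Toy aligned molecule: (x, y, z) coordinates, partial charges q,
# and one shared Lennard-Jones well depth/radius for every atom.
# All values are illustrative, not force-field parameters.
coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.2, 1.2, 0.0]])
q = np.array([-0.4, 0.1, 0.3])
epsilon, sigma = 0.15, 3.4           # toy LJ parameters
probe_charge = 1.0                   # probe atom with unit charge

# Regular 3D lattice embedding the molecule.
axis = np.arange(-4.0, 6.0, 2.0)
grid = np.array([[x, y, z] for x in axis for y in axis for z in axis])

def fields_at(point):
    """Coulomb and Lennard-Jones potentials felt by the probe at one node."""
    r = np.linalg.norm(coords - point, axis=1)
    r = np.maximum(r, 0.5)           # crude cut-off against singularities
    coulomb = np.sum(332.0 * probe_charge * q / r)
    lj = np.sum(4 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6))
    return coulomb, lj

# Each grid node contributes two descriptors (electrostatic, steric);
# concatenated over all nodes they form the descriptor vector.
descriptors = np.array([fields_at(p) for p in grid]).ravel()
print(descriptors.shape)             # (2 * number of grid nodes,)
```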


2.2.2. Alignment-Independent 3D QSAR Descriptors

Another group of 3D descriptors are those invariant to molecule rotation and translation in space. Thus, no superposition of compounds is required.

Comparative Molecular Moment Analysis

The Comparative Molecular Moment Analysis (CoMMA) [53] uses second-order moments of the mass and charge distributions. The moments relate to the center of mass and the center of the dipole. The CoMMA descriptors include the principal moments of inertia and the magnitudes of the dipole moment and the principal quadrupole moment. Furthermore, descriptors relating the charge to the mass distribution are defined, i.e., the magnitudes of the projections of the dipole upon the principal moments of inertia and the displacement between the center of mass and the center of dipole.

Weighted Holistic Invariant Molecular Descriptors

The Weighted Holistic Invariant Molecular (WHIM) [54, 55] and Molecular Surface WHIM [56] descriptors provide the invariant information by employing principal component analysis (PCA) on the centered co-ordinates of the atoms constituting the molecule. This transforms the molecule into the space that captures the most variance. In this space, several statistics are calculated and serve as directional descriptors, including variance, proportions, symmetry and kurtosis. By combining the directional descriptors, non-directional descriptors are also defined. The contribution of each atom can be weighted by a chemical property, leading to different principal components capturing the variance within the given property. The atoms can be weighted by mass, van der Waals volume, atomic electronegativity, atomic polarizability, the electrotopological index of Kier and Hall and the molecular electrostatic potential.

VolSurf

The VolSurf [57, 58] approach is based on probing the grid around the molecule with specific probes, e.g. probes for hydrophobic interactions or hydrogen bond acceptor or donor groups. The resulting lattice boxes are used to compute descriptors relying on volumes or surfaces of 3D contours, defined by the same value of the probe-molecule interaction energy. By using various probes and cut-off values for the energy, different molecular properties can be quantified. These include, e.g., molecular volume and surface, and hydrophobic and hydrophilic regions. Derivative quantities, e.g. molecular globularity or factors relating the surface of hydrophobic or hydrophilic regions to the surface of the whole molecule, can also be computed. In addition, various geometry-based descriptors are available, including energy minima distances or amphiphilic moments.

Grid-Independent Descriptors

The Grid-Independent Descriptors (GRIND) [59] have been devised to overcome the problems with interpretability common to alignment-independent descriptors. Similarly to VolSurf, the approach utilizes probing of the grid with specific probes. The regions showing the most favorable energies of interaction are selected, provided that the distances between the regions are large. Next, the probe-based energies are encoded in a way independent of the molecule's

arrangement. To this end, the distances between the nodes in the grid are discretized into a set of bins. For each distance bin, the nodes with the highest product of energies are stored and the value of the product serves as the numerical descriptor. In addition, the stored information on the position of the nodes can be used to track down the exact regions of the molecule relating to the given property. To extend the molecular information captured by the descriptors, the product of node energies may include not only energies relating to the same probe, but also to two different probe types.

2.3. The 2D- Versus 3D-QSAR Approach

It is generally assumed that 3D approaches are superior to 2D in drug design. Yet, studies show such an assumption may not always hold. For example, the results of conventional CoMFA may often be non-reproducible due to the dependence of the outputs' quality on the orientation of the rigidly aligned molecules on the user's terminal [60, 61]. Such alignment problems are typical in 3D approaches, and even though some solutions have been proposed, the unambiguous 3D alignment of structurally diverse molecules still remains a difficult task. Moreover, the distinction between 2D- and 3D-QSAR approaches is not a crisp one, especially when alignment-independent descriptors are considered. This can be observed when comparing the BCUT with the WHIM descriptors. Both employ a similar algebraic method, i.e., solving an eigenproblem for a matrix describing the compound - the connectivity matrix in the case of BCUT descriptors and the covariance matrix of 3D co-ordinates in the case of WHIM. There is also a deeper connection between 3D-QSAR and one of the 2D methods, the topological approach. It stems from the fact that the geometry of a compound in many cases depends on its topology. An elegant example was provided by Estrada et al., who demonstrated that the dihedral angles of biphenyl as a function of the substituents attached to it can be predicted by topological indices [62]. Along the same line, a supposedly typically 3D property, chirality, has been predicted using chiral topological indices [63], constructed by introducing an adequate weight into the topological matrix for the chiral carbons.

3. AUTOMATIC SELECTION OF RELEVANT MOLECULAR DESCRIPTORS

Automatic methods for selecting the best descriptors, or features, to be used in the construction of the QSAR model fall into two categories [64]. In the wrapper approach, the quality of descriptor subsets is obtained from constructing and evaluating a series of QSAR models. In filtering, no model is built, and features are evaluated using other criteria.

3.1. Filtering Methods

These techniques are applied independently of the mapping method used. They are executed prior to the mapping, to reduce the number of descriptors following some objective criteria, e.g. inter-descriptor correlation.

Correlation-Based Methods

Pearson's correlation coefficients may serve as a preliminary filter for discarding intercorrelated descriptors.


This can be done by, e.g., creating clusters of descriptors having correlation coefficients higher than a certain threshold and retaining only one, randomly chosen member of each cluster [65]. Another procedure involves estimating the correlation between each pair of descriptors and, if it exceeds a threshold, randomly discarding one of the two descriptors [66]. The choice of the ordering in which pairs are evaluated may lead to significantly different results. One popular method is to first rank the descriptors using some criterion, and then iteratively browse the set starting from pairs containing the highest-ranking features. One such ranking is the correlation ranking, based on the correlation coefficient between the activity and the descriptors. However, correlation ranking is usually used in conjunction with principal component analysis [67, 68]. Methods using measures of correlation between activity and descriptors other than Pearson's have also been used, notably the pair-correlation method [69-71].

Methods Based on Information Theory

The information content of a descriptor is defined in terms of the entropy of the descriptor treated as a random variable. Based on this notion, various measures relating the information shared between two descriptors, or between a descriptor and the activity, can be defined. An example of such a measure, used in descriptor selection for QSAR, is the mutual information. The mutual information, sometimes referred to as information gain, quantifies the reduction of uncertainty, or information content, of the activity variable gained by knowing the descriptor values. It is used in QSAR to rank the descriptors [72, 73]. The application of information-theoretic criteria is straightforward when both the descriptors and the activity values are categorical. In the case of continuous numerical variables, some discretization scheme has to be applied to approximate the variables. Thus, such criteria are usually used with binary descriptors, such as the ones describing 3D properties of molecules used in the thrombin dataset in the Knowledge Discovery in Databases 2001 Cup.

Statistical Criteria

The Fisher's ratio, i.e., the ratio of the between-class variance to the within-class variance, can be used to rank the descriptors [74]. Next, the correlation between pairs of features is used, as described before, to reduce the set of descriptors. Another method used in assessing the quality of a descriptor is based on the Kolmogorov-Smirnov statistic [75]. As applied to descriptor selection in QSAR [76], it is a fast method that does not rely on knowledge of the underlying distribution and does not require the conversion of descriptors into categorical values. For two classes of activity to be predicted, the method measures the maximal absolute distance between the cumulative distribution functions of the descriptor for the individual activity classes.
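As a minimal illustration combining two of the filters described above, the sketch below ranks the descriptors of a synthetic two-class dataset by the Kolmogorov-Smirnov statistic and then prunes descriptor pairs whose Pearson correlation exceeds a threshold; the data and the 0.9 threshold are arbitrary choices for the example.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # 100 compounds, 20 descriptors
y = rng.integers(0, 2, size=100)        # binary activity classes
X[:, 3] += y                            # make descriptor 3 informative
X[:, 5] = X[:, 3] + 0.01 * rng.normal(size=100)  # and 5 redundant with 3

# Rank descriptors by the KS distance between the two class-conditional
# cumulative distribution functions (larger = more discriminative).
ks = [ks_2samp(X[y == 0, j], X[y == 1, j]).statistic for j in range(X.shape[1])]
ranking = np.argsort(ks)[::-1]

# Browse pairs starting from the top-ranked descriptors and drop the
# lower-ranked member of any pair correlated above the threshold.
corr, keep = np.corrcoef(X, rowvar=False), []
for j in ranking:
    if all(abs(corr[j, k]) < 0.9 for k in keep):
        keep.append(j)
print("selected descriptors:", keep[:5])
```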

3.2. Wrapper Methods

These techniques operate in conjunction with a mapping algorithm [77]. The choice of the best subset of descriptors is guided by the error of the mapping algorithm for a given subset, measured e.g. with cross-validation. A schematic illustration of wrapper methods is given in Fig. 3.

Fig. (3). Generic scheme for wrapper descriptor selection methods. Iteratively, various configurations of selected and discarded descriptors are evaluated by creating a descriptors-to-activity mapping and assessing its prediction accuracy. The final descriptors are those yielding the highest accuracy for a given family of mapping functions.

Genetic Algorithm

Genetic Algorithms (GA) [78] are efficient methods for function minimization. In the descriptor selection context, the prediction error of the model built upon a set of features is minimized [79, 80]. The genetic algorithm mimics natural evolution by modeling a dynamic population of solutions. The members of the population, referred to as chromosomes, encode the selected features. The encoding usually takes the form of bit strings, with the bits corresponding to selected features set and the others cleared. Each chromosome leads to a model built using the encoded features. Using the training data, the error of the model is quantified and serves as a fitness function. During the course of evolution, the chromosomes are subjected to crossover and mutation. By allowing survival and reproduction of the fittest chromosomes, the algorithm effectively minimizes the error function in subsequent generations. The success of a GA depends on several factors. The parameters steering the crossover, mutation and survival of chromosomes should be carefully chosen to allow the population to explore the solution space and to prevent early convergence to a homogeneous population occupying a local minimum. The choice of the initial population is also important in genetic feature selection. To address this issue, e.g. a method based on Shannon's entropy combined with graph analysis can be used [81].
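The sketch below shows one minimal form of such a wrapper, assuming a k-nearest-neighbor classifier as the mapping method and cross-validated accuracy (the complement of the error) as the fitness function; the synthetic data, population size, mutation rate and generation count are arbitrary illustrative settings.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 15))                 # 120 compounds, 15 descriptors
y = (X[:, 0] + X[:, 4] - X[:, 7] > 0).astype(int)  # synthetic activity classes

def fitness(mask):
    """Cross-validated accuracy of a model restricted to the masked features."""
    if not mask.any():
        return 0.0
    model = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(model, X[:, mask], y, cv=5).mean()

pop = rng.integers(0, 2, size=(20, X.shape[1])).astype(bool)   # chromosomes
for generation in range(25):
    scores = np.array([fitness(m) for m in pop])
    order = np.argsort(scores)[::-1]
    parents = pop[order[:10]]                  # survival of the fittest half
    children = []
    for _ in range(10):                        # crossover of random parents
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        cut = rng.integers(1, X.shape[1])
        child = np.concatenate([a[:cut], b[cut:]])
        child ^= rng.random(X.shape[1]) < 0.05 # mutation: flip ~5% of bits
        children.append(child)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected descriptors:", np.flatnonzero(best))
```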


Genetic Algorithms have been used in feature selection for QSAR with a range of mapping methods, e.g. Artificial Neural Networks [66, 82, 83], the k-Nearest Neighbor method [84] and Random Forest [66].

Simulated Annealing

Simulated Annealing (SA) [85] is another stochastic method for function optimization employed in QSAR [66, 86, 87]. As in the evolutionary approach, the function minimized represents the error of the model built using the subset of descriptors. The SA algorithm operates iteratively by finding a new subset of descriptors through altering the current-best one, e.g. by exchanging some percentage of the features. Next, SA evaluates the prediction error of the new subset and chooses whether to adopt the new solution as the current optimal solution. This decision depends on whether the new solution leads to a lower error than the current one. If so, the new solution is used. However, in the other case, the solution is not automatically discarded. With a given probability, based on the Boltzmann distribution, the worse solution can replace the current, better one. Replacing the solution with a worse one allows the SA method to escape from local minima of the error function, i.e., solutions that cannot be improved without traversing through less-fitted feature subsets. The power of the SA method stems from altering the temperature term in the Boltzmann distribution. At an early stage, when the solution is not yet highly optimized and is most prone to encountering local minima, the temperature is high. During the course of the algorithm, the temperature is lowered and acceptance of worse solutions becomes less likely. Thus, even if the obtained minimum is not global, it is nonetheless usually of high quality.

Sequential Feature Forward Selection

While genetic algorithms and simulated annealing rely on a guided random process of exploring the space of feature subsets, Forward Feature Selection [88] operates in a deterministic manner, implementing a greedy search through the feature subsets. As a first step, the single feature that leads to the best prediction is selected. Next, sequentially, each feature is individually added to the current subset and the errors of the resulting models are quantified. The feature that is best at reducing the error is incorporated into the subset. Thus, in each step a single best feature is added, resulting in a sequence of nested subsets of features. The procedure stops when a specified number of features is selected. More elaborate stopping conditions have also been proposed, e.g. based on incorporating an artificial random feature [89]: when this feature is about to be selected as the one that best improves the quality of the model, the procedure is stopped. The drawback of forward selection is that if several features collectively are good predictors but each alone is a poor predictor, none of the features may be chosen. Sequential feature forward selection has been used in several QSAR studies [65, 81, 90, 91]; a minimal sketch of the procedure is given below.
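This sketch reuses the synthetic data and k-nearest-neighbor mapping assumed in the genetic algorithm example, with a fixed target of five features as the stopping rule; real studies would replace that rule with, e.g., the random-feature criterion mentioned above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 15))
y = (X[:, 0] + X[:, 4] - X[:, 7] > 0).astype(int)

def score(features):
    """Cross-validated accuracy of the mapping restricted to `features`."""
    model = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(model, X[:, features], y, cv=5).mean()

selected = []
while len(selected) < 5:
    # Greedily add the single feature that most improves the model.
    candidates = [j for j in range(X.shape[1]) if j not in selected]
    best = max(candidates, key=lambda j: score(selected + [j]))
    selected.append(best)
    print(f"added descriptor {best}, accuracy {score(selected):.3f}")
```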

Sequential Backward Feature Elimination

Backward Feature Elimination [88] is another example of a greedy, sequential method that yields nested subsets of features. In contrast to forward selection, the full set of features is used as a starting point. Next, in each step, all subsets resulting from the removal of a single feature are analyzed for the prediction error, and the feature whose removal leads to the model with the lowest error is removed from the current subset. The procedure stops when the given number of features has been dropped. Backward elimination is slower than forward selection, yet often leads to better results. Recently, a significantly faster variant of backward elimination, the Recursive Feature Elimination method [92], has been proposed for Support Vector Machines (SVM). In this method, the feature to be removed is chosen based on a single execution of the learning method using all features remaining in the given iteration. The SVM allows for ranking the features according to their contribution to the result. Thus, the least contributing feature can be dropped to form a new, narrowed subset of features. There is no need to train an SVM for each candidate subset, as in the original feature elimination method. Variants of the backward feature elimination method have been used in numerous QSAR studies [93-97].

3.3. Hybrid Methods

In addition to the purely filter- or wrapper-based descriptor selection procedures, QSAR studies utilize a fusion of the two approaches. A rapid objective method is used as a preliminary filter to narrow the feature set. Next, one of the more accurate but slower subjective methods is employed. As an example of such a combination of techniques, a correlation-based test significantly reducing the number of features, followed by a genetic algorithm or simulated annealing, can be used [66]. A similar procedure, which uses greedy sequential feature forward selection, is also in use [65]. The feature selection can also be implicit in some mapping methods. For example, the Decision Tree (see section 4.2.4) utilizes only a subset of features in the decision process, if a single or only a few descriptors are tested at each node and the overall number of features exceeds the number of those used in the nodes. Similarly, ensembles of decision stumps (see section 4.2.6) also operate on a reduced number of descriptors if the number of members in the ensemble is smaller than the number of features.

4. MAPPING THE MOLECULAR STRUCTURE TO ACTIVITY

Given the selected descriptors, the final step in building the QSAR model is to derive the mapping between the activity and the values of the features. Simple, yet useful, methods model the activity as a linear function of the descriptors. Other, non-linear, methods extend this approach to more complex relations. Another important division of the mapping methods is based on the nature of the activity variable. In the case of predicting a continuous value, a regression problem is encountered. When only some categories or classes of the activity need to be predicted, e.g. partitioning compounds into active and inactive, a classification problem occurs. In regression, the dependent variable is modeled as a function of the descriptors, as noted above. In the classification framework, the resulting model is defined by a decision boundary separating the classes in the descriptor space. The approaches to QSAR mapping are depicted in Fig. 4.


Fig. (4). Approaches to QSAR mapping: a) linear regression with activity as a function of two descriptors, d1 and d2, b) binary classification with a linear decision boundary between classes of active (+) and inactive (-) compounds, c) non-linear regression, d) non-linear binary classification.

4.1. Linear Models

Linear models have been the basis of QSAR analysis since its beginning. They predict the activity as a linear function of molecular descriptors. In general, linear models are easily interpretable and sufficiently accurate for small datasets of similar compounds, especially when the descriptors are carefully selected for a given activity.

4.1.1. Multiple Linear Regression

Multiple Linear Regression (MLR) models the activity to be predicted as a linear function of all descriptors. Based on the examples from the training set, the coefficients of the function are estimated. These free parameters are chosen to minimize the squared errors between the predicted and the actual activity. The main restriction of MLR analysis is the case of a large descriptors-to-compounds ratio, or multicollinear descriptors in general. This makes the problem ill-conditioned and the results unstable. Multiple linear regression has been among the most widely used mapping methods in QSAR in recent decades. It has been employed in conjunction with genetic descriptor selection for modeling GABAA receptor binding, antimalarial activity, HIV-1 protease inhibition and glycogen phosphorylase inhibition [98], exhibiting lower cross-validation error than partial least squares, both using 4D-QSAR fingerprints. MLR has been applied to models in predictive toxicology [99, 100], Caco-2 permeability [101] and aqueous solubility [102]. In prediction of physicochemical properties [91, 96] and of COX-2 inhibition [103], MLR proved significantly worse than the non-linear support vector machine, yet comparable or only slightly inferior to the RBF neural network. However, in studies of logP [104], it proved worse than other models, including the multi-layer perceptron and the Decision Tree.
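In its simplest form, MLR reduces to an ordinary least-squares fit of descriptor values against activities; a minimal sketch on synthetic data follows.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))                     # 50 compounds, 4 descriptors
true_coef = np.array([1.2, -0.7, 0.0, 2.1])
y = X @ true_coef + 0.1 * rng.normal(size=50)    # noisy synthetic activity

# Augment with a column of ones for the intercept and solve the
# least-squares problem min ||Xb - y||^2 for the coefficients b.
Xa = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(Xa, y, rcond=None)
predicted = Xa @ coef
print("fitted coefficients:", np.round(coef, 2))
```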

4.1.2. Partial Least Squares

Partial Least Squares (PLS) [105-107] linear regression is a method suitable for overcoming the problems in MLR related to multicollinear or over-abundant descriptors. The technique assumes that, despite the large number of descriptors, the modeled process is governed by a relatively small number of latent independent variables. PLS tries to indirectly obtain knowledge of the latent variables by decomposing the input matrix of descriptors into two components, the scores and the loadings. The scores are orthogonal and, while capturing the descriptor information, also allow for good prediction of the activity. The estimation of the score vectors is done iteratively. The first one is derived using the first eigenvector of the combined activity-descriptor variance-covariance matrix. Next, the descriptor matrix is deflated by subtracting the information explained by the first score vector. The resulting matrix is used in the derivation of the second score vector, which, followed by consecutive deflation, closes the iteration loop. In each iteration step, the coefficient relating the score vector to the activity is also determined. PLS has been used successfully with 3D QSAR and HQSAR, e.g. in a study of nicotinic acetylcholine receptor binding modeling [108] and estrogen receptor binding [109]. It has also been used in a study [110] involving several datasets, including blood-brain barrier permeability, toxicity, P-glycoprotein transport, multidrug resistance reversal, CDK-2 antagonism, COX-2 inhibition, dopamine binding and logD, showing results better than decision trees in most cases, but usually worse than the most recent techniques, ensembles and SVM. PLS regression has been tested in prediction of COX-2 inhibition [84], but showed lower accuracy than a neural network and a decision tree. However, in a study of solubility prediction [111], PLS outperformed a neural network. Studies report PLS-based models in melting point and logP prediction [112]. Finally, PLS models for BBB permeability [113, 114], mutagenicity, toxicity, tumor growth inhibition and anti-HIV activity [115] and aqueous solubility [112, 113] have been created.
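A minimal sketch of PLS regression on multicollinear synthetic descriptors, using the scikit-learn implementation; the choice of two latent components matches how the data are generated here and would normally be made by cross-validation.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)
latent = rng.normal(size=(60, 2))            # two hidden latent variables
X = latent @ rng.normal(size=(2, 30))        # 30 highly collinear descriptors
X += 0.05 * rng.normal(size=X.shape)
y = latent @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=60)

# Two PLS components suffice because the data stem from two latent
# variables; the scores are built to covary maximally with the activity.
pls = PLSRegression(n_components=2)
pls.fit(X, y)
print("R^2 on training data:", round(pls.score(X, y), 3))
```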


4.1.3. Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) [116] is a classification method that creates a linear transformation of the original feature space into a space that maximizes the inter-class separability and minimizes the within-class variance. The procedure operates by solving a generalized eigenvalue problem based on the between-class and within-class covariance matrices. Thus, the number of features has to be significantly smaller than the number of observations to avoid ill-conditioning of the eigenvalue problem. Principal component analysis may be employed to reduce the dimension of the input data prior to applying LDA to overcome this problem. LDA has been used to create QSAR models, e.g. for prediction of model validity for new compounds [117], where it fared better than PLS, but worse than a non-linear neural network. However, in blood-brain barrier permeability prediction [114], LDA exhibited lower accuracy than a PLS-based method. In predicting antibacterial activity [118, 119], it performed worse than a neural network. LDA was also used to predict drug likeness [120], showing results slightly better than a linear programming machine, a method similar to the linear SVM; however, it yielded results worse than the non-linear SVM and bagging ensembles. In ecotoxicity prediction [121], LDA performed better than other linear methods and k-NN, but was inferior to decision trees.

4.2. Non-Linear Models

Non-linear models extend the structure-activity relationships to non-linear functions of the input descriptors. Such models may be more accurate, especially for large and diverse datasets. However, they are usually harder to interpret. Complex, non-linear models may also fall prey to overfitting [122], i.e., low generalization to compounds unseen during training.

4.2.1. Bayes Classifier

The Bayes Classifier stems from the Bayes rule, which relates the posterior probability of a class to its overall probability, the probability of the observations and the likelihood of the class with respect to the observed variables. In the Bayes rule, the class maximizing the posterior probability is chosen as the prediction result. However, in real problems, the likelihoods are not known and have to be estimated, and given a finite number of training examples, such estimation is not trivial. One way to approach this problem is to assume independence of the class likelihoods with respect to the different descriptors. This leads to the Naive Bayes Classifier (NBC). For typical datasets, the estimation of likelihoods with respect to single variables is feasible. The drawback of this method is that the independence assumption usually does not hold. An extensive study using the Naive Bayes Classifier in comparison with other methods was conducted [110] using numerous endpoints, including COX-2, CDK-2, BBB, dopamine, logD, P-glycoprotein, toxicity and multidrug resistance reversal. In most cases NBC was inferior to other methods; however, it outperformed PLS for BBB and CDK-2, k-NN for P-glycoprotein and COX-2, and decision trees for BBB and P-glycoprotein. In thrombin binding [72], NBC yielded worse results than SVM. However, NBC has been shown useful in modeling the inhibition of the HIV-1 protease [123].
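A minimal Naive Bayes sketch for binary, fingerprint-like descriptors, using the Bernoulli variant from scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(5)
X = rng.integers(0, 2, size=(80, 50))     # 80 compounds, 50 fingerprint bits
y = (X[:, 0] | X[:, 1]).astype(int)       # synthetic activity rule

# BernoulliNB estimates, independently for each bit, the likelihood of
# the bit being set within each activity class, then applies Bayes rule.
nb = BernoulliNB()
nb.fit(X[:60], y[:60])                    # train on the first 60 compounds
print("test accuracy:", nb.score(X[60:], y[60:]))
```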

4.2.2. The k-Nearest Neighbor Method

The k-Nearest Neighbor (k-NN) method [124] is a simple decision scheme that requires practically no training and is asymptotically optimal, i.e., with an increase in training data it converges to the optimal prediction error. For a given compound in the descriptor space, the method analyzes its k nearest neighboring compounds from the training set and predicts the activity class that is most highly represented among these neighbors. The k-NN scheme is sensitive to the choice of metric and to the number of training compounds available. Also, the number of neighbors analyzed can be optimized to yield the best results. The k-nearest neighbor scheme has been used, e.g., for predicting COX-2 inhibition [84], where it showed accuracy higher than PLS and similar to a neural network. Anticonvulsant activity, dopamine D1 antagonists and aquatic toxicity have also been modeled using this method [86]. In a study on P-glycoprotein transport activity [94], k-NN performed comparably to a decision tree, but worse than a neural network and SVM. In ecotoxicity QSAR [121], k-NN was better than some linear methods, but inferior to discriminant analysis and decision trees.
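The decision scheme itself fits in a few lines; the sketch below implements the majority vote over Euclidean neighbors directly, with k = 3 as an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(6)
X_train = rng.normal(size=(100, 5))              # descriptor vectors
y_train = (X_train[:, 0] > 0).astype(int)        # synthetic activity classes

def knn_predict(x, k=3):
    """Majority vote among the k nearest training compounds."""
    dist = np.linalg.norm(X_train - x, axis=1)   # Euclidean metric
    nearest = np.argsort(dist)[:k]
    return np.bincount(y_train[nearest]).argmax()

x_new = rng.normal(size=5)                       # a new, unseen compound
print("predicted class:", knn_predict(x_new))
```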


4.2.3. Artificial Neural Networks

Artificial Neural Networks (ANN) [125] are biologically-inspired prediction methods based on the architecture of a network of neurons. A wide range of specific models based on this paradigm have been analyzed in the literature, with perceptron-based and radial-basis function-based ones prevailing. These two methods both fall into the category of feed-forward networks, in which, during the prediction, the information flows only in the direction from the input descriptors, through a set of layers, to the output of the network.

Multi-Layer Perceptron

The Multi-Layer Perceptron (MLP) model consists of a layered network of interconnected perceptrons, i.e., simple models of a neuron [126]. Each perceptron is capable of making a linear combination of its input values and, by means of a certain transfer function, producing a binary or continuous output. Notably, each input of the perceptron has an adaptive weight specifying the importance of that input. In the training of a single perceptron, the inputs are formed by the molecular descriptors, while the output should predict the activity of the compound. To achieve this goal, the perceptron is trained by adjusting the weights to produce a linear combination of the descriptors that optimally predicts the activity. The adjustment process relies on the feedback from comparing the predicted with the expected output. That is, the error in the prediction is propagated to the weights of the descriptors, altering them in the direction that counters the error. While a single perceptron is a linear model, a network consisting of layers of perceptrons, with the outputs of one layer connected to the inputs of the neurons in the consecutive layer, allows for non-linear prediction [127]. Multi-layer networks contain a single input layer, which consists simply of the values of the molecular descriptors, one or more hidden layers, which process the descriptors into internal representations, and an output layer utilizing the internal representation to produce the final prediction. This architecture is depicted in Fig. 5. In multi-layer networks, training, i.e., the adjustment of the weights, becomes non-trivial. Apart from the output layer, the feedback information is no longer directly available to adjust the weights of the neuron inputs in the hidden layers. One popular method to overcome this problem is the backward propagation of error method. The weights of inputs in the output layer neurons are adjusted based on the error, as in a single perceptron. Then, the information on the error propagates from the output layer neurons to the neurons in the preceding layer, proportionally to the weight of the link between a given hidden neuron's output and the input of the output layer neuron. It is then used to adjust the weights of the inputs of the neurons in the hidden layer. The contribution to the overall error propagates backwards through all hidden layers until the weights of the first hidden layer are adjusted.

In general, any given function can be approximated using a neural network that is sufficiently large, both in terms of the number of layers and the number of neurons in a layer [128]. However, a big network can overfit if given a finite training set. Thus, the choice of the number of layers and neurons is essential in constructing the networks. Usually, networks consisting of only one hidden layer are used. Another important aspect of the construction of the neural network is the choice of the exact form of the training rule, i.e., the function relating the update of the weights to the error of prediction. The most popular is the delta rule, in which the weight change is proportional to the difference between the predicted and expected output and to the input value. The proportionality constant determines the learning rate, i.e., the magnitude of the steps used in adjusting the weights. Too large a learning rate may prevent the convergence of the training to the minimal error, while too small a rate increases the computational overhead. A variable learning rate, decreasing with time, may also be used. The next choice in constructing the MLP is the selection of the neuron transfer function, i.e., the function relating the product of the input and weight vectors to the output. Typically, a sigmoid function is used. Finally, the initial values of the weights of the links between neurons have to be set. The general consensus is that small-magnitude random numbers should be used. The MLP neural networks have shown their usefulness in a wide range of QSAR applications, where linear models often fail. In a human intestinal absorption study [83], MLP outperformed the MLR model. However, single MLP networks have been shown inferior to ensembles of such networks in prediction of antifilarial activity, GABAA receptor binding and inhibition of dihydrofolate reductase [129]. In an antibacterial activity study [118], MLP performed better than LDA. This type of network has also been applied to the prediction of logP [104], faring better than linear regression and comparably to decision trees. Models of HIV reverse transcriptase inhibitors and E. coli dihydrofolate reductase inhibitors [130] constructed with MLP were better than those relying on MLR or PLS. MLP has also been employed to predict carcinogenic activity [82], aquatic toxicity [131], antimalarial activity and binding affinities to the platelet-derived growth factor receptor [132].
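As a minimal sketch, the scikit-learn MLP below maps synthetic descriptor vectors to a non-linear activity; the single hidden layer of ten neurons and the other settings are arbitrary illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 6))                    # descriptor vectors
y = np.sin(X[:, 0]) + X[:, 1] ** 2               # non-linear synthetic activity

# One hidden layer of 10 neurons; the weights are trained by propagating
# the prediction error backwards through the network.
mlp = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
mlp.fit(X[:150], y[:150])
print("test R^2:", round(mlp.score(X[150:], y[150:]), 3))
```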

Fig. (5). Multi-Layer Perceptron neural network with two hidden layers, a single output neuron and four descriptors as input. Each neuron computes an inner product of its weights wij with its input, extended with a bias term. A binary step transfer function is applied in each neuron following the calculation of the inner product.

Radial-Basis Function Neural Networks

The Radial-Basis Function (RBF) [133] neural networks consist of an input layer, a single hidden layer and an output layer. Contrary to MLP-ANNs, the neurons in the hidden layer do not compute their output based on the product of the weights and the input values. Instead, each hidden layer neuron is defined by its center, i.e., a point in the feature space of descriptors. The output of the neuron is calculated as a function of the distance between the input compound in the descriptor space and the point constituting the neuron's center. Typically, the Gaussian function is used, although in principle some other function of the distance may be applied. The output neuron is of the perceptron type, producing as output a transfer function of the product of the output values of the RBF neurons and their weights. Several parameters are adjusted during the training of the RBF network. A number of RBF neurons has to be created, and their centers and scalings of distance, i.e., widths, defined. In the case of the Gaussian function, these parameters correspond to the mean and standard deviation, respectively. The simplest approach is to create as many neurons as there are compounds in the training set and set the centers to the co-ordinates of the given examples. Alternatively, the training set may be clustered into a number of groups, e.g. by the k-means method, and the group centers and widths used. The orthogonal least squares method [134] can also be employed. This procedure selects a subset of training examples to be used as centers, by sequentially adding the examples that best contribute to explaining the variance of the activity to be predicted. Once all RBF neurons have been trained, the weights of the connections between them and the output neuron have to be established. This can be done by an analytical procedure. The RBF-ANNs have been used in prediction of COX-2 inhibition [103], yielding error lower than MLR but higher than SVM. In prediction of physicochemical properties, such as the O-H bond dissociation energy in substituted phenols [91] and capillary electrophoresis absolute mobilities [96], RBF-ANNs showed accuracy worse than the non-linear SVM, but sometimes better than the linear SVM.
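A compact sketch of this pipeline, assuming its simplest variant: every training compound becomes a center, a single shared Gaussian width is used, and the output weights are obtained analytically by least squares.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.uniform(-2, 2, size=(60, 2))             # descriptor vectors
y = np.sin(X[:, 0]) * X[:, 1]                    # synthetic activity

centers, width = X.copy(), 0.8                   # one RBF neuron per compound

def design_matrix(points):
    """Gaussian activations of every RBF neuron for the given points."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(d ** 2) / (2 * width ** 2))

# With centers and widths fixed, the output weights follow analytically
# from a linear least-squares problem.
H = design_matrix(X)
weights, *_ = np.linalg.lstsq(H, y, rcond=None)

X_new = rng.uniform(-2, 2, size=(5, 2))
print("predictions:", design_matrix(X_new) @ weights)
```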


4.2.4. Decision Trees

Decision Trees (DT) [135, 136] differ from most statistically-based classification and regression algorithms by their connection to logic-based and expert systems. In fact, each classification tree can be translated into a set of predictive rules in Boolean logic. The DT classification model consists of a tree-like structure built of nodes and links. Nodes are linked hierarchically, with several child nodes branching from a common parent node. A node with no child nodes is called a leaf. Typically, in each node, a test using a single descriptor is made. Based on the result of the test, the algorithm is directed to one of the child nodes branching from the parent. In the child node, another test is performed, and further traversal of the tree towards the leaves is carried out. The final decision is based on the activity class associated with the leaf. Thus, the whole decision process is based on a series of simple tests, with the results guiding the path from the root of the tree to a leaf, as depicted in Fig. 6.

The training of the model consists of the incremental addition of nodes. The process starts with choosing the test for the root node. A test that optimally separates the compounds into the appropriate activity classes is chosen. If such a test perfectly discriminates the classes, no further nodes are necessary. However, in most cases, a single test is not sufficient: a group of compounds corresponding to one outcome of the test contains examples from different activity classes. Thus, an iterative procedure is utilized, starting with the root node as the current node. At each step, the current node is examined against the criteria for being a leaf. A node may become a leaf if the compounds directed to it by the traversal from the root fall into a single activity category, or if at least one category forms a clear majority. Otherwise, if the compounds are distributed between several classes, the optimal discriminating criterion is selected. The results of the discrimination form child nodes linked to the current node. Since the decision in the current node introduces new discriminatory information, the subsets of compounds in the child nodes correspond more homogeneously to single classes. Thus, after several node-splitting operations, the leaves can be created. Upon creation of child nodes, they are added to the list of nodes waiting to be assessed, and the first node from this list is chosen as the next current node for evaluation. One should note that the tests carried out in nodes at the same level of the tree may differ between the different branches of the tree, following the different class distributions at the nodes. There are several considerations in the development of decision trees. First, a method for choosing the test to be performed at a node is necessary. As the test usually involves the values of a single descriptor, the descriptor ranking criteria outlined in section 3.1 may be employed. Once the descriptor is chosen, the decision rule must be introduced, e.g. a threshold that separates the compounds from two activity classes.

The DT method may lead to overfitting on the training set if the tree is allowed to grow until the nodes consist purely of one class. Thus, early stopping is used once the nodes are sufficiently pure. Moreover, pruning of the constructed tree [137], i.e., removal of some overspecialized leaves, may be introduced to increase the generalization capabilities of the tree [136]. In general, DT methods usually offer suboptimal error rates compared to other non-linear methods, in particular due to the reliance on a single feature in each node. Nonetheless, they are popular in the QSAR domain for their ease of interpretability. The tree effectively combines the training process with descriptor selection, limiting the complexity of the model to be analyzed. Furthermore, since several leaves in the tree may correspond to a single activity class, they allow for inspection of different paths leading to the same activity.

Fig. (6). Classification using a decision tree based on three molecular descriptors. The traversal of the tree, on the path from the root to the leaf node defining the final decision, leads through a set of simple tests using the values of molecular descriptors.
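A minimal sketch: fitting a depth-limited tree on synthetic descriptors and printing the learned rules, which illustrates the interpretability discussed above. The depth limit acts as a simple form of the early stopping mentioned; the data and settings are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(9)
X = rng.normal(size=(150, 4))
y = ((X[:, 0] > 0.5) & (X[:, 2] < 0)).astype(int)   # synthetic activity rule

# Limiting the depth is a simple guard against overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Each path from the root to a leaf reads as a Boolean rule over descriptors.
print(export_text(tree, feature_names=[f"d{i}" for i in range(4)]))
```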

The decision trees also handle regression problems [138]. This is done by associating each leaf with a numerical value instead of the categorical class. Furthermore, the criteria of splitting the node to form child nodes are based on the variance of the compounds in that node. Decision Trees have been tested in a study [110] on a wide range of targets, including COX-2 inhibition, bloodbrain barrier permeability, toxicity, multidrug resistance reversal, CDK-2 antagonist activity, dopamine binding affinity, logD and binding to an unspecified channel protein. It performed worse than support vector machines and ensembles of decision trees, but often better than PLS or naive bayes classifier. In a logP study [104], it showed results comparable to MLP-ANNs and better than MLR. In a study concerning P-glycoprotein transport activity [94], DT was worse than both SVM and neural networks, but better than k-NN method. In various datasets related to ecotoxicity [121], decision trees usually achieved lower error than LDA or k-NN methods. Other studies employing decision trees, including anti-HIV activity [139], toxicity [140] and oral absorption [141] have been conducted. 4.2.5. Support Vector Machines The Support Vector Machines (SVM) [142-144] form a group of methods stemming from the structural risk minimization principle, with the linear support vector classifier as its most basic member. The SVC aims at creating a decision hyperplane that maximizes the margin, i.e., the distance from the hyperplane to the nearest examples from each of the classes. This allows for formulating the classifier training as a constrained optimization problem. Importantly, the objective function is unimodal, contrary to e.g. neural networks, and thus can be optimized effectively to global optimum. In the simplest case, compounds from different classes can be separated by linear hyperplane; such hyperplane is defined solely by its nearest compounds from the training set. Such compounds are referred to as support vectors, giving the name to the whole method. The core definitions behind the SVM classification are illustrated in Fig. 7. In most cases, however, no linear separation is possible. To take account of this problem, slack variables are introduced. These variables are associated with the misclassified compounds and, in conjunction with the





Fig. (7). Support vectors and margins in linearly separable (a) and non-separable (b) problems. In the non-separable case, negative margins are encountered, and their magnitude is subject to optimization along with the magnitude of the positive margins.
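To make the notions of margin and support vectors concrete, a minimal sketch follows (our illustration, assuming the scikit-learn library; the two-descriptor toy data are invented):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Invented descriptors for two overlapping activity classes.
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
               rng.normal(2.5, 1.0, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

# C trades margin width against the penalty on the slack variables.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors (points on or inside the margin, including
# any misclassified compounds) determine the separating hyperplane.
print("support vectors per class:", clf.n_support_)
print("hyperplane: w =", clf.coef_[0], " b =", clf.intercept_[0])
```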

The SVC can be transformed into a non-linear classifier by employing a kernel function [144, 145]. The kernel function introduces an implicit mapping from the original descriptor space to a high- or even infinite-dimensional space; a linear hyperplane in the kernel space may be highly non-linear in the original space. Thus, by training a linear classifier in the kernel space, a classifier that is non-linear with respect to the descriptor space is obtained. The strength of the kernel function stems from the fact that, while the positions of compounds in the kernel space may not be explicitly computed, their dot products can be obtained easily. As the SVC algorithm uses the compound descriptors only within dot products, this allows for computing the decision boundary in the kernel space. One of two kernel functions is typically used: the polynomial kernel or the radial-basis function kernel. The SVC method has been shown to exhibit low overtraining and thus generalizes well to previously unseen compounds. It is also relatively robust when only a small number of examples of each class is available.

The SVM methods have been extended into Support Vector Regression (SVR) [146] to handle regression problems. By using techniques similar to those of the SVC, e.g., slack variables and kernel functions, an accurate non-linear mapping between the activity and the descriptors can be found. However, contrary to typical regression methods, predicted values are penalized only if their absolute error exceeds a certain user-specified threshold, and thus the regression model is not optimal in terms of the least-squares error.

The SVM method has been shown to exhibit low prediction error in QSAR [147]. Studies of P-glycoprotein substrates used SVMs [93, 94], with results more accurate than neural networks, decision trees and k-NN. A study focused on prediction of drug likeness [120] has shown lower prediction error for SVM than for bagging ensembles and for linear methods. In a study involving COX-2 inhibition and aquatic toxicity [103], SVM outperformed MLR and RBF neural networks. An extensive comparison of support vector machines with other machine learning methods was conducted [110] using a wide range of endpoints; in this study, SVM was usually better than k-NN, decision trees and linear methods, but slightly inferior to boosting and random forest ensembles. SVMs have also been tested on ADME properties, including modeling of human intestinal absorption [81, 93], binding affinities to human serum albumin [95] and cytochrome P450 inhibition [65]. Studies focused on hemostatic factors have employed support vector machines, e.g., modeling thrombin binding [72] and factor Xa binding [76]. Adverse drug effects, such as carcinogenic potency [148] and Torsade de Pointes [93], were analyzed using the SVM method. Properties such as the O-H bond dissociation energy in substituted phenols [91] and absolute capillary electrophoresis mobilities [96] have also been studied using support vector machines, which exhibited higher accuracy than linear regression and RBF neural networks. Other properties predicted with SVM include heat capacity [97] and the capacity factor (logk) of peptides in high-performance liquid chromatography [149].
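A brief sketch of both extensions, the kernelized classifier and SVR, again assuming scikit-learn and invented toy data rather than any of the cited datasets:

```python
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))  # mock descriptor matrix

# Classification with an RBF kernel: the class boundary below is a
# sphere in descriptor space, which no linear hyperplane can match.
y_class = (np.linalg.norm(X, axis=1) > 1.5).astype(int)
clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y_class)
print("training accuracy:", clf.score(X, y_class))

# Regression: absolute errors below epsilon incur no penalty at all,
# so the fit is deliberately not a least-squares solution.
y_activity = X[:, 0] ** 2 + 0.1 * rng.normal(size=60)  # mock activity
reg = SVR(kernel="rbf", epsilon=0.1, C=10.0).fit(X, y_activity)
print("predicted activity of first compound:", reg.predict(X[:1])[0])
```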


4.2.6. Ensemble Methods

The traditional approach to QSAR analysis focused on constructing a single predictive model. Recently, methods utilizing a combination, or ensemble [150], of models for improving the prediction have been proposed. A small ensemble of three classifiers is depicted in Fig. 8.

Fig. (8). Individual decisions of three binary classifiers and the resulting classifier ensemble with a more accurate decision boundary.

Bagging

The Bagging method [151] improves the quality of prediction by creating a set of base models constructed using the same algorithm, yet with varying training sets. Before training each base model, the original training set is subjected to sampling with replacement, leading to a group of bootstrap replicas of the original training set. The decisions of the models trained on the replicas are averaged to create the final result. The strength of the bagging method stems from its ability to stabilize the classifier by learning on different samples of the original distribution. In QSAR, bagging and similar ensembles were used with multi-layer perceptrons for prediction of antifilarial activity, binding affinities to GABAA receptors and inhibitors of dihydrofolate reductase [129], yielding lower error than a single MLP neural network. The k-NN and decision tree methods were used as base models in bagging for prediction of drug likeness [120], showing results slightly worse than SVM, but better than LDA.

Random Subspace Method

The Random Subspace Method [152] is another ensemble scheme aiming at stabilizing the base model. Here, the whole training set is used for training each model; however, to enforce diversity, only randomly generated subsets of the descriptors are used. The most notable example of the random subspace approach is the Random Forest method [153], which uses decision trees as base models. This method was recently proposed for use in QSAR in an extensive study including inhibitors of COX-2, blood-brain barrier permeability, estrogen receptor binding activity, multidrug resistance reversal, dopamine binding affinity and P-glycoprotein transport activity [154], yielding better results than a single decision tree and than PLS. In an extended study by the same team [110], it fared better than SVM and k-NN.
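The two schemes differ mainly in where randomness is injected; a hedged sketch, assuming scikit-learn and a mock dataset in place of real descriptors:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Mock QSAR data: 200 compounds, 20 descriptors, binary activity.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Bagging: each tree (the default base model) sees a bootstrap replica
# of the compounds but all descriptors.
bag = BaggingClassifier(n_estimators=50, random_state=0)

# Random forest: bootstrap replicas plus a random subset of descriptors
# considered at every split (max_features).
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                random_state=0)

for name, model in [("bagging", bag), ("random forest", forest)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```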

Boosting

A more elaborate ensemble scheme is introduced in the Boosting algorithm [155, 156]. In training the base models, it uses all the descriptors and all the training compounds; however, each compound is assigned a weight. Initially, the weights are uniform. After training a base model, its prediction error is evaluated and the weights of the incorrectly predicted compounds are increased. This focuses the next base model on the previously misclassified examples, even at the cost of making errors on those classified correctly before. Thus, the compounds with activity hardest to predict obtain more attention from the ensemble model. The advantage of boosting over other ensemble methods is that it allows for the use of relatively simple and erroneous base models. Similar to the SVM classifier, the power of boosting stems from its ability to create decision boundaries maximizing the margin [157]. In QSAR, boosting employing a decision tree as the base model has recently been shown useful in modeling COX-2 inhibition, estrogen and dopamine receptor binding, multidrug resistance reversal, CDK-2 antagonist activity, BBB permeability, logD and P-glycoprotein transport activity [110], showing lower prediction error than k-NN, SVM, PLS, a decision tree and the naive Bayes classifier in all cases but one, the P-glycoprotein dataset. In comparison with the Random Forest method, it was better on several datasets but worse on others. In another study, a simple decision stump was used as the base model in boosting for human intestinal absorption prediction [90].
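A minimal boosting sketch, assuming scikit-learn; AdaBoost's default base model is already a depth-1 tree (a decision stump), matching the reweighting scheme described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Mock data standing in for a QSAR classification problem.
X, y = make_classification(n_samples=200, n_features=20, random_state=1)

# After every round AdaBoost re-weights the compounds, so the next
# stump concentrates on those the ensemble still misclassifies.
boost = AdaBoostClassifier(n_estimators=100, random_state=1)
print("cross-validated accuracy:", cross_val_score(boost, X, y, cv=5).mean())
```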



5. CONCLUDING REMARKS

The chemoinformatic methods underlying QSAR analysis are in constant advancement. Well-established techniques continue to be used, providing successful results especially on small, homogeneous datasets consisting of compounds relating to a single mode of action. These techniques include linear methods such as partial least squares. Classical non-linear methods, e.g., artificial neural networks, also remain popular. However, the need for rapid, accurate assessment of large sets of compounds is shifting attention to novel techniques from the pattern recognition and machine learning fields. Two such methods, both relying on the concept of margin maximization, have recently gained attention from the QSAR community: support vector machines and ensemble techniques. Recent studies have shown that both yield small prediction error in numerous QSAR applications.

Given the complexity of these methods, one may be tempted to treat them as black boxes. Yet, as recently shown, only careful model selection and tuning allows for optimal prediction accuracy [120]. Thus, adoption of novel methods should be preceded by extensive studies. Moreover, within the machine learning and pattern recognition fields, the interpretability of the model is usually not of utmost importance; adopting emerging techniques from these fields may therefore require substantial effort to develop methods for interpreting the created models.

A similar situation is encountered in preparing the molecular descriptors to be used. The number of available descriptors reaches thousands in some leading commercial tools. Having at hand powerful methods for automatically selecting informative features, one may be tempted to leave the descriptor selection process entirely to algorithmic techniques. While this may lead to high accuracy of the model, the chosen descriptors often do not give clear insight into the structure-activity relationship.

Throughout the review, we have focused on predicting the activity or property given the values of the descriptors for a compound. The inverse problem of finding compounds with a desired activity and properties has also attracted attention. Such an inverse-QSAR formulation directly addresses the goal of drug design, i.e., the discovery of active compounds with good pharmacokinetic and other properties. Authors such as Kvasnicka [158] and Lewis [14] have published algorithms in this direction. Significant insight has also been given by Kier and Hall [159] and Zefirov [160]. Galvez and co-workers [161] have shown that topological indices are particularly suited to this aim. The reason is that, whereas conventional physical and geometrical descriptors are structure-related, topological indices are an algebraic description of the structure itself. Thus, one can go back and forth between structure and property, predicting properties from structures and structures from properties. Since methods for solving the inverse problem are not yet widely adopted, the creation of QSAR models remains the main task in computer-aided drug discovery.

In general, the adoption of novel, more accurate QSAR modeling techniques does not reduce the responsibility of the investigator. On the contrary, the more complex and optimized the model, the more caution it requires during application. Combined with the increasing complexity of the inspected datasets, this makes QSAR analysis a challenging endeavor.

REFERENCES
[1] Bleicher, K.H.; Boehm, H.-J.; Mueller, K.; Alanine, A.I. Nat. Rev. Drug Discov., 2003, 2, 369-378.
[2] Gershell, L.J.; Atkins, J.H. Nat. Rev. Drug Discov., 2003, 2, 321-327.
[3] Goodnow, R.; Guba, W.; Haap, W. Comb. Chem. High Throughput Screen., 2003, 6, 649-660.
[4] Shoichet, B.K. Nature, 2004, 432, 862-865.
[5] Stahura, F.L.; Bajorath, J. Comb. Chem. High Throughput Screen., 2004, 7, 259-269.
[6] Hansch, C.; Fujita, T. J. Am. Chem. Soc., 1964, 86, 1616-1626.
[7] Bajorath, J. Nat. Rev. Drug Discov., 2002, 1, 882-894.
[8] Pirard, B. Comb. Chem. High Throughput Screen., 2004, 7, 271-280.
[9] Hodgson, J. Nat. Biotechnol., 2001, 19, 722-726.
[10] van de Waterbeemd, H.; Gifford, E. Nat. Rev. Drug Discov., 2003, 2, 192-204.
[11] Proudfoot, J.R. Bioorg. Med. Chem. Lett., 2002, 12, 1647-1650.
[12] Roche, O.; Schneider, P.; Zuegge, J.; Guba, W.; Kansy, M.; Alanine, A.; Bleicher, K.; Danel, F.; Gutknecht, E.M.; Rogers-Evans, M.; Neidhart, W.; Stalder, H.; Dillon, M.; Sjogren, E.; Fotouhi, N.; Gillepsie, P.; Goodnow, R.; Harris, W.; Jones, P.; Taniguchi, M.; Tsujii, S.; von der Saal, W.; Zimmermann, G.; Schneider, G. J. Med. Chem., 2002, 45, 137-142.
[13] Rusinko, A.; Young, S.S.; Drewry, D.H.; Gerritz, S.W. Comb. Chem. High Throughput Screen., 2002, 5, 125-133.
[14] Lewis, R.A. J. Med. Chem., 2005, 48, 1638-1648.
[15] Mulliken, R.S. J. Chem. Phys., 1955, 23, 1833-1840.
[16] Cammarata, A. J. Med. Chem., 1967, 10, 525-552.
[17] Stanton, D.T.; Egolf, L.M.; Jurs, P.C.; Hicks, M.G. J. Chem. Inf. Comput. Sci., 1992, 32, 306-316.
[18] Klopman, G. J. Am. Chem. Soc., 1968, 90, 223-234.
[19] Zhou, Z.; Parr, R.G. J. Am. Chem. Soc., 1990, 112, 5720-5724.
[20] Wiener, H. J. Am. Chem. Soc., 1947, 69, 17-20.
[21] Randic, M. J. Am. Chem. Soc., 1975, 97, 6609-6615.
[22] Balaban, A.T. Chem. Phys. Lett., 1982, 89, 399-404.
[23] Schultz, H.P. J. Chem. Inf. Comput. Sci., 1989, 29, 227-228.
[24] Kier, L.B.; Hall, L.H. J. Pharm. Sci., 1981, 70, 583-589.
[25] Galvez, J.; Garcia, R.; Salabert, M.T.; Soler, R. J. Chem. Inf. Comput. Sci., 1994, 34, 520-525.
[26] Pearlman, R.S.; Smith, K. Perspect. Drug Discov. Des., 1998, 9-11, 339-353.
[27] Stanton, D.T. J. Chem. Inf. Comput. Sci., 1999, 39, 11-20.
[28] Burden, F. J. Chem. Inf. Comput. Sci., 1989, 29, 225-227.
[29] Estrada, E. J. Chem. Inf. Comput. Sci., 1996, 36, 844-849.
[30] Estrada, E.; Uriarte, E. SAR QSAR Environ. Res., 2001, 12, 309-324.
[31] Hall, L.H.; Kier, L.B. Quant. Struct.-Act. Relat., 1991, 10, 43-48.
[32] Hall, L.H.; Kier, L.B. J. Chem. Inf. Comput. Sci., 2000, 40, 784-791.
[33] Labute, P. J. Mol. Graph. Model., 2000, 18, 464-477.
[34] Higo, J.; Go, N. J. Comput. Chem., 1989, 10, 376-379.
[35] Katritzky, A.R.; Mu, L.; Lobanov, V.S.; Karelson, M. J. Phys. Chem., 1996, 100, 10400-10407.
[36] Rohrbaugh, R.H.; Jurs, P.C. Anal. Chim. Acta, 1987, 199, 99-109.
[37] Pearlman, R.S. In Physical Chemical Properties of Drugs; Yalkowsky, S.H.; Sinkula, A.A.; Valvani, S.C., Eds.; Marcel Dekker: New York, 1988.
[38] Weiser, J.; Weiser, A.A.; Shenkin, P.S.; Still, W.C. J. Comput. Chem., 1998, 19, 797-808.
[39] Barnard, J.M.; Downs, G.M. J. Chem. Inf. Comput. Sci., 1997, 37, 141-142.
[40] http://www.mdl.com/
[41] Durant, J.L.; Leland, B.A.; Henry, D.R.; Nourse, J.G. J. Chem. Inf. Comput. Sci., 2002, 42, 1273-1280.
[42] Waller, C.L.; Bradley, M.P. J. Chem. Inf. Comput. Sci., 1999, 39, 345-355.
[43] Winkler, D.; Burden, F.R. Quant. Struct.-Act. Relat., 1998, 17, 224-231.
[44] Maurer, W.D.; Lewis, T.G. ACM Comput. Surv., 1975, 7, 5-19.
[45] http://www.daylight.com/
[46] Deshpande, M.; Kuramochi, M.; Wale, N.; Karypis, G. IEEE Trans. Knowledge Data Eng., 2005, 17, 1036-1050.
[47] Guner, O.F. Curr. Top. Med. Chem., 2002, 2, 1321-1332.
[48] Akamatsu, M. Curr. Top. Med. Chem., 2002, 2, 1381-1394.
[49] Lemmen, C.; Lengauer, T. J. Comput.-Aided Mol. Des., 2000, 14, 215-232.
[50] Dove, S.; Buschauer, A. Quant. Struct.-Act. Relat., 1999, 18, 329-341.
[51] Cramer, R.D.; Patterson, D.E.; Bunce, J.D. J. Am. Chem. Soc., 1988, 110, 5959-5967.
[52] Klebe, G.; Abraham, U.; Mietzner, T. J. Med. Chem., 1994, 37, 4130-4146.
[53] Silverman, B.D.; Platt, D.E. J. Med. Chem., 1996, 39, 2129-2140.
[54] Todeschini, R.; Lasagni, M.; Marengo, E. J. Chemom., 1994, 8, 263-272.
[55] Todeschini, R.; Gramatica, P. Perspect. Drug Discov. Des., 1998, 9-11, 355-380.
[56] Bravi, G.; Gancia, E.; Mascagni, P.; Pegna, M.; Todeschini, R.; Zaliani, A. J. Comput.-Aided Mol. Des., 1997, 11, 79-92.
[57] Cruciani, G.; Crivori, P.; Carrupt, P.-A.; Testa, B. J. Mol. Struct.: THEOCHEM, 2000, 503, 17-30.
[58] Crivori, P.; Cruciani, G.; Carrupt, P.-A.; Testa, B. J. Med. Chem., 2000, 43, 2204-2216.
[59] Pastor, M.; Cruciani, G.; McLay, I.; Pickett, S.; Clementi, S. J. Med. Chem., 2000, 43, 3233-3243.
[60] Cho, S.J.; Tropsha, A. J. Med. Chem., 1995, 38, 1060-1066.
[61] Cho, S.J.; Tropsha, A.; Suffness, M.; Cheng, Y.C.; Lee, K.H. J. Med. Chem., 1996, 39, 1383-1395.
[62] Estrada, E.; Molina, E.; Perdomo-Lopez, J. J. Chem. Inf. Comput. Sci., 2001, 41, 1015-1021.
[63] de Julian-Ortiz, J.V.; de Gregorio Alapont, C.; Rios-Santamarina, I.; Garcia-Domenech, R.; Galvez, J. J. Mol. Graphics Modell., 1998, 16, 14-18.
[64] Guyon, I.; Elisseeff, A. J. Mach. Learn. Res., 2003, 3, 1157-1182.
[65] Merkwirth, C.; Mauser, H.; Schulz-Gasch, T.; Roche, O.; Stahl, M.; Lengauer, T. J. Chem. Inf. Comput. Sci., 2004, 44, 1971-1978.
[66] Guha, R.; Jurs, P.C. J. Chem. Inf. Comput. Sci., 2004, 44, 2179-2189.
[67] Gallegos, A.; Girone, X. J. Chem. Inf. Comput. Sci., 2004, 44, 321-326.
[68] Verdu-Andres, J.; Massart, D.L. Appl. Spectrosc., 1998, 52, 1425-1434.
[69] Farkas, O.; Heberger, K. J. Chem. Inf. Model., 2005, 45, 339-346.
[70] Heberger, K.; Rajko, R. J. Chemom., 2002, 16, 436-443.
[71] Rajko, R.; Heberger, K. Chemom. Intell. Lab. Syst., 2001, 57, 1-14.
[72] Liu, Y. J. Chem. Inf. Comput. Sci., 2004, 44, 1823-1828.
[73] Venkatraman, V.; Dalby, A.R.; Yang, Z.R. J. Chem. Inf. Comput. Sci., 2004, 44, 1686-1692.
[74] Lin, T.-H.; Li, H.-T.; Tsai, K.-C. J. Chem. Inf. Comput. Sci., 2004, 44, 76-87.
[75] Massey, F.J. J. Amer. Statistical Assoc., 1951, 46, 68-78.
[76] Byvatov, E.; Schneider, G. J. Chem. Inf. Comput. Sci., 2004, 44, 993-999.
[77] Kohavi, R.; John, G. Artif. Intell., 1997, 97, 273-324.
[78] Michalewicz, Z. Genetic Algorithms + Data Structures = Evolution Programs, 3rd ed.; Springer-Verlag: London, UK, 1996.
[79] Siedlecki, W.; Sklansky, J. Int. J. Pattern Recog. Artif. Intell., 1988, 2, 197-220.
[80] Siedlecki, W.; Sklansky, J. Pattern Recog. Lett., 1989, 10, 335-347.
[81] Wegner, J.K.; Frohlich, H.; Zell, A. J. Chem. Inf. Comput. Sci., 2004, 44, 921-930.
[82] Hemmateenejad, B.; Safarpour, M.A.; Miri, R.; Nesari, N. J. Chem. Inf. Model., 2005, 45, 190-199.
[83] Wessel, M.D.; Jurs, P.C.; Tolan, J.W.; Muskal, S.M. J. Chem. Inf. Comput. Sci., 1998, 38, 726-735.
[84] Baurin, N.; Mozziconacci, J.-C.; Arnoult, E.; Chavatte, P.; Marot, C.; Morin-Allory, L. J. Chem. Inf. Comput. Sci., 2004, 44, 276-285.
[85] Kirkpatrick, S.; Gelatt, C.D., Jr.; Vecchi, M.P. In Readings in Computer Vision: Issues, Problems, Principles, and Paradigms; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1987.
[86] Itskowitz, P.; Tropsha, A. J. Chem. Inf. Model., 2005, 45, 777-785.
[87] Sutter, J.M.; Dixon, S.L.; Jurs, P.C. J. Chem. Inf. Comput. Sci., 1995, 35, 77-84.
[88] Kittler, J. Pattern Recognition and Signal Processing, 1978, E29, 41-60.
[89] Bi, J.; Bennet, K.; Embrechts, M.; Breneman, C.; Song, M. J. Mach. Learn. Res., 2003, 3, 1229-1243.
[90] Wegner, J.K.; Frohlich, H.; Zell, A. J. Chem. Inf. Comput. Sci., 2004, 44, 931-939.
[91] Xue, C.X.; Zhang, R.S.; Liu, H.X.; Yao, X.J.; Liu, M.C.; Hu, Z.D.; Fan, B.T. J. Chem. Inf. Comput. Sci., 2004, 44, 669-677.
[92] Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Mach. Learn., 2002, 46, 389-422.
[93] Xue, Y.; Li, Z.R.; Yap, C.W.; Sun, L.Z.; Chen, X.; Chen, Y.Z. J. Chem. Inf. Comput. Sci., 2004, 44, 1630-1638.
[94] Xue, Y.; Yap, C.W.; Sun, L.Z.; Cao, Z.W.; Wang, J.F.; Chen, Y.Z. J. Chem. Inf. Comput. Sci., 2004, 44, 1497-1505.
[95] Xue, C.X.; Zhang, R.S.; Liu, H.X.; Yao, X.J.; Liu, M.C.; Hu, Z.D.; Fan, B.T. J. Chem. Inf. Comput. Sci., 2004, 44, 1693-1700.
[96] Xue, C.X.; Zhang, R.S.; Liu, M.C.; Hu, Z.D.; Fan, B.T. J. Chem. Inf. Comput. Sci., 2004, 44, 950-957.
[97] Xue, C.X.; Zhang, R.S.; Liu, H.X.; Liu, M.C.; Hu, Z.D.; Fan, B.T. J. Chem. Inf. Comput. Sci., 2004, 44, 1267-1274.
[98] Senese, C.L.; Duca, J.; Pan, D.; Hopfinger, A.J.; Tseng, Y.J. J. Chem. Inf. Comput. Sci., 2004, 44, 1526-1539.
[99] Trohalaki, S.; Pachter, R.; Geiss, K.; Frazier, J. J. Chem. Inf. Comput. Sci., 2004, 44, 1186-1192.
[100] Roy, K.; Ghosh, G. J. Chem. Inf. Comput. Sci., 2004, 44, 559-567.
[101] Hou, T.J.; Zhang, W.; Xia, K.; Qiao, X.B.; Xu, X.J. J. Chem. Inf. Comput. Sci., 2004, 44, 1585-1600.
[102] Hou, T.J.; Xia, K.; Zhang, W.; Xu, X.J. J. Chem. Inf. Comput. Sci., 2004, 44, 266-275.
[103] Yao, X.J.; Panaye, A.; Doucet, J.P.; Zhang, R.S.; Chen, H.F.; Liu, M.C.; Hu, Z.D.; Fan, B.T. J. Chem. Inf. Comput. Sci., 2004, 44, 1257-1266.
[104] Tino, P.; Nabney, I.T.; Williams, B.S.; Losel, J.; Sun, Y. J. Chem. Inf. Comput. Sci., 2004, 44, 1647-1653.
[105] Wold, S.; Ruhe, A.; Wold, H.; Dunn, W. SIAM J. Sci. Stat. Comput., 1984, 5, 735-743.
[106] Wold, S.; Sjostrom, M.; Eriksson, L. Chemom. Intell. Lab. Syst., 2001, 58, 109-130.
[107] Phatak, A.; de Jong, S. J. Chemom., 1997, 11, 311-338.
[108] Zhang, H.; Li, H.; Liu, C. J. Chem. Inf. Model., 2005, 45, 440-448.
[109] Waller, C.L. J. Chem. Inf. Comput. Sci., 2004, 44, 758-765.
[110] Svetnik, V.; Wang, T.; Tong, C.; Liaw, A.; Sheridan, R.P.; Song, Q. J. Chem. Inf. Model., 2005, 45, 786-799.
[111] Clark, M. J. Chem. Inf. Model., 2005, 45, 30-38.
[112] Catana, C.; Gao, H.; Orrenius, C.; Stouten, P.F.W. J. Chem. Inf. Model., 2005, 45, 170-176.
[113] Sun, H. J. Chem. Inf. Comput. Sci., 2004, 44, 748-757.
[114] Adenot, M.; Lahana, R. J. Chem. Inf. Comput. Sci., 2004, 44, 239-248.
[115] Feng, J.; Lurati, L.; Ouyang, H.; Robinson, T.; Wang, Y.; Yuan, S.; Young, S.S. J. Chem. Inf. Comput. Sci., 2003, 43, 1463-1470.
[116] Fisher, R. Ann. Eugen., 1936, 7, 179-188.
[117] Guha, R.; Jurs, P.C. J. Chem. Inf. Model., 2005, 45, 65-73.
[118] Murcia-Soler, M.; Perez-Gimenez, F.; Garcia-March, F.J.; Salabert-Salvador, T.; Diaz-Villanueva, W.; Castro-Bleda, M.J.; Villanueva-Pareja, A. J. Chem. Inf. Comput. Sci., 2004, 44, 1031-1041.
[119] Molina, E.; Diaz, H.G.; Gonzalez, M.P.; Rodriguez, E.; Uriarte, E. J. Chem. Inf. Comput. Sci., 2004, 44, 515-521.
[120] Mueller, K.-R.; Raetsch, G.; Sonnenburg, S.; Mika, S.; Grimm, M.; Heinrich, N. J. Chem. Inf. Model., 2005, 45, 249-253.
[121] Mazzatorta, P.; Benfenati, E.; Lorenzini, P.; Vighi, M. J. Chem. Inf. Comput. Sci., 2004, 44, 105-112.
[122] Hawkins, D.M. J. Chem. Inf. Comput. Sci., 2004, 44, 1-12.
[123] Klon, A.E.; Glick, M.; Davies, J.W. J. Chem. Inf. Comput. Sci., 2004, 44, 2216-2224.
[124] Cover, T.; Hart, P. IEEE Trans. Inform. Theory, 1967, 13, 21-27.
[125] Jain, A.; Mao, J.; Mohiuddin, K. Computer, 1996, 29, 31-44.
[126] Rosenblatt, F. Psychol. Rev., 1958, 65, 386-408.
[127] Gallant, S. IEEE Trans. Neural Networks, 1990, 1, 179-191.
[128] Hornik, K.; Stinchcombe, M.; White, H. Neural Networks, 1989, 2, 359-366.
[129] Agrafiotis, D.K.; Cedeno, W.; Lobanov, V.S. J. Chem. Inf. Comput. Sci., 2002, 42, 903-911.
[130] Chiu, T.-L.; So, S.-S. J. Chem. Inf. Comput. Sci., 2004, 44, 154-160.
[131] Gini, G.; Craciun, M.V.; Konig, C. J. Chem. Inf. Comput. Sci., 2004, 44, 1897-1902.
[132] Guha, R.; Jurs, P.C. J. Chem. Inf. Model., 2005, 45, 800-806.
[133] Mulgrew, B. IEEE Sig. Proc. Mag., 1996, 13, 50-65.
[134] Chen, S.; Cowan, C.F.N.; Grant, P.M. IEEE Trans. Neural Networks, 1991, 2, 302-309.
[135] Quinlan, J.R. Mach. Learn., 1986, 1, 81-106.
[136] Gelfand, S.; Ravishankar, C.; Delp, E. IEEE Trans. Pattern Anal. Mach. Intell., 1991, 13, 163-174.
[137] Mingers, J. Mach. Learn., 1989, 4, 227-243.
[138] Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. Classification and Regression Trees; Wadsworth: California, 1984.
[139] Daszykowski, M.; Walczak, B.; Xu, Q.-S.; Daeyaert, F.; de Jonge, M.R.; Heeres, J.; Koymans, L.M.H.; Lewi, P.J.; Vinkers, H.M.; Janssen, P.A.; Massart, D.L. J. Chem. Inf. Comput. Sci., 2004, 44, 716-726.
[140] DeLisle, R.K.; Dixon, S.L. J. Chem. Inf. Comput. Sci., 2004, 44, 862-870.
[141] Bai, J.P.F.; Utis, A.; Crippen, G.; He, H.-D.; Fischer, V.; Tullman, R.; Yin, H.-Q.; Hsu, C.-P.; Jiang, L.; Hwang, K.-K. J. Chem. Inf. Comput. Sci., 2004, 44, 2061-2069.
[142] Cortes, C.; Vapnik, V. Mach. Learn., 1995, 20, 273-297.
[143] Vapnik, V. The Nature of Statistical Learning Theory; Springer-Verlag: New York, 1995.
[144] Burges, C. Data Min. Knowl. Discov., 1998, 2, 121-167.
[145] Boser, B.E.; Guyon, I.M.; Vapnik, V.N. In COLT '92: Proceedings of the Fifth Annual Workshop on Computational Learning Theory; ACM Press: New York, NY, USA, 1992.
[146] Smola, A.; Schoelkopf, B. Stat. Comput., 2004, 14, 199-222.
[147] Burbidge, R.; Trotter, M.; Buxton, B.F.; Holden, S.B. Comput. Chem., 2001, 26, 5-14.
[148] Helma, C.; Cramer, T.; Kramer, S.; De Raedt, L. J. Chem. Inf. Comput. Sci., 2004, 44, 1402-1411.
[149] Liu, H.X.; Xue, C.X.; Zhang, R.S.; Yao, X.J.; Liu, M.C.; Hu, Z.D.; Fan, B.T. J. Chem. Inf. Comput. Sci., 2004, 44, 1979-1986.
[150] Meir, R.; Raetsch, G. Lect. Notes Comput. Sci., 2003, 2600, 118-183.
[151] Breiman, L. Mach. Learn., 1996, 24, 123-140.
[152] Ho, T.K. IEEE Trans. Pattern Anal. Mach. Intell., 1998, 20, 832-844.
[153] Breiman, L. Mach. Learn., 2001, 45, 5-32.
[154] Svetnik, V.; Liaw, A.; Tong, C.; Culberson, C.; Sheridan, R.P.; Feuston, B.P. J. Chem. Inf. Comput. Sci., 2003, 43, 1947-1958.
[155] Freund, Y.; Schapire, R. J. Comput. Syst. Sci., 1997, 55, 119-139.
[156] Freund, Y.; Schapire, R. J. Japanese Soc. Artif. Intell., 1999, 14, 771-780.
[157] Schapire, R.E.; Freund, Y.; Bartlett, P.; Lee, W.S. Ann. Stat., 1998, 26, 1651-1686.
[158] Kvasnicka, V.; Pospichal, J. J. Chem. Inf. Comput. Sci., 1996, 36, 516-526.
[159] Kier, L.B.; Hall, L.H. Quant. Struct.-Act. Relat., 1993, 12, 383-388.
[160] Skvortsova, M.; Baskin, I.; Slovokhotova, O.; Palyulin, P.; Zefirov, N. J. Chem. Inf. Comput. Sci., 1993, 33, 630-634.
[161] Galvez, J.; Garcia-Domenech, R.; Bernal, J.M.; Garcia-March, F.J. An. Real Acad. Farm., 1991, 57, 533-546.

Received: November 1, 2005; Revised: November 15, 2005; Accepted: December 14, 2005
