
Quantitative Structure-Activity Relationships

Quantitative Structure-Property-Relationships

SAR/QSAR/QSPR modeling

Alexandre Varnek
Faculté de Chimie, ULP, Strasbourg, FRANCE
SAR/QSAR/QSPR models
• Development
• Validation
• Application
Classification and Regression models
• Development
• Validation
• Application
Development of the models
• Selection and curation of experimental data
• Preparation of training and test sets (optionally)
• Selection of an initial set of descriptors and their normalisation
• Variable selection (optionally)
• Selection of a machine-learning method

Validation of models
• Training/test set
• Cross-validation
- internal,
- external

Application of the Models


• Models Applicability Domain
Development of the models

• Experimental Data: selection and cleaning


• Descriptors
• Mathematical techniques
• Statistical criteria
Data selection: Congenericity problem

• The congenericity principle is the assumption that « similar compounds give similar responses ». This was the basic requirement of QSAR, and it concerns structurally homogeneous data sets.

• Nowadays, experimentalists mostly produce structurally diverse (non-congeneric) data sets.
Data cleaning:

• Similar experimental conditions
• Duplicates
• Structure standardization
• Removal of mixtures
• …
The importance of Chemical Data Curation

Dataset curation is crucial for any cheminformatics analysis (QSAR modeling, clustering, similarity search, etc.).

Currently, it is uncommon to describe procedures used for curation in research papers; procedures are implemented or employed differently in different groups.

We wish to emphasize the need to create and popularize a standardized curation strategy, applicable to any ensemble of compounds.
What about these structures? (real examples)
Why are duplicates unsafe for QSAR?
Duplicates are identical compounds present in a given dataset.
[Figure: three structure drawings recorded under ID = 256, ID = 879 and ID = 2346]


Manual identification of duplicates is practically impossible especially when the dataset is large.

Activity analysis of duplicates is also highly important to identify cases where one occurrence is
identified as ‘active’ and another one as ‘weak active’ or ‘inactive’.

[Figure: two structure drawings, one labelled ACTIVE and the other INACTIVE]
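The duplicate check described on this slide can be sketched in a few lines. This is a minimal illustration, assuming that every record already carries a canonical structure key (for example a canonical SMILES or an InChIKey produced by a cheminformatics toolkit). The function name `find_duplicates`, the keys and the activity labels are hypothetical placeholders, not data from the slides.

```python
from collections import defaultdict


def find_duplicates(records):
    """Group records by structure key and report keys that occur more than once.

    `records` is a list of (compound_id, structure_key, activity_label) tuples.
    The structure key is ASSUMED to be a canonical identifier computed beforehand.
    """
    groups = defaultdict(list)
    for compound_id, key, label in records:
        groups[key].append((compound_id, label))

    duplicates = {}
    for key, members in groups.items():
        if len(members) > 1:
            labels = {label for _, label in members}
            duplicates[key] = {
                "ids": [cid for cid, _ in members],
                # True when the same structure carries different activity labels
                "conflict": len(labels) > 1,
            }
    return duplicates


# Hypothetical records: (ID, canonical key, activity label)
records = [
    (1, "KEY-A", "active"),
    (2, "KEY-A", "inactive"),
    (3, "KEY-A", "active"),
    (4, "KEY-B", "inactive"),
]
report = find_duplicates(records)
```

Entries flagged with `conflict` are the cases that deserve manual inspection before any of the occurrences is kept.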
Structural standardization
For a given dataset, chemical groups have to be written in a standardized way, taking
into account critical properties (like pH) of the modeled system.
Aromatic compounds
[Figure: two drawings of the same aromatic compound (substituents OH, Cl and CH3)]

These two different representations of the same compound will lead to different descriptors, especially with certain fingerprint or fragmental approaches.

Carboxylic acids, nitro groups etc.

[Figure: alternative drawings of carboxylic acid and nitro groups attached to a substituent X, in neutral and charged forms]
For a given dataset, these functional groups have to be written in a consistent way to
avoid different descriptor values for the same chemical group.
Normalization of carboxylic, nitro groups, etc.
removal of inorganics

All inorganic compounds must be removed since our QSAR modeling strategy includes the calculation of molecular descriptors for organic compounds only.
This is an obvious limitation of the approach. However, the total fraction of inorganics in most available datasets is relatively small.

To detect inorganics, several solutions are available:

- Automatic identification, combining JChem (ChemAxon, cxcalc program) to output the empirical formula of all compounds with simple scripts that remove compounds with no carbon;

- Manual inspection of compounds possessing no carbon atom using Notepad++ tools.
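The "simple script" step can be sketched as follows. This is only an illustration, assuming the empirical formulas have already been exported as plain strings such as `C2H6O`, without parentheses or charges. The helper names `element_counts` and `has_carbon` are hypothetical. Element symbols are parsed properly, because a naive search for the letter "C" would wrongly match Cl or Ca.

```python
import re

# One element symbol (capital letter plus optional lowercase letter) and its count
_ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula):
    """Count atoms per element in a simple empirical formula such as 'C2H6O'."""
    counts = {}
    for symbol, number in _ELEMENT.findall(formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def has_carbon(formula):
    """Return True when the formula contains at least one carbon atom."""
    return element_counts(formula).get("C", 0) > 0


# Hypothetical formulas used only for illustration
formulas = {
    "ethanol": "C2H6O",
    "sodium chloride": "NaCl",
    "calcium chloride": "CaCl2",
}
organic = {name: f for name, f in formulas.items() if has_carbon(f)}
```

Compounds such as carbonates contain carbon and would pass this filter, so the manual inspection mentioned above remains useful.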
removal of mixtures

Fragments can be removed according to the number of constitutive atoms or the molecular weight.
removal of mixtures
However, some cases are particularly difficult to treat.
Examples from DILI - BIOWISDOM dataset:

ID=172
[Figure: initial form and the form cleaned by ChemAxon]
The two eliminated compounds could be active!

MANUAL INSPECTION/VALIDATION IS STILL CRUCIAL

ID=1700
[Figure: initial form and the form cleaned by ChemAxon]
OK.
removal of salts

The options Remove Fragments, Neutralize and Transform of ChemAxon Standardizer have to be used simultaneously for best results.
Aromatization and 2D cleaning
ChemAxon Standardizer offers two ways to aromatize benzene rings, both of them based on Hückel's rule.

[Figure: the same structure aromatized in “General Style” and in “Basic Style”]

Most descriptor calculation packages recognize the “basic style” only.
http://www.chemaxon.com/jchem/marvin/help/sci/aromatization-doc.html
Preparation of training and test sets

[Flowchart]
1. Initial data set: splitting of an initial data set into training and test sets (test set: 10 – 15 %)
2. Training set: building of structure-property models
3. Selection of the best models according to statistical criteria
4. Test set: “prediction” calculations using the best structure-property models
Recommendations to prepare a test set

• (i) experimental methods for determination of activities in the training and test sets should be similar;

• (ii) the activity values should span several orders of magnitude, but should not exceed activity values in the training set by more than 10%;

• (iii) the balance between active and inactive compounds should be respected for uniform sampling of the data.

References: Oprea, T. I.; Waller, C. L.; Marshall, G. R. J. Med. Chem. 1994, 37, 2206-2215
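Recommendation (iii) can be illustrated with a stratified random split. The sketch below is one possible way to do it, not the procedure of the cited reference: each class is sampled separately so that the test set keeps the active/inactive balance. The function name `stratified_split`, the 15 % fraction and the labels are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.15, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # at least one compound of every class goes to the test set
        n_test = max(1, round(test_fraction * len(indices)))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical dataset: 20 actives followed by 40 inactives
labels = ["active"] * 20 + ["inactive"] * 40
train_idx, test_idx = stratified_split(labels, test_fraction=0.15)
```

Recommendations (i) and (ii) concern the experimental data themselves and have to be checked separately.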
Descriptors

• Variable selection
• Normalization

[Figure: pattern matrix, with molecules as rows and descriptors as columns]
Selection of descriptors for QSAR model
QSAR models should be reduced to a set of descriptors which is
as information rich but as small as possible.

Objective selection (independent variable only)
• Statistical criteria of correlations
• Pairwise selection (Forward or Backward Stepwise selection)
• Principal Component Analysis
• Partial Least Squares analysis
• Genetic Algorithm
• …

Subjective selection
• Descriptors selection based on mechanistic studies
Preprocessing strategy for the derivation of models
for use in structure-activity relationships (QSARs)

1. identify a subset of columns (variables) with significant correlation to the response;
2. remove columns (variables) with zero (small) variance;
3. remove columns (variables) with no unique information;
4. identify a subset of variables on which to construct a model;
5. address the problem of chance correlation.

D. C. Whitley, M. G. Ford, D. J. Livingstone, J. Chem. Inf. Comput. Sci. 2000, 40, 1160-1168
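Steps 2 and 3 of the list above lend themselves to a short sketch. This is not the algorithm of the cited paper: it simply drops near-constant columns, and then drops any column almost perfectly correlated with one already kept. The thresholds `min_variance` and `max_abs_r`, the function names and the descriptor values are hypothetical.

```python
import math


def column_variance(column):
    mean = sum(column) / len(column)
    return sum((v - mean) ** 2 for v in column) / len(column)


def pearson(a, b):
    """Pearson correlation coefficient of two non-constant columns."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return cov / norm


def filter_descriptors(columns, min_variance=1e-8, max_abs_r=0.99):
    """Return the names of the descriptor columns that survive both filters."""
    kept = []
    for name, values in columns.items():
        if column_variance(values) <= min_variance:
            continue  # step 2: zero (small) variance
        if any(abs(pearson(values, columns[k])) >= max_abs_r for k in kept):
            continue  # step 3: no unique information
        kept.append(name)
    return kept


# Hypothetical descriptor table: name -> one value per molecule
columns = {
    "logP": [1.2, 2.5, 0.3, 3.1],
    "n_rings": [1.0, 1.0, 1.0, 1.0],      # constant column
    "logP_copy": [2.4, 5.0, 0.6, 6.2],    # exactly 2 * logP
    "mol_weight": [180.0, 150.0, 310.0, 95.0],
}
kept = filter_descriptors(columns)
```

The order of the columns matters here, since the first of two correlated descriptors is the one that is kept.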
Descriptors Normalisation
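[Figure: pattern matrix, with molecules i = 1, …, n as rows and descriptors j as columns]

$$m_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}^{*}$$

$$s_j^{2} = \frac{1}{n} \sum_{i=1}^{n} \left( x_{ij}^{*} - m_j \right)^{2}$$

Normalisation 1 (Mean Centring): $x_{ij} = x_{ij}^{*} - m_j$

Normalisation 2 (Unit Variance Scaling): $x_{ij} = \dfrac{x_{ij}^{*} - m_j}{s_j}$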
Data Normalisation

[Figure: descriptor values before scaling (Initial) and after Norm. 1 and Norm. 2]
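The two scalings can be written out for a single descriptor column as a minimal sketch. It uses the population variance (division by n), as in the definition of s_j on the slide. The function names `mean_centre` and `autoscale` and the sample values are illustrative assumptions.

```python
import math


def column_stats(column):
    """Return the mean m_j and the standard deviation s_j (division by n)."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return mean, std


def mean_centre(column):
    """Subtract the column mean from every value."""
    mean, _ = column_stats(column)
    return [v - mean for v in column]


def autoscale(column):
    """Subtract the column mean, then divide by the standard deviation."""
    mean, std = column_stats(column)
    if std == 0:
        raise ValueError("constant descriptor: remove it before scaling")
    return [(v - mean) / std for v in column]


raw = [2.0, 4.0, 6.0, 8.0]   # hypothetical descriptor values
centred = mean_centre(raw)
scaled = autoscale(raw)
```

After the second scaling every descriptor has zero mean and unit variance, so descriptors measured on very different scales contribute comparably to distance-based methods.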
Machine-Learning Methods
Fitting models’ parameters
Y = F(ai , Xi )

Xi - descriptors (independent variables)
ai - fitted parameters

The goal is to minimize the Residual Sum of Squares (RSS):

$$\mathrm{RSS} = \sum_{i=1}^{N} \left( y_{exp,i} - y_{calc,i} \right)^{2}$$
Multiple Linear Regression
Activity  Descriptor
Y1        X1
Y2        X2
…         …
Yn        Xn

Yi = a0 + a1 Xi1
Multiple Linear Regression
y = ax + b

Residual Sum of Squares (RSS):

$$\mathrm{RSS} = \sum_{i=1}^{N} \left( y_i - y_{calc,i} \right)^{2}$$

[Figure: RSS as a function of the parameters a and b]
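For the one-descriptor case y = ax + b, the parameters that minimize RSS have a closed form. The sketch below uses the standard least-squares formulas for slope and intercept. The function names and the data are illustrative assumptions.

```python
def fit_line(x, y):
    """Return slope a and intercept b that minimise RSS for y = a*x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def rss(x, y, a, b):
    """Residual sum of squares of the line y = a*x + b on the data."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data lying exactly on y = 2x + 1
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(x, y)
```

Any other pair (a, b) gives a larger RSS on the same data, which is what the minimum of the RSS surface expresses.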
Multiple Linear Regression

Activity  Descr 1  Descr 2  …  Descr m
Y1        X11      X12      …  X1m
Y2        X21      X22      …  X2m
…         …        …        …  …
Yn        Xn1      Xn2      …  Xnm

Yi = a0 + a1 Xi1 + a2 Xi2 +…+ am Xim
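With several descriptors, the coefficients a0 … am that minimize RSS can be obtained from the normal equations. The following is a small standard-library sketch of that idea, assuming a well-conditioned descriptor matrix. In practice a numerical library would normally be used. All names and data are illustrative.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: descriptors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_mlr(X, y):
    """Return [a0, a1, ..., am] minimising RSS for Yi = a0 + sum_j aj * Xij."""
    design = [[1.0] + list(row) for row in X]   # leading 1.0 carries a0
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, row):
    return coefficients[0] + sum(a * v for a, v in zip(coefficients[1:], row))


# Hypothetical data generated from Y = 1 + 2*X1 - 0.5*X2
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
y = [1.0 + 2.0 * x1 - 0.5 * x2 for x1, x2 in X]
coefficients = fit_mlr(X, y)
```

The solver raises an error when descriptors are linearly dependent, which is one practical reason for removing redundant columns before fitting.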


kNN (k Nearest Neighbors)

The activity Y of a compound is assessed as a weighted mean of the activities Yi of its k nearest neighbors in the chemical space.

[Figure: training set compounds plotted in the Descriptor 1 / Descriptor 2 plane]

A.Tropsha, A.Golbraikh, 2003
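A weighted-mean kNN prediction can be sketched as follows. The slide does not specify the weighting scheme, so inverse Euclidean distance is used here purely as an assumption. The function name `knn_predict` and the training data are also hypothetical.

```python
import math


def knn_predict(query, training_set, k=3):
    """Predict activity as an inverse-distance weighted mean over k neighbors.

    `training_set` is a list of (descriptor_tuple, activity) pairs.
    """
    neighbors = sorted(
        (math.dist(query, descriptors), activity)
        for descriptors, activity in training_set
    )[:k]
    # an exact match in descriptor space returns its own activity
    for distance, activity in neighbors:
        if distance == 0.0:
            return activity
    weights = [1.0 / distance for distance, _ in neighbors]
    return sum(w * a for w, (_, a) in zip(weights, neighbors)) / sum(weights)


# Hypothetical training set in a two-descriptor space
training_set = [
    ((0.0, 0.0), 1.0),
    ((2.0, 0.0), 3.0),
    ((0.0, 2.0), 5.0),
    ((10.0, 10.0), 100.0),
]
prediction = knn_predict((1.0, 0.0), training_set, k=2)
```

In this example the two nearest neighbors are equally distant, so the prediction is the plain mean of their activities.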


Biological and Artificial Neuron
Multilayer Neural Network

Neurons in the input layer correspond to descriptors, neurons in the output layer to the properties being predicted, and neurons in the hidden layer to nonlinear latent variables.
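SVM: Support Vector Machine

$$\langle w, x \rangle + b = +1$$
$$\langle w, x \rangle + b = 0$$
$$\langle w, x \rangle + b = -1$$

Distance between the two outer hyperplanes: $2 / \lVert w \rVert$

Support Vector Classification (SVC)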


SVM: Margins
The margin is the minimal distance of any training point to the separating hyperplane:

$$\text{Margin} = \frac{1}{\lVert w \rVert}$$
Support Vector Regression

ε-Insensitive Loss Function

$$|\xi|_{\varepsilon} := \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{otherwise} \end{cases}$$

Only the points outside the ε-tube are penalized, in a linear fashion.
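The loss function is simple enough to write directly. In this sketch ξ is taken to be the residual between the experimental and the calculated value, which is an assumption, since the slide does not define ξ. The default `epsilon` is an arbitrary illustrative value.

```python
def epsilon_insensitive_loss(y_exp, y_calc, epsilon=0.1):
    """Zero inside the epsilon-tube, linear in the excess residual outside it."""
    residual = abs(y_exp - y_calc)
    return max(0.0, residual - epsilon)


inside_tube = epsilon_insensitive_loss(1.0, 1.05, epsilon=0.1)
outside_tube = epsilon_insensitive_loss(1.0, 1.5, epsilon=0.1)
```

A point inside the tube contributes nothing to the loss, while a point outside contributes only the part of its residual that exceeds ε.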
Kernel Trick

$$K(x, x') = \langle \Phi(x), \Phi(x') \rangle$$

The left-hand side is evaluated in the low-dimensional input space; the right-hand side is a scalar product in the high-dimensional feature space.

Any non-linear problem (classification, regression) in the original input space can be converted into a linear one by a non-linear mapping Φ into a feature space of higher dimension.
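A small numerical check makes the kernel identity concrete. The example uses the homogeneous polynomial kernel of degree 2 in two dimensions, a textbook case chosen here for illustration and not taken from the slides. Its explicit feature map is Φ(x) = (x1², √2·x1·x2, x2²).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def kernel(x, z):
    """Degree-2 polynomial kernel, evaluated in the 2-D input space."""
    return dot(x, z) ** 2


def feature_map(x):
    """Explicit map into the 3-D feature space for the kernel above."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, -1.0)
in_input_space = kernel(x, z)
in_feature_space = dot(feature_map(x), feature_map(z))
```

Both values agree, which shows that the kernel gives the feature-space scalar product without ever building Φ(x) explicitly.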
QSAR/QSPR models
• Development
• Validation
• Application
Validation
Estimation of the models' predictive performance
5-Fold Cross-Validation

[Figure: the dataset is split into Fold1 to Fold5; each fold is left out in turn]

All compounds of the dataset are predicted.


Leave-One-Out Cross-Validation
N-Fold Internal Cross-Validation

• Cross-validation is performed AFTER variable selection on the entire dataset.

• On each fold, the “test” set contains only 1 molecule.
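The fold mechanics can be sketched generically: with k folds each compound is predicted exactly once, and leave-one-out is the special case k = N. The model used here, which always predicts the mean of the training activities, is a deliberately trivial hypothetical stand-in. The function names are assumptions as well.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validate(x, y, k, fit, predict):
    """Return one out-of-fold prediction per compound."""
    predictions = [None] * len(y)
    for train, test in k_fold_indices(len(y), k):
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in test:
            predictions[i] = predict(model, x[i])
    return predictions


# Hypothetical stand-in model: always predicts the training-set mean
def fit_mean(x_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x_value):
    return model


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0]
loo_predictions = cross_validate(x, y, k=len(y), fit=fit_mean, predict=predict_mean)
```

The folds are contiguous, so a dataset that is sorted by activity or by chemical series should be shuffled before it is split.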


Statistical parameters for Regression

Fitting vs validation

Stabilities (logK) of Sr2+L complexes in water

[Figure: three scatter plots of calculated or predicted logK (LogKcalc, LogKpred) against LogKexp]

Panel  R2     RMSE  Comment
Fit    0.886  0.97  All molecules were used for the model preparation
LOO    0.826  1.20  Each molecule was “predicted” in internal CV
5-CV   0.682  1.62  Each molecule was predicted in external CV
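The two statistics quoted for each panel can be computed as in the sketch below. The exact definitions used in the slides did not survive extraction, so common ones are assumed here: RMSE as the root of the mean squared error, and R2 as the determination coefficient 1 − RSS/TSS. The values are hypothetical.

```python
import math


def rmse(y_exp, y_pred):
    """Root mean squared error between experimental and predicted values."""
    n = len(y_exp)
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(y_exp, y_pred)) / n)


def r2_determination(y_exp, y_pred):
    """Determination coefficient, 1 - RSS/TSS (an assumed definition)."""
    mean_exp = sum(y_exp) / len(y_exp)
    rss = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    tss = sum((e - mean_exp) ** 2 for e in y_exp)
    return 1.0 - rss / tss


# Hypothetical experimental and predicted values
y_exp = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
error = rmse(y_exp, y_pred)
r2 = r2_determination(y_exp, y_pred)
```

Applying the same two functions to fitted values and to cross-validated predictions gives the kind of fitting-versus-validation comparison shown in the panels.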
Regression Error Characteristic (REC)

REC curves are widely used to compare the performance of different models.
The gray line corresponds to the average value model (AM). For a given model, the area between the AM curve and the corresponding calculated curve reflects its quality.
Statistical parameters for Classification

Confusion Matrix
Classification Evaluation

sensitivity = true positive rate (TPR) = hit rate = recall
TPR = TP / P = TP / (TP + FN)

false positive rate (FPR)
FPR = FP / N = FP / (FP + TN)

specificity (SPC) = true negative rate
SPC = TN / N = TN / (FP + TN) = 1 − FPR

positive predictive value (PPV) = precision
PPV = TP / (TP + FP)

negative predictive value (NPV)
NPV = TN / (TN + FN)

accuracy (ACC)
ACC = (TP + TN) / (P + N)

balanced accuracy (BAC)
BAC = (sensitivity + specificity) / 2 = (TP / (TP + FN) + TN / (FP + TN)) / 2
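The formulas above translate directly into code. The sketch below takes the four cells of a binary confusion matrix and returns the listed statistics. The function name and the counts are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the statistics listed above from a binary confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    return {
        "TPR": sensitivity,
        "FPR": fp / (fp + tn),
        "SPC": specificity,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "BAC": (sensitivity + specificity) / 2,
    }


# Hypothetical counts: 50 actives (40 found) and 100 inactives (90 rejected)
metrics = classification_metrics(tp=40, fp=10, tn=90, fn=10)
```

With these unbalanced counts ACC (about 0.87) is higher than BAC (0.85), which is why balanced accuracy is often preferred when one class dominates.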
Receiver Operating Characteristic (ROC)
Plot of the sensitivity vs (1 − specificity) for a binary classifier system as its discrimination threshold is varied.

The ROC can also be represented equivalently by plotting the fraction of true positives (TPR = true positive rate) vs the fraction of false positives (FPR = false positive rate).

[Figure: ROC curve, with TPR on the vertical axis and FPR on the horizontal axis]

Ideally, the Area Under Curve (AUC) approaches 1.
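A ROC curve and its AUC can be computed by sweeping the threshold over the ranked scores. The sketch below is a minimal illustration, assuming higher scores mean "more likely active" and labels coded as 1 (active) and 0 (inactive). Tied scores are handled by emitting one point per distinct score. All names and data are hypothetical.

```python
def roc_curve(scores, labels):
    """Return ROC points (FPR, TPR) as the threshold sweeps from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for index, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # emit a point only after the last compound sharing this score
        is_last_of_tie = index + 1 == len(ranked) or ranked[index + 1][0] != score
        if is_last_of_tie:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores and labels (1 = active, 0 = inactive)
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_curve(scores, labels))
```

A ranking that places every active before every inactive gives an area of 1, while a random ranking gives about 0.5.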
ROC (Receiver Operating Characteristics)

[Figure: ROC curve of TP% against FP%, both running from 0% to 100%, built from a ranked list of compounds (0 to 9 and a to j) sorted into TP, FP, FN and TN; the example curve has AUC = 0.84]

Ideal model: AUC = 1.00
Useless model: AUC = 0.50
When is a model accepted?

Regression models: determination coefficient $R^2 > R_0^2$. Here, $R_0^2 = 0.5$.

Classification models: BA > 1/q for q classes.

[Figure: example with 3 classes]
“Chance correlation” problem

[Figure: two series plotted against year (1965 to 1980), one on an axis from 1,000 to 2,000 and one on an axis from 0.5 to 1]

A model MUST be validated on new independent data to avoid a chance correlation.
Y-Scrambling
(for methods without descriptor selection)

Three successive permutations of the Y column:

X    Y    Scrambled 1  Scrambled 2  Scrambled 3
X1   Y1   Y2           Y4           Y7
X2   Y2   Y5           Y1           Y6
X3   Y3   Y4           Y5           Y3
X4   Y4   Y6           Y2           Y5
X5   Y5   Y1           Y6           Y4
X6   Y6   Y7           Y3           Y1
X7   Y7   Y3           Y7           Y2

[Figure: R2 of each scrambled model marked on an axis from 0.0 to 1.0]
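The scrambling test can be sketched with a one-descriptor linear model. The activities are shuffled, the model is refitted, and the resulting R2 values are compared with that of the real model. The model, the number of rounds, the seed and the data are assumptions made for illustration only.

```python
import random


def r_squared_of_line_fit(x, y):
    """Fit y = a*x + b by least squares and return R2 = 1 - RSS/TSS."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    rss = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss


def y_scrambling(x, y, n_rounds=100, seed=0):
    """Return the R2 obtained after each random permutation of y."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = y[:]          # copy, so the original order is preserved
        rng.shuffle(shuffled)
        scores.append(r_squared_of_line_fit(x, shuffled))
    return scores


# Hypothetical data: a strong linear trend with small alternating deviations
x = [float(i) for i in range(1, 21)]
y = [2.0 * xi + 1.0 + (0.5 if i % 2 else -0.5) for i, xi in enumerate(x)]
real_r2 = r_squared_of_line_fit(x, y)
scrambled_r2 = y_scrambling(x, y)
```

If the scrambled models reached R2 values close to the real one, the original correlation would be suspected of being a chance correlation.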
QSAR/QSPR models
• Development
• Validation
• Application
QSPR Models → Test compound → Prediction Performance

Robustness of QSPR models:
- Descriptors type;
- Descriptors selection;
- Machine-learning methods;
- Validation of models.

Applicability domain of models:
Is a test compound similar to the training set compounds?
Applicability domain of QSAR models

The new compound will be predicted by the model only if:

$$D_i \le \langle D_k \rangle + Z \times s_k$$

with Z an empirical parameter (0.5 by default).

[Figure: training set and test compounds in the Descriptor 1 / Descriptor 2 plane. A test compound inside the domain will be predicted; a test compound outside the domain will not be predicted.]
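The threshold rule can be sketched as below. The slide does not define its symbols, so the following reading is an assumption: D_i is the mean Euclidean distance from the test compound to its k nearest training compounds, and ⟨D_k⟩ and s_k are the mean and standard deviation of the same quantity computed for every training compound against the rest of the training set. Function names and data are hypothetical.

```python
import math


def mean_knn_distance(point, others, k):
    """Mean Euclidean distance from `point` to its k nearest points in `others`."""
    distances = sorted(math.dist(point, other) for other in others)
    return sum(distances[:k]) / k


def domain_threshold(training, k=1, z=0.5):
    """Return <D_k> + Z * s_k over the training set (ASSUMED definitions)."""
    per_compound = [
        mean_knn_distance(p, training[:i] + training[i + 1:], k)
        for i, p in enumerate(training)
    ]
    mean = sum(per_compound) / len(per_compound)
    std = math.sqrt(sum((d - mean) ** 2 for d in per_compound) / len(per_compound))
    return mean + z * std


def in_domain(compound, training, k=1, z=0.5):
    """True when the compound satisfies D_i <= <D_k> + Z * s_k."""
    return mean_knn_distance(compound, training, k) <= domain_threshold(training, k, z)


# Hypothetical training set in a two-descriptor space
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
near = in_domain((0.5, 0.5), training)
far = in_domain((5.0, 5.0), training)
```

A larger Z widens the domain, so more test compounds are predicted, at the price of extrapolating further from the training set.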
Applicability Domain Approaches

Fragment-based methods
• Fragment Control (FC)
• Model’s Fragment Control (MFC)

Density-based methods
• 1-SVM

Distance-based methods
• zkNN

Range-based methods
• Bounding Box (BB)
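Of the approaches listed, the range-based one is the simplest to sketch. The slide only names the Bounding Box method, so the definition used here is an assumption: a compound is inside the domain when each of its descriptors lies within the minimum and maximum observed in the training set. Names and data are hypothetical.

```python
def bounding_box(training):
    """Return one (min, max) pair per descriptor column of the training set."""
    columns = list(zip(*training))
    return [(min(c), max(c)) for c in columns]


def inside_box(compound, box):
    """True when every descriptor value lies within its training range."""
    return all(low <= value <= high for value, (low, high) in zip(compound, box))


# Hypothetical training set with two descriptors per compound
training = [(0.0, 10.0), (2.0, 30.0), (1.0, 20.0)]
box = bounding_box(training)
accepted = inside_box((1.5, 25.0), box)
rejected = inside_box((1.5, 35.0), box)
```

The box ignores correlations between descriptors, so it can accept compounds that lie in empty corners of the descriptor space.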


Ensemble modeling
Hunting season …

Single hunter
Hunting season …

Many hunters
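Ensemble modeling

[Figure: individual models give the predictions Y1, Y2, Y3, which are combined into a consensus value]

$$\text{Consensus} = \frac{1}{n} \sum_{i=1}^{n} Y_i$$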
Screening and hits selection
[Figure: Database → Virtual Screening with a QSPR model → Hits (sent to Experimental Tests) and Useless compounds]
