Quantitative Structure-Property-Relationships
SAR/QSAR/QSPR modeling
Alexandre Varnek
Faculté de Chimie, ULP, Strasbourg, FRANCE
SAR/QSAR/QSPR models
• Development
• Validation
• Application
Classification and Regression
models
• Development
• Validation
• Application
Development of the models
• Selection and curation of experimental data
• Preparation of training and test sets (optionally)
• Selection of an initial set of descriptors and their
normalisation
• Variable selection (optionally)
• Selection of a machine-learning method
Validation of models
• Training/test set
• Cross-validation
- internal,
- external
[Figure: chemical structure drawings]
Activity analysis of duplicates is also highly important to identify cases where one occurrence is
identified as ‘active’ and another one as ‘weak active’ or ‘inactive’.
[Figure: two occurrences of a structure, one labelled ACTIVE and the other INACTIVE]
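The duplicate check described above can be sketched in a few lines. This is only an illustration, assuming each record is a (structure key, activity label) pair in which the key is some canonical identifier of the structure. The function name `find_conflicting_duplicates` and the keys `STRUCT_A`, `STRUCT_B`, `STRUCT_C` are hypothetical.

```python
from collections import defaultdict


def find_conflicting_duplicates(records):
    """Return {structure_key: sorted labels} for structures reported with more than one label."""
    labels_by_structure = defaultdict(set)
    for structure_key, label in records:
        labels_by_structure[structure_key].add(label)
    return {
        key: sorted(labels)
        for key, labels in labels_by_structure.items()
        if len(labels) > 1
    }


# Hypothetical records: STRUCT_A occurs twice with contradictory labels,
# STRUCT_C occurs twice with the same label (a harmless duplicate).
records = [
    ("STRUCT_A", "active"),
    ("STRUCT_B", "inactive"),
    ("STRUCT_A", "inactive"),
    ("STRUCT_C", "active"),
    ("STRUCT_C", "active"),
]
conflicts = find_conflicting_duplicates(records)
print(conflicts)
```

Only duplicates with contradictory labels are reported; how to resolve them (drop, re-measure, keep one) is a curation decision that the sketch leaves open.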
Structural standardization
For a given dataset, chemical groups have to be written in a standardized way, taking
into account critical properties (like pH) of the modeled system.
Aromatic compounds
[Figure: alternative drawings of the same aromatic rings and functional groups, including neutral and charged (O–, N+) forms]
For a given dataset, these functional groups have to be written in a consistent way to
avoid different descriptor values for the same chemical group.
• Normalization of carboxylic, nitro groups, etc.
• Removal of inorganics
[Figure: compound ID=172 drawn in “General Style” and in “Basic Style”]
Building of structure-property models
Initial data set
• Splitting of the initial data set into training and test sets (test set: 10 – 15 %)
• Training set: building of structure-property models; selection of models according to statistical criteria
• Test set: “prediction” calculations
• (ii) the activity values should span several orders of magnitude, but
should not exceed activity values in the training set by more than 10%;
References: Oprea, T. I.; Waller, C. L.; Marshall, G. R. J. Med. Chem. 1994, 37, 2206-2215
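A random hold-out of 10 – 15 % of the data can be sketched as follows. The names `split_train_test` and `within_training_range` are hypothetical, and the range check is only one possible reading of criterion (ii): test activities must stay within the training range widened by 10 % of that range on each side.

```python
import random


def split_train_test(items, test_fraction=0.15, seed=0):
    """Shuffle a copy of items and hold out test_fraction of them as the test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def within_training_range(train_values, test_values, tolerance=0.10):
    """Assumed reading of criterion (ii): every test value lies inside the
    training range widened by tolerance * (max - min) on each side."""
    low, high = min(train_values), max(train_values)
    margin = tolerance * (high - low)
    return all(low - margin <= v <= high + margin for v in test_values)


# Hypothetical activity values for 20 compounds.
activities = [0.5 * i for i in range(20)]
train, test = split_train_test(activities)
print(len(train), len(test))
print(within_training_range(train, test))
```

A fixed seed keeps the split reproducible. A purely random split does not guarantee that the test set spans the activity range, so the check is run after the split.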
Descriptors
• Variable selection
• Normalization
Pattern matrix: molecules (rows) × descriptors (columns)
Selection of descriptors for QSAR model
QSAR models should be reduced to a set of descriptors which is
as information rich but as small as possible.
Subjective selection
Descriptors selection based on mechanistic studies
Preprocessing strategy for the derivation of models
for use in structure-activity relationships (QSARs)
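Data Normalisation
Pattern matrix: $x_{ij}^{*}$ is the raw value of descriptor $j$ for molecule $i$ ($i = 1, \dots, n$)
Mean of descriptor $j$: $m_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}^{*}$
Variance of descriptor $j$: $s_j^{2} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij}^{*} - m_j)^{2}$
Normalisation 2 (Mean Centring and Scaling): $x_{ij} = \frac{x_{ij}^{*} - m_j}{s_j}$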
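Mean centring and scaling of a pattern matrix can be sketched in plain Python as below. The function name `autoscale` is hypothetical, the matrix is assumed to be a list of rows (one row per molecule), and the variance uses the 1/n form. Constant descriptors (zero variance) are rejected because they cannot be scaled.

```python
def autoscale(matrix):
    """Mean-centre each column (descriptor) and divide it by its standard deviation (1/n form)."""
    n = len(matrix)
    columns = list(zip(*matrix))
    means = [sum(col) / n for col in columns]
    stds = [
        (sum((x - m) ** 2 for x in col) / n) ** 0.5
        for col, m in zip(columns, means)
    ]
    if any(s == 0 for s in stds):
        raise ValueError("constant descriptor: standard deviation is zero")
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in matrix
    ]


# Hypothetical pattern matrix: 3 molecules x 2 descriptors on very different scales.
raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = autoscale(raw)
for row in scaled:
    print(row)
```

After scaling, every descriptor has zero mean and unit variance, so descriptors measured on different scales contribute comparably to the model.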
Multiple Linear Regression
Activity (Y)    Descriptor (X)
Y1              X1
Y2              X2
…               …
Yn              Xn
$Y_i = a_0 + a_1 X_{i1}$
Multiple Linear Regression
$y = ax + b$
Residual Sum of Squares (RSS):
$RSS = \sum_{i=1}^{N} (y_i - y_{calc,i})^{2}$
The coefficients $a$ and $b$ are chosen to minimise RSS.
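For the one-descriptor case y = ax + b, the coefficients that minimise RSS have a closed form. The sketch below uses hypothetical function names `fit_line` and `rss` and made-up data. It illustrates ordinary least squares and does not reproduce any particular software.

```python
def fit_line(x, y):
    """Least-squares estimates of slope a and intercept b for y = a*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def rss(x, y, a, b):
    """Residual sum of squares between observed y and calculated a*x + b."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))


# Hypothetical descriptor values and activities.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.2, 6.8]
a, b = fit_line(x, y)
print(a, b, rss(x, y, a, b))
```

Any other pair (a, b) gives a larger RSS on the same data, which is what "least squares" means.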
Multiple Linear Regression
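[Figure: TRAINING SET plotted in the Descriptor 1 / Descriptor 2 plane with a separating hyperplane and its margin]
Hyperplanes: $\langle w, x \rangle + b = +1$, $\langle w, x \rangle + b = 0$, $\langle w, x \rangle + b = -1$
Margin: $\frac{2}{\|w\|}$ between the two outer hyperplanes, $\frac{1}{\|w\|}$ from each of them to the central one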
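Support Vector Regression
ε-insensitive loss: $|\xi|_{\varepsilon} = 0$ if $|\xi| \le \varepsilon$, and $|\xi|_{\varepsilon} = |\xi| - \varepsilon$ otherwise
Only the points outside the ε-tube are penalized, in a linear fashion.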
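The ε-insensitive loss is short enough to write out directly. The function name `epsilon_insensitive_loss` and the numbers below are hypothetical; the residual is taken as the difference between the observed and the predicted value.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon-tube, linear in the excess error outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical observed and predicted values with a tube half-width of 0.1.
observed = [1.0, 1.0, 2.0]
predicted = [1.05, 1.5, 1.7]
losses = [epsilon_insensitive_loss(o, p, 0.1) for o, p in zip(observed, predicted)]
print(losses)
```

The first point lies inside the tube and contributes nothing; the other two are penalized by the amount their error exceeds ε.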
Kernel Trick
$K(x, x') = \langle \Phi(x), \Phi(x') \rangle$
($x$, $x'$ in the low-dimensional input space; $\Phi(x)$, $\Phi(x')$ in the high-dimensional feature space)
A non-linear problem (classification, regression) in the original input space can be
converted into a linear one by a non-linear mapping Φ into a feature space of
higher dimension.
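A small numerical check makes the identity concrete. The example below assumes a homogeneous polynomial kernel of degree 2 on two-dimensional inputs, chosen only because its feature map Φ(x) = (x1², √2·x1·x2, x2²) is easy to write down; the slides do not single out this kernel.

```python
import math


def dot(u, v):
    """Ordinary scalar product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def poly2_kernel(x, z):
    """K(x, z) = <x, z>^2, evaluated in the 2-D input space."""
    return dot(x, z) ** 2


def poly2_feature_map(x):
    """Explicit map into the 3-D feature space for the kernel above."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x = (1.0, 2.0)
z = (3.0, 4.0)
kernel_value = poly2_kernel(x, z)
explicit_value = dot(poly2_feature_map(x), poly2_feature_map(z))
print(kernel_value, explicit_value)
```

Both routes give the same number, but the kernel never builds the feature vectors, which is what makes very high-dimensional feature spaces affordable.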
QSAR/QSPR models
• Development
• Validation
• Application
Preparation of training and test sets
All compounds of the dataset are predicted
Fitting vs validation
[Figure: three plots against LogKexp]
Plot                                                    R2       RMSE
All molecules were used for the model preparation      0.886    0.97
Each molecule was “predicted” in internal CV           0.826    1.20
Each molecule was predicted in external CV             0.682    1.62
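The gap between fitting and validation statistics can be reproduced on toy data. The sketch below compares R2 and RMSE of a fitted line with those from leave-one-out predictions, where each point is predicted by a model trained without it. All names and data are hypothetical, and R2 is assumed to be the determination coefficient 1 − SS_res/SS_tot.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = a*x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def r2(y_obs, y_pred):
    """Determination coefficient 1 - SS_res / SS_tot."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


def rmse(y_obs, y_pred):
    """Root mean squared error."""
    return (sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs)) ** 0.5


def leave_one_out_predictions(x, y):
    """Predict each point with a line fitted to all the other points."""
    predictions = []
    for i in range(len(x)):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        predictions.append(a * x[i] + b)
    return predictions


# Hypothetical descriptor values and activities.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.2, 0.9, 2.3, 2.8, 4.1, 5.2]
a, b = fit_line(x, y)
fitted = [a * xi + b for xi in x]
loo = leave_one_out_predictions(x, y)
r2_fit, r2_loo = r2(y, fitted), r2(y, loo)
rmse_fit, rmse_loo = rmse(y, fitted), rmse(y, loo)
print(r2_fit, rmse_fit)
print(r2_loo, rmse_loo)
```

Fitting statistics are computed on data the model has already seen, so they are at least as optimistic as the cross-validated ones.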
Regression Error Characteristic (REC)
REC curves are widely used to compare the performance of different models.
The gray line corresponds to the average value model (AM). For a given model, the
area between the AM curve and the model's own curve reflects its quality.
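A REC curve gives, for each error tolerance, the fraction of compounds predicted within that tolerance. The sketch below assumes absolute error as the error measure and uses hypothetical names (`rec_curve`) and data; the average value model predicts the mean observed value for every compound.

```python
def rec_curve(y_obs, y_pred, tolerances):
    """Fraction of predictions whose absolute error is within each tolerance."""
    errors = [abs(o - p) for o, p in zip(y_obs, y_pred)]
    n = len(errors)
    return [sum(e <= t for e in errors) / n for t in tolerances]


# Hypothetical observed values and model predictions.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_model = [1.1, 2.0, 2.5, 4.4]
mean_obs = sum(y_obs) / len(y_obs)
y_average_model = [mean_obs] * len(y_obs)

tolerances = [0.0, 0.25, 0.5, 1.0]
model_curve = rec_curve(y_obs, y_model, tolerances)
average_curve = rec_curve(y_obs, y_average_model, tolerances)
print(model_curve)
print(average_curve)
```

The further the model curve lies above the average-model curve, the larger the area between them.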
Statistical parameters for Classification
Confusion Matrix
Classification Evaluation
accuracy (ACC)
ACC = (TP + TN) / (P + N)
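The accuracy formula can be checked on a small example. The sketch below counts the four cells of a binary confusion matrix, with P = TP + FN and N = FP + TN; the function names and the labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP and TN for a binary classification."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for truth, guess in zip(y_true, y_pred):
        if truth == positive:
            counts["TP" if guess == positive else "FN"] += 1
        else:
            counts["FP" if guess == positive else "TN"] += 1
    return counts


def accuracy(counts):
    """ACC = (TP + TN) / (P + N), with P = TP + FN and N = FP + TN."""
    total = counts["TP"] + counts["FN"] + counts["FP"] + counts["TN"]
    return (counts["TP"] + counts["TN"]) / total


# Hypothetical experimental classes (1 = active) and predicted classes.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(counts))
```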
ROC (Receiver Operating Characteristics)
[Figure: ROC curve, TP% (0% to 100%) plotted against FP% (0% to 100%), i.e. against the false positive rate (FPR)]
• Ideal model: AUC = 1.00
• Plotted model: AUC = 0.84
• Useless model: AUC = 0.50
Ideally, Area Under Curve (AUC) → 1
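AUC can be computed without drawing the curve, using the equivalent pairwise formulation: it is the fraction of (active, inactive) pairs in which the active compound receives the higher score, with ties counted as one half. The function name `roc_auc` and the scores below are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a positive is ranked above a negative (ties count 0.5)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classes (1 = active) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
auc_model = roc_auc(y_true, [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
auc_ideal = roc_auc(y_true, [0.9, 0.8, 0.1, 0.7, 0.2, 0.3])
auc_useless = roc_auc(y_true, [0.5] * 6)
print(auc_model, auc_ideal, auc_useless)
```

The ideal ranking puts every active above every inactive (AUC = 1), while a model that gives all compounds the same score cannot separate them (AUC = 0.5).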
When is a model accepted?
[Figure: 3 classes; scale values 2,000, 1,500 and 1,000 shown against 1, 0.75 and 0.5; R2 axis from 0.0 to 1.0]
Y-Scrambling
(for methods without descriptor selection)
X     Y     Scrambled Y (run 1)    Scrambled Y (run 2)
X1    Y1    Y4                     Y7
X2    Y2    Y1                     Y6
X3    Y3    Y5                     Y3
X4    Y4    Y2                     Y5
X5    Y5    Y6                     Y4
X6    Y6    Y3                     Y1
X7    Y7    Y7                     Y2
[Figure: R2 axis from 0.0 to 1.0]
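Y-scrambling can be sketched as follows: the activities are randomly permuted, the model is refitted on each permuted set, and the resulting R2 values are compared with the R2 of the real model. The example assumes a one-descriptor least-squares line as the model and uses hypothetical names and data.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for y = a*x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def fitted_r2(x, y):
    """R2 of the least-squares line fitted to (x, y)."""
    a, b = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def y_scrambling(x, y, n_runs=100, seed=0):
    """R2 values obtained after refitting on randomly permuted activities."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        y_permuted = list(y)
        rng.shuffle(y_permuted)
        scores.append(fitted_r2(x, y_permuted))
    return scores


# Hypothetical descriptor values and activities for 7 compounds.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 5.8, 7.1]
real_r2 = fitted_r2(x, y)
scrambled = y_scrambling(x, y)
mean_scrambled_r2 = sum(scrambled) / len(scrambled)
print(real_r2, mean_scrambled_r2)
```

If scrambled data gave R2 values close to the real one, the original correlation would be suspect, since the model would fit noise equally well.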
QSAR/QSPR models
• Development
• Validation
• Application
Applicability domain of QSAR models
[Figure: QSPR models applied to a test compound, with prediction performance; the TRAINING SET and a TEST COMPOUND are plotted in descriptor space (axis: Descriptor 1). Legend: will be predicted / will not be predicted]
Applicability Domain Approaches
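The slides do not say which applicability domain approach is meant here. Purely as an illustration, the sketch below uses one of the simplest ones, a descriptor-range (bounding box) check: a compound is predicted only if each of its descriptors lies within the range seen in the training set. All names and values are hypothetical.

```python
def descriptor_ranges(training_matrix):
    """(min, max) of each descriptor column over the training set."""
    return [(min(col), max(col)) for col in zip(*training_matrix)]


def in_domain(compound, ranges):
    """True if every descriptor of the compound lies within the training range."""
    return all(low <= value <= high for value, (low, high) in zip(compound, ranges))


# Hypothetical training set: 4 molecules x 2 descriptors.
training = [[0.2, 1.0], [0.8, 3.5], [0.5, 2.0], [0.9, 1.5]]
ranges = descriptor_ranges(training)

inside = [0.6, 2.5]    # within both ranges: will be predicted
outside = [1.4, 2.5]   # descriptor 1 beyond the training range: will not be predicted
print(in_domain(inside, ranges), in_domain(outside, ranges))
```

A bounding box is easy to compute but ignores empty regions inside the box; distance-based or density-based definitions are stricter alternatives.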
Single hunter
Hunting season …
Many hunters
Ensemble modelling
Predictions of the individual models: $Y_1, Y_2, Y_3, \dots$
$\text{Consensus} = \frac{1}{n} \sum_{i=1}^{n} Y_i$
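The consensus value is the arithmetic mean of the individual predictions. In the sketch below the three "models" are hypothetical stand-in functions, used only to show the averaging step.

```python
def consensus_prediction(models, compound):
    """Arithmetic mean of the predictions of all models for one compound."""
    predictions = [model(compound) for model in models]
    return sum(predictions) / len(predictions)


# Hypothetical stand-ins for three independently built models;
# each takes a descriptor value and returns a predicted property.
models = [
    lambda d: 2.0 * d,
    lambda d: 2.0 * d + 1.0,
    lambda d: 2.0 * d + 2.0,
]
consensus = consensus_prediction(models, 2.0)
print(consensus)
```

Averaging tends to cancel the errors of individual models when those errors are not strongly correlated.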
Screening and hits selection
[Figure: Database → Virtual Screening with a QSPR model → Hits and Useless compounds; Hits → Experimental Tests]