
WWW.C-CHEM.ORG SOFTWARE NEWS AND UPDATES

QSARINS: A New Software for the Development, Analysis, and Validation of QSAR MLR Models
Paola Gramatica,* Nicola Chirico, Ester Papa, Stefano Cassani, and Simona Kovarich

QSARINS (QSAR-INSUBRIA) is a new software for the development and validation of multiple linear regression Quantitative Structure-Activity Relationship (QSAR) models by Ordinary Least Squares method and Genetic Algorithm for variable selection. This program is mainly focused on the external validation of QSAR models. Various tools for explorative analysis of the datasets by Principal Component Analysis, prereduction of input molecular descriptors, splitting of datasets in training and prediction sets, detection of outliers and interpolated or extrapolated predictions, internal and external validation by different parameters, consensus modeling and various plots for visualizations are implemented. QSARINS is a user-friendly platform for QSAR modeling in agreement with the OECD Principles and for the analysis of the reliability of the obtained predicted data. The Insubria Persistent Bioaccumulative and Toxic (PBT) Index model for the prediction of the cumulative behavior of new chemicals as PBTs is implemented. Additionally, QSARINS allows the user to validate single models, predeveloped using also different software. © 2013 Wiley Periodicals, Inc.

DOI: 10.1002/jcc.23361

P. Gramatica, N. Chirico, E. Papa, S. Cassani, S. Kovarich
Department of Theoretical and Applied Sciences, QSAR Research Unit in Environmental Chemistry and Ecotoxicology, University of Insubria, Via Dunant 3, 21100, Varese, Italy. E-mail: paola.gramatica@uninsubria.it
Contract grant sponsor: European Union through the project CADASTER FP7-ENV-2007-1-212668; Contract grant sponsor: Regione Lombardia and University of Insubria for a post-doc fellowship (to N.C.)

Introduction

Quantitative Structure-Activity Relationship (QSAR) models, when correctly developed and rigorously validated, are highly useful for screening and prioritization of chemicals without experimental data or even before their synthesis in the safe chemical design approach. Their use in regulation was also suggested in the European legislation of chemicals REACH,[1] in particular to reduce experimental costs and tests on animals. However, the users of QSAR models should be aware that QSAR modeling is not a trivial approach and that it is not sufficient to push a button and find a correlation.[2–4] In recent years, particular attention has been devoted by the QSAR community to the validation of QSAR models, with particular emphasis on applicability domain (AD) and predictivity,[5–13] and the famous OECD principles for the validation of QSAR models for their application in regulation[14] have been established to increase the confidence in the reliability of data predicted by QSAR models. Various parameters for checking the model predictivity, and specifically, for external validation on chemicals not used in model development, have been suggested.[5,6,15–21] In a recent comparison,[20,21] several problems for some of these parameters have been highlighted and the use of more than one validation criterion and of graphical plots has been proposed. However, it is not common to have different statistical parameters for validation, nor a pool of useful graphical visualizations, calculated together in a single software. We here propose a new software, QSARINS (QSAR-INSUBRIA), based on the experience of the QSAR Research Unit of Insubria, which allows the development of multiple linear regression (MLR) QSAR models, by ordinary least squares (OLS), carefully verified and thoroughly validated according to the chemometric approach. In addition, QSARINS implements various tools for multivariate analysis, applicable for the explorative study of data at different steps of the modeling process and for decision making in the selection of the best models.

Workflow of the QSAR Approach in QSARINS

The main steps involved in the development and analysis of a QSAR model can be summarized as: (1) Preparation, analysis, and setup of the input dataset, (2) Model calculation based on selected descriptors, (3) Model exploration, validation, and selection.

The first step in QSAR modeling is the preparation of a suitable dataset, which is then used as the input for the QSAR model generation by means of statistical analysis. This dataset consists of the experimental responses to be modeled and the corresponding set of molecular structure descriptors for each chemical, which are calculated using appropriate software. The preparation of a good input dataset is the first crucial point in any QSAR approach; the relevance of data curation has been the subject of relevant papers.[2–4] Usually, the number of calculated descriptors is high (hundreds, sometimes even thousands),[22,23] to have the possibility to represent different features of the chemical structure in different ways. Thus, a lot of them could be intercorrelated and redundant, giving very similar structural information. QSARINS eases the objective prereduction of the molecular descriptors by means of a set of options (listed in the following section "Importing and reducing input dataset").

The next important step is the dataset analysis, i.e., the check of the data distribution in the chemical and

Journal of Computational Chemistry 2013, 34, 2121–2132



Figure 1. QSARINS workflow.

experimental space, highlighting possible outliers and/or particular clusters (see sections "Dataset analysis" and "Data setup").

In the QSAR practice, a school of thought, and the one followed here, recommends developing and proposing models while always verifying their predictivity on new chemicals,[5,6,10–13] using a dataset, that is, the training set, for establishing the quantitative structure-activity relationship and then validating the model performances by means of one or more external datasets, i.e., prediction set(s) (which have not been used in the model calculations). The training and the prediction sets are extracted from the available experimental dataset, and QSARINS provides a set of tools for the data splitting procedure (see section "Data setup").

Once the data selection has been performed, the next step is the development of the QSAR model. Models are calculated in QSARINS by means of the MLR, using the OLS method (see section "Model calculation"). The model descriptors can be selected a priori by the modeler or by the software. Ideally, all the combinations of the available descriptors would be used for model calculation; however, the number of combinations is usually so big that it is impossible to calculate the models for them all. Thus, to reduce the time for model calculation, it is here proposed to use preliminarily all combinations (all subset models) only for a small number of descriptors per model, to explore all the low dimension combinations, and then to apply a genetic algorithm (GA)[24] procedure for the development of models based on a bigger number of descriptors (see section "Descriptors selection").

During the GA selection, a long list of calculated models is obtained: the next step is to select the best one. This analysis is complex, because model performances vary depending on the different criteria which are used to evaluate them. First, a model must have at least a high ability to reproduce the data used to calculate it, that is, a good fitting. Then, a model must be verified to have a high capacity to predict portions of the training dataset, using techniques collectively known as cross-validation (CV) or internal validation. If the model passes this verification, the model can be defined as robust and stable. To guarantee its ability in providing reliable predictions on new chemicals, it must have good performances in predicting the external dataset(s): this procedure is called external validation. All calculated models can be checked, selected, filtered (see section "View and select models"), and validated using different techniques within QSARINS (see section "Model validation").

After performing all these steps in QSARINS, a QSAR model can be selected as the best and various graphs for verifying the predicted data distribution, the errors, the AD, etc. can be plotted. The selection of the best model can be done considering various validation parameters altogether by the Multicriteria Decision Making (MCDM).[25] Since the GA-MLR approach yields several similarly good models, but based on different descriptors, these models can be cumulatively exploited to obtain better predictions by consensus.[26] QSARINS has two different tools, based on Principal Component Analysis (PCA), for the selection of the most diverse models to be averaged in a consensus modeling. The schematic workflow of the steps followed by QSARINS and the main available methods for each one are reported in Figure 1.

The aim of a model is also to be applicable for making predictions in the future when data for new chemicals become available: in QSARINS it is also possible to store any single model, even developed by other software, to validate it by various parameters, and to apply the model later to new chemicals.

Preparation, Analysis, and Setup of the Input Dataset

Importing and reducing input dataset

A typical input dataset for QSAR analysis is, in its basic form, a matrix containing the studied chemicals in the rows and the corresponding descriptors values (independent variables X, calculated by means of suitable software e.g., DRAGON,[22] PaDEL-Descriptor,[23] or others) and the experimental responses (dependent variable Y: the end-points to be modeled) in the columns.

To mitigate the redundancy of intercorrelated descriptors, a prereduction of descriptors is performed based on an objective selection that uses only the independent variables X. The


Figure 2. Import data dialog. More details in Supporting Information.

descriptors to discard are identified by: (1) tests of identical values (constant variables); (2) pair-wise correlations (according to a user defined cutoff value; suggestion: 95%), calculating the correlation among all couples of descriptors and, if a couple is found to be highly correlated, the descriptor with the higher correlation with the other descriptors is deleted. Descriptors can also be deleted if the percentage of the compounds sharing the same value is too high (suggestion: 80%).

This last option is useful for counter descriptors (e.g., number of carbon atoms, number of a functional group, etc.), for instance in cases where only few chemicals have a certain functional group that could be selected just for fitting the data, but would not be useful for future predictions.

Sometimes the input dataset contains special columns relative to the dataset splitting predefined by the modeler, reporting which compounds will be used for model calculations (1: training set), which will be used for its validation (2: prediction set(s)), and which could be excluded from computations (0: excluded). In these cases, it is possible to verify the correlation of the descriptors and the consequent selection only using the training set chemicals.

QSARINS provides all the aforementioned procedures, as reported in the Import data dialog of Figure 2.

Dataset analysis

Once data are imported, they can be inspected by various graphs (see examples in Fig. 3). In fact, before the data setup and modeling, it is useful and suggested to display the profile and distribution of the compounds in the dataset, in terms of structural and experimental domains. The distribution of the compounds for each descriptor and for each response is plotted/visualized in "View variables/objects profiles". The correlation among data, and the scatter plots of each couple of data, can also be visualized.

These kinds of data inspections can be performed at any step in the QSAR model development, to continuously check the quality of the data selected by the subsequent procedures.

To further explore the distribution of the studied compounds in the chemical structural space, a multivariate explorative analysis can be performed by PCA of molecular descriptors in the data setup section.

Data setup

In this section, the QSAR developer can select the compounds and the descriptors for the development of the model. The Data setup dialog allows for the selection of: (1) the molecular descriptors to include in the subsequent variable selection procedure, (2) the response (endpoint) to be modeled and (3) the status of the molecules under study (training, prediction, excluded) (see Fig. 4).

Every molecular descriptor can be normalized over the range of values of all the corresponding chemical compounds, by pressing the "Normalize Var" button.

At this step, it could be useful to explore the chemical space of the studied dataset by the PCA of the chosen descriptors (see Fig. 5). By means of a score plot (upper graph in Fig. 5), it is possible to check the existence of molecules clustered by similar structures and/or to detect strong structural outliers, while the loading plot (lower graph in Fig. 5) allows the identification of those descriptors that highly influence the occurrence of these clusters/outliers.

In addition, the QSAR developer (or the user) can apply in QSARINS three different kinds of splitting of the compounds in the dataset: (1) completely random splitting, setting a preferred random percentage of the compounds to be put aside in the prediction set; (2) by ordered response; (3) by structure, ordering the PC1 score of the molecular descriptors. In options 2 and 3, the first and the last chemicals in the training set are by default always included.
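
The three splitting options can be sketched in a few lines of Python with NumPy. This is only an illustration and not the QSARINS code: the function names, the `step` parameter, and the rule of sending every `step`-th compound of the ordered list to the prediction set are assumptions made here, since the text does not specify how compounds are picked along the ordering. The status codes (1: training, 2: prediction) follow the convention given above, and `pc1_scores` assumes that constant descriptors have already been removed.

```python
import numpy as np

TRAINING, PREDICTION = 1, 2  # status codes from the text (0 would mean excluded)


def random_split(n_compounds, prediction_fraction, seed=0):
    """Option 1: put a random fraction of the compounds in the prediction set."""
    rng = np.random.default_rng(seed)
    status = np.full(n_compounds, TRAINING)
    n_pred = int(round(prediction_fraction * n_compounds))
    status[rng.choice(n_compounds, size=n_pred, replace=False)] = PREDICTION
    return status


def ordered_split(values, step):
    """Options 2 and 3: order compounds by `values` (response or PC1 score).

    Every `step`-th compound of the ordered list goes to the prediction set
    (hypothetical rule, assumed for this sketch); the first and the last
    compounds of the ordering always stay in the training set.
    """
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    status = np.full(values.size, TRAINING)
    last = values.size - 1
    picked = [pos for pos in range(step - 1, values.size, step) if 0 < pos < last]
    status[order[picked]] = PREDICTION
    return status


def pc1_scores(X):
    """Scores on the first principal component of the autoscaled descriptors."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # assumes no constant column
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ vt[0]
```

A split by structure would then be `ordered_split(pc1_scores(X), step=3)`, and a split by ordered response `ordered_split(y, step=3)`. Keeping the extremes of the ordering in the training set means that the prediction compounds fall inside the range spanned by the training compounds along that ordering.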


Figure 3. View Data dialog. The main dialog on the upper position reports tabulated data and allows graphical display options. The dialog on the middle
is an example of data scatter plot in the space of two arbitrary molecular descriptors. The dialog on the lower position plots the distribution of the experi-
mental data.

At this step, the modeling descriptors could be selected either a priori by the QSAR developer, or automatically by the procedure explained in the "Descriptor selection" section.

Model Calculation Based on Selected Descriptors

Modeling method in QSARINS

The datasets used in QSAR analysis are, as already mentioned, composed of descriptors that should be correlated with the corresponding experimental responses. At this step it is necessary to apply a quantitative method able to find the existing relationship between a limited number of structural descriptors and the modeled response.

In QSARINS, the used method is the MLR approach that can be exemplified by the following formula:

\[ y_i = b_0 + \sum_{j=1}^{n} b_j x_{ij} + e_i \qquad (1) \]

where a linear relationship is computed between the studied responses ($y_i$) and the selected values of the descriptors ($x_{ij}$); $e_i$ is the random error (called also model residual). The intercept ($b_0$) and the coefficients ($b_j$) are thus to be estimated.

The eq. (1) can be rewritten in a more compact form using the matrix notation:

\[ y = Xb + e \qquad (2) \]

where $y$ is the responses vector, $b$ the vector of the coefficients, and $e$ the vector of the errors. $X$ is the matrix of the model, where the columns are the descriptors.

In this software, to estimate the vector of the coefficients, the OLS technique is used:

\[ \hat{b} = \left( X^{T} X \right)^{-1} X^{T} y \qquad (3) \]

where $\hat{b}$ is the vector that estimates the $b$ vector of the coefficients, $X^{T}$ the transposed $X$ matrix and $^{-1}$ is the inverse matrix operation.

The OLS minimizes the sum of squares of the difference between the experimental responses and the ones calculated by the model. To work properly, the OLS assumes that: (1) a linear relationship exists between the descriptors and the response, (2) the response errors are independent and similarly


Figure 4. Data setup dialog. More details in Supporting Information.

Figure 5. Examples of PCA of data. The upper dialog is a score plot of the molecular compounds while the lower dialog is the corresponding loading plot
of the selected descriptors. Both axes report the chosen PCA components.


distributed, (3) the descriptors are not too correlated among them (for this reason QSARINS controls this point both in the preselection step and by the QUIK rule,[27] see the section "Filtering during calculation"), (4) there are more compounds than modeling descriptors (a ratio that should be always higher than 5:1).

Once the coefficients of the model are calculated, it is possible to obtain the vector of the $\hat{y}$, as in the following formula:

\[ \hat{y} = X\hat{b} = X \left( X^{T} X \right)^{-1} X^{T} y = Hy \qquad (4) \]

where $H$ is the leverage (or hat) matrix that relates the calculated and the experimental responses. The diagonal elements of the hat matrix $h_{ii}$ are useful to determine the distance of the $i$ object from the centre of the chemical space of the model,[6–8,11,28] thus, for checking the structural AD of the model (see in the "Single model details" section).

Descriptor selection

After setting and exploring the data, a pool of molecules with the corresponding descriptors and the studied end-point becomes available for model calculation. QSAR models are characterized by their dimension, that is, the number of modeling descriptors, and a limited number of modeling descriptors, related to the studied response, must be selected from the available pool. Calculating all models corresponding to all descriptors combinations is feasible only for small dimensions (often 2–3): this technique is called "all subset" and is suggested in the QSAR practice, having verified in our experience the utility of exploring all the low dimension models before applying any selection tools.

If satisfying models are not obtained by the all subset procedure and higher dimension models are needed, a different method, able to find the best combinations without exploring them all, should be applied: in QSARINS this is done using a GA procedure.[24] This technique is able to explore a wide range of solutions, searching for the best ones, by maximizing (or minimizing) a selected fitness function. This is done mimicking the natural selection, where the best solutions replace the less performing. In biological terms, one would say that the best genes in the population replace the less fitting. In our case, every descriptor represents a gene, and a set of descriptors represents a chromosome. The fitness of a chromosome is related to the corresponding model performances. Starting with a pool of chromosomes, small subsets of chromosomes are picked at random, and the best become parents (this technique, used in QSARINS, is called tournament selection[24]). Couples of parent chromosomes are then crossed at a random position (crossing-over), thus obtaining the offspring, whose chromosomes are a mixing of the parent ones. If among the new chromosomes one or more of them outperform the less fitting in the parent population, these chromosomes will substitute the less performing. Repeating the aforementioned procedure many times, and introducing also random mutations (descriptor substitution) in the chromosomes, the result at the end of the process is a population of models with better performances than the models introduced at the beginning.

In order to avoid a completely random beginning of the GA, in QSARINS, the best set of descriptors extracted from the all subset procedure is used as the core of the chromosomes of the initial population.

In QSARINS, the tuning of the GA can be done varying the population size, the mutation rate, and the number of generations.

A fundamental option is the selection of the fitness function to be used by GA. This function is used to calculate the performance of each model: in QSARINS, $R^2_{adj}$, LOF,[29] Root Mean Square Error in cross validation ($RMSE_{CV}$), and $Q^2_{LOO}$ are provided (all formulas are reported in detail in Supporting Information). $R^2_{adj}$ and LOF detect the possible overfitting of a model so, used as fitness functions, are useful to select models with high fitting with the minimum number of descriptors. However, it is important to note that they are fitting criteria, so they provide no information on the predictive ability of the models. For this reason, it is here suggested to use $Q^2_{LOO}$ as fitness function for the selection of predictive models. The graph that plots $Q^2_{LOO}$ vs LOF can help in the selection of the models with the best compromise between high internal predictivity and small dimension.

Alternatively to the all subset and GA techniques, QSARINS also provides the possibility to add iteratively, one by one, all the descriptors to a known model (whose descriptors are kept fixed in the process). This allows checking whether there is an improvement in the model performances due to such additions.

Filtering during calculation

During a typical QSAR session, thousands of alternative models are calculated following the descriptors selection. In order to avoid useless cluttering of the final output model list, it is possible to exclude them beforehand if they do not pass a minimum quality test. If requested by the QSAR developer, the QUIK rule[27] can be applied. This method tests whether the total correlation among the block of descriptors ($K_{XX}$) is higher than the correlation among them and the responses ($K_{XY}$), that is, a model is excluded if $K_{XY} - K_{XX} < \delta K$, where $\delta K$ is a user defined threshold value. In fact, a model makes sense if the correlation among the descriptors is smaller than the correlation among them and the response, that is, the model has low multicollinearity and good correlation with the modeled response. Thus, $\delta K$ must be positive and should be as high as possible.

Model Exploration, Validation, and Selection

View and select models

After calculation and storage of the models, it is essential to display them in a form that makes their sorting and further selection (see Fig. 6) user friendly. In QSARINS, it is possible to sort the calculated models, visualizing them all or only a selected portion of the best models, according to various


Figure 6. View and select models dialog. More details in Supporting Information.

criteria: fitting (highest $R^2$), robustness (highest $Q^2_{LOO}$), stability (lowest $R^2 - Q^2_{LOO}$), the lowest descriptor correlation and the highest correlation with the response (low $K_{XX}$ and high $\delta K$), and the root mean squared errors on (a) training calculation ($RMSE_{TR}$), (b) training prediction by leave one out (LOO) ($RMSE_{CV}$), and (c) external prediction set ($RMSE_{EXT}$). It is obvious that the most stable, predictive, and more generalizable model must have the lowest difference between the fitting, the CV and the external parameters and, as a consequence, have RMSE values as similar as possible.

An additional and interesting criterion for an overall evaluation of the model population is to assess the relative importance of the modeling descriptors, evaluated by counting their occurrences among all listed models. The more often a descriptor is represented among models (even if in combination with other alternative descriptors), the more it is likely to be a good modeling descriptor. In addition, the more represented descriptors could also be used for a subsequent procedure, that is, an all subset calculation, to further explore other combinations.

It is here also proposed to prefer models with the lowest number of bad predictions (Y-outliers) and chemicals far from the training structural domain (X-outliers, out of the structural AD, also for unknown chemicals without experimental data). QSARINS allows for sorting the models according to these counts. The Y-outliers are those with standardized residuals higher than a user definable threshold (suggested 2.5–3$\sigma$), while X-outliers are those with a leverage value (taken from the hat matrix diagonal) higher than the cutoff value $h^{*} = 3p'/n$, where $p'$ is the number of model variables plus one, and $n$ is the number of the objects used to calculate the model.[6–8,11,28]

Model validation

Model calculation is the basic step in QSAR analysis, but is not sufficient to guarantee the model validity: it is essential to check various model performances, that is, fitting, stability by CV, and capability to predict new chemicals (external validation).[5] In addition, it is important to verify that the model is not obtained by chance. These performances are measured by appropriate formulas/methods of various validation criteria.[20,21]

The model fitting is evaluated by the coefficient of determination $R^2$, and a modified form $R^2_{adj}$ which also assesses the convenience of adding a new descriptor to a model. Both these criteria evaluate how well the model reproduces the data used to create it and this is an essential step in model evaluation. However, these criteria say nothing about the ability of the model to predict new data or whether the model is just the result of chance correlation. If a model is unstable, small changes in the data lead to big changes in the predictions, and the checking of model robustness is called internal validation.

A well known criterion for internal validation, widely used in the QSAR practice, is $Q^2_{LOO}$ applying the CV technique, that is, by iteratively excluding one compound from the dataset


Figure 7. Example of plot of LMO validations and Y-scrambled models compared with the original model.

(LOO), and then computing a model with the remaining compounds, making a prediction for the excluded one. If internal predictions are good, that is, $Q^2_{LOO}$ has a high value comparable to $R^2$, the model is considered to be internally stable or robust. However, a perturbation of only one compound at a time is too weak to demonstrate real model robustness. In QSARINS, the stronger Leave-More (or Many)-Out (LMO) technique is also included: this technique studies the behavior of the model when a larger number of compounds is excluded. Random samples of a defined percentage of compounds are excluded iteratively, the model is calculated using the remaining data, and then its performances are calculated predicting the excluded compounds. The model under analysis can be considered stable if the $R^2$ and $Q^2$ values calculated in every LMO iteration, and their averages ($R^2_{LMO}$ and $Q^2_{LMO}$), are close to the $R^2$ and $Q^2_{LOO}$ values of the model (see upper graph in Fig. 7).

To demonstrate that the model is not the result of chance correlation, the Y-scrambling procedure can be applied. In this procedure, the responses are shuffled at random, so no correlation between them and the descriptors should exist. As a consequence, the performances of the corresponding scrambled models should decrease drastically. In this case, if the original model under validation is good, the values of $R^2$ and $Q^2$ of every iteration, and their averages ($R^2_{YS}$ and $Q^2_{YS}$), must be far from, and much smaller than, the values of the original model (see lower graph in Fig. 7).

To compare the different $R^2$ and $Q^2$ values, the correlation between the model descriptors and the response ($K_{XY}$) is reported on the x axis of Figure 7. It is intuitive that the compared values must be very similar among them and also with comparable $K_{XY}$ values in the LMO procedure, while in Y-scrambling the scrambled models must not only have $R^2$ and $Q^2$ values lower than the original model, but also lower $K_{XY}$ values.

In addition to the aforementioned techniques, two variations of the classical Y-scrambling are applied in QSARINS: random responses and random descriptors.[30] In the random responses procedure, random values are picked within the range of the responses and used as end-points for new model calculations, instead of randomly shuffling the actual responses as in the classical Y-scrambling. The same logic is used in the random descriptors procedure, applied to demonstrate that models based on chance numbers are very different from the original model: random values are chosen within an arbitrary range, comparable with true values found for descriptors. As a consequence, for original successful models not obtained by


Figure 8. Single model dialog. More details in Supporting Information.

chance, such techniques destroy the structure among data and, as a consequence, the performances of the scrambled/random models are very low.

Once a model is internally validated and chance correlation is excluded, external validation can be performed, that is, the model is checked for its ability to predict new compounds. This is done by applying the model equation, obtained on the training set, to one or more prediction data set(s), that is, the excluded compounds that have never been used in model calculation, and measuring the performances by means of different criteria, such as: $RMSE_{EXT}$,[9] $Q^2_{F1}$,[15] $Q^2_{F2}$,[16] $Q^2_{F3}$,[17] $r^2_m$ plus $\Delta r^2_m$,[18] CCC,[19–21] and the Golbraikh and Tropsha method.[5] Such a set of different criteria, and not just one, is here calculated and compared because we have analyzed them all in depth and demonstrated that they are not always in agreement in measuring the model predictivity.[20,21] Thus, it would be best to investigate them all before selecting a model for its performances in external predictivity.

Since the models can be evaluated by means of different validation criteria, a method to evaluate model performances is to count how many validation criteria are fulfilled. This can be done by first setting a threshold of acceptance for every validation criterion (default values are set in QSARINS, as suggested in our previous work,[21] in a designated dialog box, but can be changed by the user).

However, it is important to note that the selection of the best models depends on the QSAR modeler preference, aims, and application. To help the user selection, exploiting at the same time all the available information regarding the model validation, the sorting of the models according to the MCDM procedure[25] is implemented in this software. This procedure can be applied for fitting, cross validation, and external validation criteria separately, or together.

After the above evaluations, a manual selection of the models can also be performed. The user can decide to choose only one model and to use it for prediction of new additional chemicals (in "Single model use and implementation" section) or to select a certain number of good models, with similar performances, but based on different descriptors, which could then be used in a subsequent consensus modeling (see section "Consensus modeling"), which yields a combined average model that usually provides better predictions.

Single model details

The QSAR developer needs to explore the models further in full detail. This can be done by: (1) looking at the model equation, (2) verifying the relevance of each descriptor by the standardized coefficients, (3) checking the confidence intervals for each descriptor, (4) verifying the values of the various criteria used for the validation, (5) looking at the statistics concerning all compounds both in tabulated and in plotted form, as reported in Figures 8 and 9. At this point all the information of the studied chemicals (e.g., ID, CAS, experimental values,

predictions, hat values, residuals, and standardized residuals) are displayed.

Figure 9. Upper left graph: scatter plot of experimental endpoint data versus data predicted by LOO. Upper right graph: plot of experimental data versus residuals (experimental − predicted data by LOO). Lower left graph: hat diagonal values versus standardized residuals (Williams plot). Lower right graph: hat diagonal values versus predicted data (Insubria graph).

A first supporting tool for the assessment of the model quality is the correlation matrix of the descriptors used: in addition to the already mentioned KXX value, it is advisable to verify the degree of correlation between each pair of modeling descriptors in each model.

A deeper analysis of a QSAR model can be done by visualizing a series of informative plots that are useful in finding specific anomalies:[21] (a) the scatter plot of the experimental versus the predicted data, (b) the plot of the residuals, (c) the Williams plot for the study of the model AD of training and prediction chemicals, highlighting the X- and Y-outliers, and (d) the Insubria graph for the study of the structural AD on compounds without experimental data.[4] All the above-mentioned graphs can be produced by QSARINS, either by applying the model equation or the LOO technique.

The scatter plot of the experimental versus the predicted responses (see upper left graph in Fig. 9) makes it easy to detect systematic trends or data clustering and strong outliers, if any, in the data.
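The quantities plotted in the Williams plot and the Insubria graph can be illustrated with a short Python sketch based on NumPy. This is not QSARINS code. The function names, the leverage cutoff h* = 3(p + 1)/n, the ±2.5 limit on standardized residuals, and the definition of a standardized residual as the residual divided by the training residual standard deviation are all assumptions chosen for illustration (they are common conventions, not values taken from this section).

```python
import numpy as np

def _design(X):
    """Add an intercept column to a descriptor matrix (rows = chemicals)."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(len(X)), X])

def hat_values(X_train, X_query):
    """Leverage h_i = x_i (A'A)^-1 x_i' of each query chemical,
    measured against the training design matrix A."""
    A, Q = _design(X_train), _design(X_query)
    inv = np.linalg.pinv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, inv, Q)

def williams_table(X_train, y_train, X_query, y_query, resid_limit=2.5):
    """Return predictions, leverages, standardized residuals and outlier flags.

    resid_limit and the cutoff h* = 3(p + 1)/n are illustrative assumptions.
    """
    A = _design(X_train)
    y_train = np.asarray(y_train, dtype=float)
    n, k = A.shape                      # k = p + 1 model parameters
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    # Residual standard deviation of the training fit (n - p - 1 d.o.f.)
    s = np.sqrt(np.sum((y_train - A @ coef) ** 2) / (n - k))
    pred = _design(X_query) @ coef
    std_res = (np.asarray(y_query, dtype=float) - pred) / s
    h = hat_values(X_train, X_query)
    h_star = 3.0 * k / n
    return {
        "pred": pred,                   # x-axis of the Insubria graph is h, y-axis is pred
        "h": h,                         # x-axis of the Williams plot
        "std_res": std_res,             # y-axis of the Williams plot
        "h_star": h_star,
        "x_outlier": h > h_star,        # structurally far from the training set
        "y_outlier": np.abs(std_res) > resid_limit,
    }
```

For chemicals without an experimental response only `hat_values` and the predictions are needed, which is why the Insubria graph can be drawn for them while the Williams plot cannot.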


The plot of the residuals (see upper right graph in Fig. 9) allows one to evaluate the deviations from the ideal match between experimental and predicted data and to detect anomalous trends.

The Williams plot detects the outliers for the response (Y-outliers) and those for the structure (X-outliers). It consists of plotting the standardized residuals on the y-axis and the leverage values from the hat matrix diagonal on the x-axis (see lower left graph in Fig. 9). Concerning the residuals, all the chemicals falling above or below the user-defined threshold are not well predicted and are thus considered outliers. Too many outliers, particularly underestimated ones, are symptomatic of a poor model, and this is the reason for implementing the counting of the outliers, as reported in the View and select models section. Leverage values represent the degree of influence that the structure of every single chemical has on the model. A compound with high leverage in a QSAR model is the driving force for the variable selection if this compound is in the training set (good leverage). A high-leverage compound in the prediction set is detected as far from the chemical domain of the training compounds, thus it could lead to unreliable predicted data, being the result of substantial extrapolation of the model. Therefore, the structural information of the chemicals included in the training set might not be sufficient for a reliable prediction of chemicals lying outside the training AD.

Any QSAR model can also be applied to predict the endpoint for molecules lacking an experimental response: it is always possible to evaluate their position in comparison to the structural AD and compare their predictions with those of the compounds that have experimental values. To visually explore them, the Insubria graph, which plots the hat diagonal values versus the predicted responses (see lower right graph in Fig. 9), has been recently proposed by our group.[4]

Consensus modeling

A single QSAR model is not likely to include all the aspects of the molecular structures, since it is built using a small subset of the available descriptors. An individual QSAR model may overemphasize some aspects, underestimate others, and ignore many important features completely. In the QSAR modeling approach based on GA-MLR, as implemented in QSARINS, a long list (a population) of models, with similar good performances and based on different descriptors, is available. The descriptors selected in the various models are usually different, thus potentially representing multiple aspects of the molecular structures. It is thus reasonable to make an average model or consensus model from the chosen list, in order to make more accurate predictions. This consensus modeling can take into account simultaneously several peculiar aspects of some particular chemical structures. Consensus QSAR modeling has already demonstrated its utility in many QSAR studies,[26,31,32] providing better predictive ability than the majority of individual models.

To capture as much variation as possible among the list of models, the selection of the more diverse models is feasible in QSARINS by the PCA of the prediction residuals[26] or of the different modeling molecular descriptors. The consensus predicted data are obtained by averaging the predictions of the selected models (both applying the model equation in fitting and in LOO cross validation). Supporting statistics, such as the standard deviations, the minimum and maximum values, and the average of the diagonal hat matrix values (these values indicate whether the chemicals are outside the AD of the consensus model), are also provided. The predictions from each single model can also be averaged using different weights in the Weighted Consensus Modeling.[33] Finally, the AD of the consensus model can be plotted as a Williams plot and/or an Insubria graph.

Single model use and implementation

The aim of the QSAR practice is to calculate a model able to make good predictions for new chemicals, including those not yet synthesized (chemical design). QSARINS eases this task in two steps. The first is the possibility to store any MLR-OLS model (also developed by other software) for later application. The second is the possibility to obtain predictions for new chemicals by automatically applying the model developed in QSARINS. In both cases it is possible to verify the model performances by the various validation criteria and plots available in the software.

The Persistent Bioaccumulative and Toxic (PBT) Index model, developed within our research group,[34] for the prediction of the cumulative behavior of new chemicals as PBTs, is provided by default as a special feature of QSARINS (current version 1.2). The original published PBT Index model was based on DRAGON descriptors. The molecular descriptors were selected by GA, and their ability to predict chemicals not participating in model development (training: 54 chemicals) was preliminarily verified on external chemicals (126 chemicals): Q²F1 = 80.72%; R²EXT = 89.27%; RMSEEXT = 0.72. We had also applied this model to an additional 70 chemicals without PBT Index values for the prediction of their PBT behavior.

The model now implemented in QSARINS software is the following, redeveloped using the same descriptors calculated by the open-source software PaDEL-Descriptor:[23]

PBT Index = −1.46 (±0.10) + 0.64 (±0.03) nX + 0.22 (±0.01) nBondsM − 0.39 (±0.07) nHBDon_Lipinski − 0.06 (±0.03) MAXDP2

n = 180; R² = 88.89%; Q²LOO = 88.25%; Q²LMO = 88.20%; R²YS = 0.02; RMSETR = 0.51

Additional models generated by the Insubria group and the related QSAR Model Reporting Format[35] will be included in future updated versions of QSARINS. Furthermore, a database of 3D chemical structures and experimental endpoints, collected and modeled by the Insubria group in the last 15 years, will also be added to QSARINS.

Conclusions

QSARINS is proposed as a new software for the calculation, analysis, validation, and application of QSAR MLR-OLS models
according to the OECD principles. It allows one to import and reduce input data by automated procedures and to visualize the results in both tabulated and graphical form. A flexible data setup system is available before the calculation of the models, including an explorative study of the chemical space by PCA (based on molecular descriptors) and different approaches for splitting the dataset into training and prediction sets. Descriptor selection is then performed by different methods, including the all-subset and GA techniques. Once the models are obtained, they can be listed and filtered using different methods. Models can be evaluated as a comparative list or singly for a more detailed analysis. Different validation techniques can also be applied to evaluate their performances. Strong emphasis has been placed on external validation, by comparing various criteria and combining them using the MCDM procedure, and on the visual inspection of the model data by informative plots.

After model evaluation, consensus modeling based on the best models can be performed in order to obtain predictions that are usually more reliable. The best models can be saved for future application. In QSARINS, it is possible to redevelop, check, validate, and apply to any new chemicals single MLR models previously developed, including by other tools.

The PBT Index model is implemented here for the detection of PBT chemicals from their structural information only. QSARINS, which runs under Microsoft Windows 2000 or more recent versions, is available free of charge for academia and research centers only, and can be requested from the authors by e-mail. Further details are available on the web site www.qsar.it and in the help function of the software.

Acknowledgement

Dr. Leon van der Wal is acknowledged for his critical comments on the manuscript and English revision.

Keywords: QSAR · modeling software · OLS regression · validation · plots

How to cite this article: P. Gramatica, N. Chirico, E. Papa, S. Cassani, S. Kovarich. J. Comput. Chem. 2013, 34, 2121–2132. DOI: 10.1002/jcc.23361

Additional Supporting Information may be found in the online version of this article.

Received: 9 April 2013
Revised: 3 June 2013
Accepted: 4 June 2013
Published online on 21 June 2013

[1] Regulation (EC) No 1907/2006 of the European Parliament and of the Council (18/12/2006) concerning REACH. Available at: http://ec.europa.eu/environment/chemicals/reach/reach_intro.htm, accessed on February 13, 2013.
[2] A. Tropsha, Mol. Inf. 2010, 29, 476.
[3] D. Fourches, E. Muratov, A. Tropsha, J. Chem. Inf. Model. 2010, 50, 1189.
[4] P. Gramatica, S. Cassani, P.P. Roy, S. Kovarich, C.W. Yap, E. Papa, Mol. Inf. 2012, 31, 817.
[5] A. Golbraikh, A. Tropsha, J. Mol. Graph. Model. 2002, 20, 269.
[6] A. Tropsha, P. Gramatica, V.K. Gombar, QSAR Comb. Sci. 2003, 22, 69.
[7] L. Eriksson, J. Jaworska, A. Worth, M. Cronin, R.M. McDowell, P. Gramatica, Environ. Health Perspect. 2003, 111, 1361.
[8] T.I. Netzeva, A.P. Worth, T. Aldenberg, R. Benigni, M.T.D. Cronin, P. Gramatica, J.S. Jaworska, S. Kahn, G. Klopman, C.A. Marchant, G. Myatt, N. Nikolova-Jeliazkova, G.Y. Patlewicz, R. Perkins, D.W. Roberts, T.W. Schultz, D.T. Stanton, J.J.M. van de Sandt, W. Tong, G. Veith, C. Yang, ATLA 2005, 33, 155.
[9] O.A. Aptula, N.G. Jeliazkova, T.W. Schultz, M.T.D. Cronin, QSAR Comb. Sci. 2005, 24, 385.
[10] T. Oberg, Atmos. Environ. 2005, 39, 2189.
[11] P. Gramatica, QSAR Comb. Sci. 2007, 26, 694.
[12] T. Puzyn, A. Gajewicz, A. Rybacka, M. Haranczyk, Struct. Chem. 2011, 22, 873–884.
[13] T. Puzyn, A. Mostrag-Szlichtyng, A. Gajewicz, M. Skrzynski, A.P. Worth, Struct. Chem. 2011, 22, 795.
[14] OECD Principles (2004). Available at: http://www.oecd.org/dataoecd/33/37/37849783.pdf, accessed on February 13, 2013.
[15] M. Shi, H. Fang, W. Tong, J. Wu, R. Perkins, R.M. Blair, W.S. Branham, S.L. Dial, C.L. Moland, D.M. Sheehan, J. Chem. Inf. Comput. Sci. 2001, 41, 186.
[16] G. Schüürmann, R. Ebert, J. Chen, B. Wang, R. Kühne, J. Chem. Inf. Model. 2008, 48, 2140.
[17] V. Consonni, D. Ballabio, R. Todeschini, J. Chemometrics 2010, 24, 194.
[18] P.K. Ojha, I. Mitra, R.N. Das, K. Roy, Chemom. Intell. Lab. Syst. 2011, 107, 194.
[19] L.I. Lin, Biometrics 1989, 45, 255.
[20] N. Chirico, P. Gramatica, J. Chem. Inf. Model. 2011, 51, 2320.
[21] N. Chirico, P. Gramatica, J. Chem. Inf. Model. 2012, 52, 2044.
[22] DRAGON for Windows (Software for Molecular Descriptor Calculation), Talete srl. Available at: http://www.talete.mi.it, accessed on February 13, 2013.
[23] PaDEL-Descriptor. Available at: http://padel.nus.edu.sg/software/padeldescriptor, accessed on February 13, 2013.
[24] R.L. Haupt, S.E. Haupt, In Practical Genetic Algorithms, 2nd ed.; Wiley, New Jersey, 2004.
[25] H.R. Keller, D.L. Massart, J.P. Brans, Chemom. Intell. Lab. Syst. 1991, 11, 175.
[26] P. Gramatica, P. Pilutti, E. Papa, J. Chem. Inf. Comput. Sci. 2004, 44, 1794.
[27] R. Todeschini, V. Consonni, A. Maiocchi, Chemom. Intell. Lab. Syst. 1999, 46, 13.
[28] A.C. Atkinson, In Plots, Transformations and Regression; Clarendon Press, Oxford, 1985.
[29] J.H. Friedman, Ann. Stat. 1991, 19, 1.
[30] C. Rucker, G. Rucker, M. Meringer, J. Chem. Inf. Model. 2007, 47, 2345.
[31] H. Zhu, A. Tropsha, D. Fourches, A. Varnek, E. Papa, P. Gramatica, T. Oberg, P. Dao, A. Cherkasov, I.V. Tetko, J. Chem. Inf. Model. 2008, 48, 766.
[32] B. Bhhatarai, W. Teetz, T. Liu, T. Oberg, N. Jeliazkova, N. Kochev, O. Pukalov, I. Tetko, S. Kovarich, E. Papa, P. Gramatica, Mol. Inf. 2011, 30, 189.
[33] J. Li, B. Lei, H. Liu, S. Li, X. Yao, M. Liu, P. Gramatica, J. Comput. Chem. 2008, 29, 2636.
[34] E. Papa, P. Gramatica, Green Chem. 2010, 12, 836.
[35] ECHA report. Available at: http://echa.europa.eu/documents/10162/13655/pg_report_qsars_en.pdf, accessed April 8, 2013.
