Dearing Fundamentals of Chemo Metrics

Fundamentals of Chemometrics
and Modeling
Dr. Tom Dearing
CPAC, University of Washington
Outline
• Fundamentals of Chemometrics
– Introduction to Chemometrics
– Measurements
– The Data Analysis Procedure
• Basic Modeling
– Principal Component Analysis
– Scores and Loadings
• Advanced Modeling
– Partial Least Squares
– Latent Variables
– Scores and Loadings
– Calibration and Validation
– Prediction
• Case Study
Section 1
Through the looking glass…..

Chemometrics
• Chemometrics is:
The science of extracting information from measurements
made on chemical systems with the use of mathematical and
statistical procedures.
• Keywords and phrases:
data analysis, data processing, univariate, multivariate, variance, modeling, scores, loadings, calibration and validation, predictions, real time decision making.
Measurements
• Measurements come in many different forms.
– Spectroscopic
• Near IR, Fluorescence, Raman.
– Chromatographic
• Gas Chromatography, HPLC.
– Physical
• Temperature, Pressure, Flow rate, Melting Points, Viscosity, Concentrations.
• All measurements yield data.
• NIR data set containing 255 spectra measured at 650 different wavenumbers has 165750 data points!!
[Figure: Near IR Tablet Data, Signal Intensity vs. Wavenumber (cm-1)]
[Figure: Intensity (counts) vs. Wavelength (nm)]
Two Types Of Data
• Univariate
– One variable to measure
– One variable to predict
– Typically select one wavelength and monitor change of absorbance over time.
– Wavelength must not have contributions or overlapping from other peaks.
• Multivariate
– Multiple variables
– Multiple predictions
– Typically use entire spectra.
– Allows investigation into the relationship between variables.
– Allows revealing of latent variation within a set of spectra.
Multivariate Analysis
• Analysis performed on multiple sets of measurements, wavelengths, samples and data sets.
• Analysis of variance and dependence between variables is crucial to multivariate analysis.
The Chemometrics Process
• All chemometrics begins with taking a measurement and collecting data.
• Mathematical and statistical methods are employed to extract relevant information from the data.
• The information is related to the chemical process to extract knowledge about a system.
• Finally, the knowledge provided allows comprehension and understanding of a system.
• Understanding facilitates decision making.
[Figure: stages of the process: 1. Measurement, 2. Data, 3. Information, 4. Knowledge, 5. Understanding]
Converting Data to Information
• Advances in measurement science mean the rate of data collection is extremely fast.
• Large amounts of data are produced.
• Data rich, information poor.
• Chemometrics is used to remove redundant data, reduce variation not relating to the analytical signal and build models.
Data Analysis Flow Chart
INPUT → OUTLIER REMOVAL → PREPROCESSING → DATA ANALYSIS → OUTPUT
Input
• Most overlooked stage of data analysis.
• Most critical stage of all.
• Data must be converted or transferred into the analysis software.
• Proprietary collection software makes this task difficult.
• However, some analysis software has excellent data importing functionality.
Outliers – Problems and Removal
• Removing outliers is a delicate procedure.
• Grubbs' test is used to detect outliers.
• Frequently requires knowledge about the process being examined.
• False outliers: samples at the extremes of the system that appear infrequently within the data.
– These are NOT REMOVED
• True outliers: samples or variables that are statistically different from the other samples.
– These ARE REMOVED
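As a rough illustration of the Grubbs' test mentioned above, the sketch below computes the Grubbs statistic G (largest absolute deviation from the mean, divided by the sample standard deviation) for a single variable. The measurements are hypothetical, and the critical value is an assumption taken from a standard two-sided Grubbs table for N = 10 at the 5% level; a different N needs a different tabulated value. A flagged sample should still be checked against process knowledge before removal.

```python
import statistics

def grubbs_statistic(values):
    """Return (G, index) for the value furthest from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    deviations = [abs(v - mean) for v in values]
    idx = max(range(len(values)), key=deviations.__getitem__)
    return deviations[idx] / sd, idx

# Hypothetical replicate measurements; the last one looks suspicious.
data = [4.85, 4.90, 4.88, 4.92, 4.87, 4.91, 4.89, 4.86, 4.90, 5.60]

# Assumed two-sided critical value for N = 10, alpha = 0.05 (from a table).
G_CRIT_N10 = 2.290

g, idx = grubbs_statistic(data)
is_outlier = g > G_CRIT_N10
print(f"G = {g:.3f}, suspect index = {idx}, flagged = {is_outlier}")
```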
Preprocessing
• Preprocessing
– Main goal of the preprocessing stage is to remove variation within the data that does not pertain to the analytical information.
• Typical preprocessing methods
– Baseline Correction
– Mean Centering
– Normalization
– Orthogonal Signal Correction
– Multiplicative Scatter Correction
– Savitzky-Golay Derivatisation
[Figure: Near IR Tablet Data before and after MEAN CENTRING. Signal Intensity and Mean Centred Signal Intensity vs. Wavenumber (cm-1)]
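Two of the simpler preprocessing steps can be sketched in a few lines. This is a minimal illustration using NumPy, assuming spectra are stored as rows of a matrix X (samples by variables); the function names and the tiny data matrix are hypothetical. Autoscaling is included because it appears later alongside mean centring in the loadings comparison.

```python
import numpy as np

def mean_center(X):
    """Subtract the mean of each variable (column) from every spectrum."""
    mean = X.mean(axis=0)
    return X - mean, mean

def autoscale(X):
    """Mean center, then divide each variable by its standard deviation."""
    Xc, mean = mean_center(X)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant variables at zero instead of dividing by 0
    return Xc / std, mean, std

# Hypothetical data: 3 samples, 3 variables (the last variable is constant).
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 12.0, 5.0],
              [3.0, 14.0, 5.0]])
Xc, mean = mean_center(X)
Xa, _, std = autoscale(X)
print(Xc)
print(Xa)
```

The stored mean (and standard deviation) must be reused when preprocessing new measurements, so that calibration and prediction data are treated identically.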
Data Analysis
• Many different methods for performing multivariate data analysis.
• Principal Component Analysis
– Section 2
• Partial Least Squares
– Section 3
• MCR
• Neural Networks
[Figure: Scores Plot, with Scores on PC 2 (12.88%) on the vertical axis]
Output
• Qualitative
– Classification models.
– Does a sample belong to a group or not?
– Calibration and Validations
– Classifications
– Classification error
– Number of samples classified correctly
• Quantitative
– Prediction models
– What is the concentration of the sample?
– Calibration and Validations
– Predictions
– Calibration and Prediction Errors
– RMSEC and RMSEP
Error
• Many different methods of calculating errors.
• The method used is critical, as model quality is determined by the error.
• The procedure used can heavily influence model errors. (Discussed later in PCA section).
• The choice of error metric depends on many different factors.
• Top Three
– What are you showing?
– What is the range of data?
– How many samples do you have?
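The RMSEC and RMSEP quoted in this deck are both root mean square errors; they differ only in which samples are used (calibration samples for RMSEC, validation samples for RMSEP). A minimal sketch, assuming the plain definition with division by the number of samples (some software applies a degrees-of-freedom correction to RMSEC instead), and using hypothetical concentration values:

```python
import math

def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    squared = [(p - r) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical lab (reference) values and model predictions.
reference = [4.80, 4.90, 5.00, 5.10]
predicted = [4.82, 4.87, 5.03, 5.08]
error = rmse(reference, predicted)
print(f"RMSE = {error:.4f}")
```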
Summary
• Chemometrics is a method of extracting relevant information from complex chemical data.
• Multivariate data allows robust investigation of overlapping signals.
• Multivariate analysis allows investigation of the relationship between variables.
• The chemometrics process yields understanding and comprehension of the process under investigation.
Summary
• Data analysis is a multistep procedure involving many algorithms and many different paths to go down.
• The end result of data analysis is commonly a model that can provide qualitative or quantitative information.
• MATLAB and PLS_Toolbox are software packages used to perform chemometrics analysis.
Section 2
Principal Component Analysis

P.C.A.
PCA
• Method of reducing a set of data into three new sets of variables
– Principal Components (PC’s)
– Scores
– Loadings
• Using these three new sets of variables, latent variation can be revealed and examined.
• Incredibly important for investigating the relationships between samples and variables.
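One common way to obtain scores and loadings is the singular value decomposition of the mean-centred data. The sketch below is an illustration under that assumption, not necessarily the routine used to produce the plots in this deck; the function name and the synthetic "spectra" are hypothetical.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA of mean-centred data via SVD; returns scores, loadings, variance fractions."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # one column per PC
    loadings = Vt[:n_components].T                     # one column per PC
    explained = (s ** 2) / np.sum(s ** 2)              # fraction of variance per PC
    return scores, loadings, explained[:n_components]

rng = np.random.default_rng(0)
# Synthetic stand-in for spectra: 20 samples, 50 variables, two underlying sources.
concentrations = rng.uniform(0, 1, size=(20, 2))
profiles = rng.normal(size=(2, 50))
X = concentrations @ profiles + 0.01 * rng.normal(size=(20, 50))

scores, loadings, explained = pca_svd(X, n_components=2)
print("variance captured per PC:", np.round(explained, 4))
```

Plotting the first column of `scores` against the second gives a scores plot; plotting a column of `loadings` against variable number gives a loadings plot.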
PCA
• NIR spectra run through a PCA routine without any form of preprocessing.
• Scores produced show apparent variation in concentration.
• Loadings illustrate the mean spectra, suggesting that preprocessing should be used.
[Figure: three panels. SPECTRAL DATA: Near IR Tablet Data, Signal Intensity vs. Wavenumber (cm-1). SCORES: Samples/Scores Plot, Scores on PC 1 (99.93%). LOADINGS: Variables/Loadings Plot, Loadings on PC 1 (99.93%) vs. Variable.]
Principal Components
• Each principal component calculated captures as much of the variation within the data as possible.
• This variation is removed and a new principal component is determined.
• The first PC describes the greatest source of variation within the data.
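The "capture, remove, repeat" description above matches how the NIPALS algorithm extracts components one at a time: each PC is found by iteration, then its contribution is subtracted (deflation) before the next PC is sought. The sketch below is one possible implementation on synthetic data; the function name, tolerance and data are assumptions made for illustration.

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
    """Sequential PCA by NIPALS with deflation after each component."""
    E = X - X.mean(axis=0)            # residual matrix, starts as the centred data
    total = np.sum(E ** 2)
    scores, loadings, explained = [], [], []
    for _ in range(n_components):
        t = E[:, np.argmax(E.var(axis=0))].copy()  # start from the most variable column
        for _ in range(max_iter):
            p = E.T @ t / (t @ t)
            p /= np.linalg.norm(p)
            t_new = E @ p
            converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        E = E - np.outer(t, p)        # remove the variation captured by this PC
        scores.append(t)
        loadings.append(p)
        explained.append((t @ t) / total)
    return np.column_stack(scores), np.column_stack(loadings), np.array(explained)

rng = np.random.default_rng(0)
# Synthetic data with one strong and one weak source of variation.
T = rng.normal(size=(30, 2)) * np.array([5.0, 1.0])
P = rng.normal(size=(2, 40))
X = T @ P
scores, loadings, explained = nipals_pca(X, n_components=2)
print("fraction of variation per PC:", np.round(explained, 4))
```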
Scores
• The scores are organized in a column fashion.
• The first column denotes the scores relating to the variation captured on PC1.
• Inter-sample relationships can be observed by plotting the scores from PC1 against PC2.
• This can be expanded to the scores of the first three PC’s.
Scores
[Figure: Samples/Scores Plots of aldat. Top row: Scores on PC1, Scores on PC2 and Scores on PC3, each plotted against Sample. Bottom row: Scores of PC1 vs. PC2, Scores of PC1 vs. PC3, and Scores of PC1 vs. PC2 vs. PC3. Axes: Scores on PC 1 (52.89%), Scores on PC 2 (29.86%).]
Loadings
• Illustrate the weight or importance of each variable within the original data.
• From loadings it is possible to see the most significant variables.
• Loadings can be used to track the progress of a reaction, e.g. monitor reactant consumption.
• Deduce variables responsible for the clustering in the scores.
Loadings
[Figure: Variables/Loadings Plots (loadings vs. Variable) for three cases: NO PREPROCESSING, MEAN CENTRING and AUTO SCALING.]
Outlier Removal
• PCA can be used in conjunction with confidence intervals to identify outliers within a set of data.
[Figure: two Samples/Scores Plots with Scores on PC 1 (81.38%) on the horizontal axis, one drawn with a 95% Confidence Interval and one with a 99.9% Confidence Interval.]
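A confidence limit in the scores plane can be approximated with an ellipse. The sketch below assumes the scores on the two PCs are uncorrelated and uses a large-sample chi-square limit with 2 degrees of freedom (which has the closed form -2 ln(1 - confidence)) instead of the F-based Hotelling T² limit that chemometrics packages often report. The scores are hypothetical; the point is that the same sample can be flagged at 95% and accepted at 99.9%.

```python
import math
import statistics

def score_outliers(t1, t2, confidence=0.95):
    """Flag samples outside a confidence ellipse in the PC1/PC2 scores plane."""
    # Chi-square quantile for 2 degrees of freedom (closed form).
    limit = -2.0 * math.log(1.0 - confidence)
    m1, m2 = statistics.fmean(t1), statistics.fmean(t2)
    v1, v2 = statistics.variance(t1), statistics.variance(t2)
    flags = [((a - m1) ** 2 / v1 + (b - m2) ** 2 / v2) > limit
             for a, b in zip(t1, t2)]
    return flags, limit

# Hypothetical scores on PC1 and PC2; the last sample sits far out on PC1.
pc1 = [-1.2, -0.8, -0.5, -0.3, 0.1, 0.2, 0.4, 0.6, 0.9, 6.0]
pc2 = [0.3, -0.4, 0.5, -0.2, 0.1, -0.6, 0.4, -0.3, 0.2, 0.0]

flags_95, limit_95 = score_outliers(pc1, pc2, confidence=0.95)
flags_999, limit_999 = score_outliers(pc1, pc2, confidence=0.999)
print("flagged at 95%:  ", [i for i, f in enumerate(flags_95) if f])
print("flagged at 99.9%:", [i for i, f in enumerate(flags_999) if f])
```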

Summary
• PCA is used to decompose the data into scores and loadings.
• Scores reveal information about between-sample variation.
• Loadings tell us which variables from within the original data contribute most to the scores.
• PCA can also be used to analyze and investigate data to perform tasks such as outlier removal.
• PCA facilitates process understanding.
Section 3
Partial Least Squares

Inverse Calibration
• Calibration Equation:
y = Xb
y is concentration data, X is spectra and b is the produced model.
• Calibration requires only spectra and a calibration property, such as a concentration.
• Demanding strategy, as assumptions are made about the errors.
• Requires good lab data.
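In its simplest form, the calibration equation y = Xb can be solved for b by least squares. The sketch below uses NumPy on synthetic data, with more samples than variables so that a unique solution exists; all names and values are hypothetical. With full spectra there are usually more variables than samples, and a direct solution is no longer unique, which is one motivation for latent variable methods such as PLS.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical calibration set: 30 spectra measured at 5 wavelengths.
X = rng.normal(size=(30, 5))
b_true = np.array([0.5, -1.0, 0.0, 2.0, 0.3])
y = X @ b_true + 0.01 * rng.normal(size=30)   # reference values with small lab error

# Solve y = Xb for b in the least-squares sense.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict the property for a new measurement.
x_new = rng.normal(size=(1, 5))
y_pred = x_new @ b
print("estimated b:", np.round(b, 3))
print("prediction for new sample:", y_pred)
```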
PLS
• Partial Least Squares (PLS) is an extension of the PCA method.
• PCA extracts PC’s describing the sources of variation within the data.
• PLS calculates Latent Variables (LV’s) that describe the variation in the data which is correlated with the Y-Block information.
• Y-Block information is typically sample concentrations or physical properties.
• PLS is a quantitative procedure and can be used to model and predict Y-Block information for future samples.
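For a single predicted property, PLS can be sketched with the NIPALS PLS1 algorithm: each latent variable is a direction in the X-Block chosen for its covariance with the Y-Block, and both blocks are deflated before the next LV is extracted. The code below is a minimal illustration on synthetic data, not the PLS_Toolbox implementation; function names and data are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """Fit a PLS1 model with n_lv latent variables; returns regression vector and means."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean          # centred X-Block and Y-Block residuals
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = E.T @ f                        # weight: direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = E @ w                          # scores for this latent variable
        tt = t @ t
        p = E.T @ t / tt                   # X loadings
        q_a = f @ t / tt                   # y loading
        E = E - np.outer(t, p)             # deflate X-Block
        f = f - q_a * t                    # deflate Y-Block
        W.append(w)
        P.append(p)
        q.append(q_a)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)    # regression vector in original variables
    return b, x_mean, y_mean

def pls1_predict(X_new, b, x_mean, y_mean):
    return (X_new - x_mean) @ b + y_mean

rng = np.random.default_rng(2)
C = rng.uniform(0, 1, size=(40, 2))        # two component concentrations
S = rng.normal(size=(2, 100))              # their synthetic "pure spectra"
X = C @ S + 0.001 * rng.normal(size=(40, 100))
y = C[:, 0]                                # property to predict

b, x_mean, y_mean = pls1_fit(X, y, n_lv=2)
y_fit = pls1_predict(X, b, x_mean, y_mean)
rmsec = float(np.sqrt(np.mean((y_fit - y) ** 2)))
print(f"RMSEC with 2 LVs: {rmsec:.5f}")
```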
The X- and Y-Block
• PLS uses X-Block and Y-Block information.

• X-Block tends to refer to spectra.
• Y-Block relates to the information you want to
predict, such as concentration or some
physical property.
• Y-Block data is normally collected offline in a
lab.
• Y-Block is often referred to as the reference
method.
PLS Data Analysis
[Flow chart]
Calibration: INPUT SPECTRA (X-Block) → PREPROCESSING → PLS CALIBRATION MODEL
Calibration: CONCENTRATIONS (Y-Block) → PREPROCESSING → PLS CALIBRATION MODEL
Prediction: NEW MEASUREMENT DATA → PLS PREDICTION MODEL → CONCENTRATIONS FOR NEW MEASUREMENTS
Difference between PLS and PCA
• PCA
– Classification
– Exploratory analysis of data.
– PC’s extracted describe sources of variation in order of significance.
– Used for the removal of outliers
• PLS
– Quantification
– Prediction
– Modeling of current and future samples.
– Latent variables important factor in determining model performance.
Calibration
• Building a calibration model requires retaining as much relevant variation as possible, whilst removing as much irrelevant variation as possible.
• Selecting calibration data is VITAL to final predictions.
• Use Design of Experiments (DoE) to effectively map a data space or series of experiments.
• Quality of calibration is determined by calculating the Root Mean Square Error in Calibration (RMSEC).
Selecting Samples For Calibrations
• Design of Experiments
– Use optimal methods to effectively map the data
– Methods such as D-Optimal, E-Optimal and Kennard-Stone.
– These methods only need to be run once.
• Random Subsets
– Select a set of samples entirely at random.
– Perform analysis and calculate errors.
– Re-select a new random subset and repeat procedure
for a number of iterations
– Calculate average errors at the end.
• Visual depiction of data
[Figure: DATA SET]
• D-Optimal
– Samples selected according to D-Optimal criteria.
• Kennard-Stone
– Samples selected in an attempt to uniformly map the data.
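The Kennard-Stone idea of uniformly mapping the data can be sketched as follows: start with the two samples furthest apart, then repeatedly add the sample whose distance to its nearest already-selected sample is largest. This is a minimal illustration using Euclidean distance and assuming at least two samples are requested; the function name and the one-dimensional data are hypothetical.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule (n_select >= 2)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]                      # the two most distant samples
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select and remaining:
        # Distance from every remaining sample to its nearest selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

# Hypothetical data: six samples described by a single variable.
X = np.array([[0.0], [0.1], [0.2], [0.5], [0.9], [1.0]])
calibration_idx = kennard_stone(X, n_select=4)
print(calibration_idx)
```

Because the selection is deterministic, it only needs to be run once, in contrast to the random subset approach.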
Validation
• Validation data is used to check the predictive performance of the model.
• Validation can be performed using subsets of the calibration data (Cross Validation).
• Separate validation sets of data can be collected (True Validation).
• Cross validation can lead to overly optimistic results.
• Quality of validation is calculated using the Root Mean Square Error in Prediction (RMSEP).
• Quality of predictions determines quality of model.
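Cross validation can be sketched as a loop that repeatedly holds out part of the calibration data, rebuilds the model on the rest, and predicts the held-out samples. The example below uses k-fold splitting and, purely as a stand-in for a PLS model, ordinary least squares with an intercept; the function name, fold count and synthetic data are assumptions.

```python
import numpy as np

def kfold_rmsecv(X, y, n_folds=5, seed=0):
    """Root mean square error of cross validation for a least-squares model."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    folds = np.array_split(order, n_folds)
    squared_errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(order, test_idx)
        # Stand-in model: ordinary least squares with an intercept column.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        squared_errors.extend((A_test @ coef - y[test_idx]) ** 2)
    return float(np.sqrt(np.mean(squared_errors)))

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = 1.0 + X @ np.array([0.5, -0.2, 0.1]) + 0.05 * rng.normal(size=40)
rmsecv = kfold_rmsecv(X, y, n_folds=5)
print(f"RMSECV = {rmsecv:.4f}")
```

A separately collected validation set would be evaluated in the same way, but with the model built once on all calibration samples.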
Modeling
• The quality of calibrations and validations can vary significantly with the number of LV’s included in the model.
• Too few and the model will make poor predictions, as there is insufficient information in the calibration.
• Too many and the model becomes over-fitted: it also describes variation that is not relevant, making it not robust to small amounts of variation in new data.
Modeling
[Figure: Error vs. Number of Latent Variables, showing the RMSEC and RMSEP curves, with the ideal number of latent variables for the model marked.]
Model Maintenance
• We’ve built the model: So what next? MODEL MAINTENANCE
• Collect lab data weekly to re-validate the model.
– Are model results within acceptable error?
– If not, what do we do?
• Re-evaluate calibration samples
– Is the calibration model still relevant?
• Perform DoE to re-select more data.
• Check LV model to make sure an appropriate number of LV’s is being used.
• Continual improvement.
Summary
• PLS implements inverse calibration to incorporate concentration information into a model.
• Makes quantitative predictions of unseen samples.
• Requires calibration and validation.
• The number of latent variables has a significant effect on the model.
• Quality of model determined by prediction and the RMSEP.
Case Study
Model Building From Beginning to End
Case Study 1
• Near IR spectra of tablets collected over a period of 4 years.
• GC analysis of tablets showed active pharmaceutical ingredient within specification for all samples.
[Figure: left, NIR spectra, Signal Intensity vs. Wavenumber (cm-1); right, histogram of Number of Samples vs. Tablet API Concentration.]
The Problem
• The NIR calibration model produced has determined that 32% of samples are out of specification.
• The Plan: Use PCA to investigate and examine the spectra to improve the NIR calibration.
Data Analysis Plan
NIR TABLET DATA → OUTLIER REMOVAL → MEAN CENTRING → PCA → SCORES, LOADINGS
NIR Data – Visual Inspection
[Figure: NIR spectra, Signal Intensity vs. Wavenumber (cm-1), annotated with VARIATION IN BASELINE and DETECTOR NOISE.]
Pre-processing 1
• Data mean centered to reduce the magnitude of some variables.
• After mean centering, a large peak is visible between 1350cm-1 and 1700cm-1.
[Figure: left, NIR spectra, Signal Intensity vs. Wavenumber (cm-1); right, Mean Centred NIR Spectra, Mean Centred Signal Intensity vs. Wavenumber (cm-1).]
Mean Centered Scores
• Strange distribution of scores.
• Samples that should all be the same should theoretically form one group.
• However, 6 clusters formed.
• Further investigation found 6 different tablet presses had been used.
[Figure: Samples/Scores Plot.]
Mean Centered Loadings
• Loadings on PC1 show that the variables after 400 contribute little information or noise to the scores.
• Spectra truncated at variable 400, which is 1398cm-1
[Figure: Variables/Loadings Plot, loadings vs. Variable.]
Scatter Correction
• Investigation into the manufacturing procedure revealed tablets were made using different presses.
• This caused minor variations in the tablet depth.
• This altered the pathlength and scattering of the NIR radiation.
• Preprocessing must be applied to minimize the variation in the data due to the change in tablet depth.
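Multiplicative Scatter Correction, the preprocessing method chosen in the next step of this case study, can be sketched as follows: each spectrum is regressed against a reference spectrum (here assumed to be the mean spectrum), and the fitted offset and slope are then removed. The function name and the synthetic spectra are hypothetical; the synthetic set differs only by offset and gain, mimicking a pathlength or scatter effect.

```python
import numpy as np

def msc(X, reference=None):
    """Multiplicative scatter correction of spectra stored as rows of X."""
    ref = X.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(X, dtype=float)
    for i, spectrum in enumerate(X):
        # Fit spectrum = intercept + slope * ref, then undo both effects.
        slope, intercept = np.polyfit(ref, spectrum, deg=1)
        corrected[i] = (spectrum - intercept) / slope
    return corrected, ref

# Synthetic spectra: the same underlying shape with different offsets and gains.
base = 2.0 + np.sin(np.linspace(0, 3, 50))
offsets = np.array([0.0, 0.5, -0.3])
gains = np.array([1.0, 1.2, 0.8])
X = offsets[:, None] + gains[:, None] * base

corrected, ref = msc(X)
print("spread before:", X.std(axis=0).max())
print("spread after: ", corrected.std(axis=0).max())
```

The reference spectrum from the calibration set should be stored and reused when correcting new measurements.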
Data Analysis Plan 2
NIR TABLET DATA → VARIABLE REMOVAL → MULTIPLICATIVE SCATTER CORRECTION → MEAN CENTRING → PCA → SCORES, LOADINGS
Scatter Correction
[Figure: UNCORRECTED SPECTRA and SCATTER CORRECTED SPECTRA, each plotted as Signal Intensity vs. Wavenumber (cm-1).]
New Scores
• After performing the new stages of preprocessing the new scores (red triangles) have formed one tight cluster showing that variation not relating to the API concentration has been removed.
[Figure: Samples/Scores of Original and Scatter Corrected Data.]
What Next?
Partial Least Squares

PLS Modeling Strategy
• Stage One: Build calibration model
[Flow chart: INPUT SPECTRA are split into CALIBRATION SPECTRA and VALIDATION SPECTRA; CONCENTRATIONS are split into CALIBRATION CONCENTRATION and VALIDATION CONCENTRATION. CALIBRATION SPECTRA and CALIBRATION CONCENTRATION each pass through PREPROCESSING into the PLS CALIBRATION MODEL.]
[Figures: input spectra, Signal Intensity vs. Wavenumber (cm-1); API Concentration vs. Sample Number.]
PLS Calibration Model
• Large number of LV’s used to produce the best calibration model.
• Too many LV’s can cause ‘over-fitting’.
• RMSEC = 0.03539
• Error of 0.723% of the mean API concentration.
[Figure: Samples/Scores Plot, API Calibrated vs. API Measured.]
PLS Modeling Strategy
• Stage Two: Validate Calibration Model.
[Flow chart: VALIDATION SPECTRA and VALIDATION CONCENTRATION each pass through PREPROCESSING into the PLS CALIBRATION MODEL; VARY LATENT VARIABLES and record RMSEC and RMSEP.]
LV Model
• Varying the number of LV’s used in the model led to the conclusion that 7 LV’s will give the best predictions.
[Figure: Error (RMSEC and RMSEP) vs. Number of Latent Variables.]
PLS Validation Model
• Using 7 LV’s the validation data was applied to the calibration model to determine the RMSEP.
• Sacrifice calibration to ensure better predictions
• RMSEC = 0.050381
• RMSEP = 0.053719
• Prediction error 1.087% of the mean API concentration.
[Figure: Samples/Scores Plot of Predicted vs. Actual For API Concentration, API Predicted vs. API Measured.]
PLS Future Modeling Strategy
• Stage Three: Predict new samples.
[Flow chart: NEW SPECTRA MEASUREMENTS → PLS CALIBRATION MODEL → PREDICTED CONCENTRATIONS]
[Figure: new spectra, Signal Intensity vs. Wavenumber (cm-1), passed through the PLS CALIBRATION MODEL; CONCENTRATION values shown: 4.75, 4.79, 4.9.]
Case Study Summary
• PCA used to explore variation within the spectra.
• Samples and variables selected for calibration.
• Scatter correction and mean centering used to preprocess data.
• PLS model built and validated using calibration and validation data.
• RMSEC and RMSEP calculated.
• Concentrations determined for new sample measurements.
Acknowledgements
