and Modeling
Dr. Tom Dearing
CPAC, University of Washington
Outline
• Fundamentals of Chemometrics
– Introduction to Chemometrics
– Measurements
– The Data Analysis Procedure
• Basic Modeling
– Principal Component Analysis
– Scores and Loadings
• Advanced Modeling
– Partial Least Squares
– Latent Variables
– Scores and Loadings
– Calibration and Validation
– Prediction
• Case Study
Section 1
• Chemometrics is:
The science of extracting information from measurements
made on chemical systems with the use of mathematical and
statistical procedures.
– Spectroscopic
– Chromatographic
– Physical
[Figure: spectrum; Signal Intensity vs. Wavenumber (cm-1)]
• Univariate
– One variable to measure
– One variable to predict
• Multivariate
– Multiple variables
– Multiple predictions
INPUT → PREPROCESSING → OUTLIER REMOVAL → DATA ANALYSIS → OUTPUT
Input
• Most overlooked stage of data analysis.
• Most critical stage of all.
• Preprocessing
[Figure: spectra; Signal Intensity vs. Wavenumber (cm-1)]
[…] information.
– Normalization
[Figure: scores plot; y-axis: Scores on PC 2 (12.88%)]
[…] analysis.
– Section 2
– Section 3
• MCR
• Neural Networks
[Figure: x-axis from 4.65 to 5.15, y-axis from 0 to 60]
Output
• Qualitative
– Classification models
– Does a sample belong to a group or not?
– Calibration and validation
– Classifications
– Classification error
– Number of samples classified correctly
• Quantitative
– Prediction models
– What is the concentration of the sample?
– Calibration and validation
– Predictions
– Calibration and prediction errors
– RMSEC and RMSEP
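As an illustration of the two error measures named above, here is a minimal sketch using the plain root mean square error. RMSEC is computed on the calibration samples and RMSEP on samples held out for validation. The function name and the concentration values are hypothetical, and some software applies a degrees-of-freedom correction to RMSEC that is not shown here.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical reference and predicted concentrations (illustrative values only).
y_cal, y_cal_pred = [4.8, 4.9, 5.0], [4.8, 4.9, 5.0]
y_val, y_val_pred = [4.7, 5.1], [4.8, 5.0]

rmsec = rmse(y_cal, y_cal_pred)  # error on the calibration samples
rmsep = rmse(y_val, y_val_pred)  # error on the held-out validation samples
```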
Error
• Many different methods of calculating errors.
• The method used is critical, as model quality is determined by the error.
• The procedure used can heavily influence model errors (discussed later in the PCA section).
PCA
[Figure: three panels. Spectra (Signal Intensity vs. Wavenumber, cm-1); scores plot (Scores on PC 1 (99.93%) vs. Scores on PC 2 (0.05%)); loadings plot (Loadings on PC 1 (99.93%) vs. Variable)]
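Scores, loadings and the percentage of variance attached to each PC, as shown in the plots above, can be obtained from a singular value decomposition. The sketch below assumes mean-centred data and uses random numbers in place of real spectra. The function name `pca` and the data dimensions are illustrative, not taken from the slides.

```python
import numpy as np

def pca(X, n_components):
    """PCA of mean-centred data via the singular value decomposition."""
    Xc = X - X.mean(axis=0)                           # mean centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # samples x components
    loadings = Vt[:n_components].T                    # variables x components
    explained = s**2 / np.sum(s**2)                   # fraction of variance per PC
    return scores, loadings, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))   # hypothetical data: 20 samples, 50 variables
scores, loadings, explained = pca(X, n_components=2)
```

Each row of `scores` locates one sample on the PC axes, and each column of `loadings` shows how strongly every variable contributes to that PC.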
[Figure: scores plots; axes: Scores on PC 1 (52.89%), Scores on PC 2 (29.86%), Scores on PC 3 (11.16%)]
[Figure: three loadings plots vs. Variable; Loadings on PC 1 (99.93%), Loadings on PC 1 (81.38%), Loadings on PC 1 (62.61%); surviving panel label: AUTO SCALING]
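Auto scaling, and the mean centring used later in the case study, can be sketched as below. This assumes samples in rows and variables in columns, and that no variable has zero standard deviation. The function names and the small data matrix are illustrative.

```python
import numpy as np

def mean_center(X):
    """Subtract the mean of each variable (column)."""
    return X - X.mean(axis=0)

def autoscale(X):
    """Mean centre each variable, then divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Hypothetical data: 3 samples, 2 variables on very different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
X_mc = mean_center(X)   # columns keep their original spread
X_as = autoscale(X)     # every column now has unit standard deviation
```

After auto scaling every variable carries equal weight, which can change which variables dominate the loadings.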
Outlier Removal
[Figure: two scores plots; y-axis: Scores on PC 2 (12.88%)]
• Calibration Equation:
y = Xb
y is the concentration data, X is the spectra and b is the regression vector produced by the model.
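One way to obtain b is a PLS1 fit with the NIPALS algorithm. The sketch below is a minimal version for a single y variable, not necessarily the implementation behind these slides. It assumes both blocks are mean centred, so predictions add the y mean back. All function names and the synthetic data are illustrative.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """Fit a PLS1 model (single y) with the NIPALS algorithm."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean         # mean centre both blocks
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)              # weight vector
        t = Xr @ w                          # scores for this latent variable
        tt = t @ t
        p = Xr.T @ t / tt                   # X loadings
        qa = yr @ t / tt                    # y loading
        Xr = Xr - np.outer(t, p)            # deflate X
        yr = yr - qa * t                    # deflate y
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)     # regression vector in y = Xb
    return b, x_mean, y_mean

def pls1_predict(X_new, b, x_mean, y_mean):
    """Predict y for new rows of X using the fitted regression vector."""
    return (X_new - x_mean) @ b + y_mean

# Synthetic, noise-free example (hypothetical data, not the tablet spectra).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta + 5.0
b, x_mean, y_mean = pls1_fit(X, y, n_lv=5)
y_hat = pls1_predict(X, b, x_mean, y_mean)
```

With as many latent variables as there are independent X variables, the PLS fit coincides with ordinary least squares. Fewer latent variables give a more constrained model.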
[Diagram labels: CONCENTRATIONS (Y-Block), PREPROCESSING, PLS MODEL, PREDICTION, CONCENTRATIONS FOR NEW MEASUREMENTS]
Difference between PLS and PCA
• PCA
– Classification
– Exploratory analysis of data.
– PCs extracted describe sources of variation in order of significance.
– Used for the removal of outliers.
• PLS
– Quantification
– Prediction
– Modeling of current and future samples.
– The number of latent variables is an important factor in determining model performance.
Calibration
• Design of Experiments
– Use optimal methods to effectively map the data
– Methods such as D-Optimal, E-Optimal and Kennard-
Stone.
– These methods only need to be run once.
• Random Subsets
– Select a set of samples entirely at random.
– Perform analysis and calculate errors.
– Re-select a new random subset and repeat procedure
for a number of iterations
– Calculate average errors at the end.
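The random subsets procedure above can be sketched as follows. Ordinary least squares is used here only as a stand-in for whatever calibration model is being assessed. The function name, its parameters and the synthetic data are illustrative.

```python
import numpy as np

def random_subset_errors(X, y, n_cal, n_iter, seed=0):
    """Average calibration and prediction RMSE over repeated random splits."""
    rng = np.random.default_rng(seed)
    rmsec, rmsep = [], []
    for _ in range(n_iter):
        idx = rng.permutation(len(y))            # new random subset each pass
        cal, val = idx[:n_cal], idx[n_cal:]
        # Stand-in model: least squares with an intercept column.
        A_cal = np.column_stack([np.ones(len(cal)), X[cal]])
        A_val = np.column_stack([np.ones(len(val)), X[val]])
        coef, *_ = np.linalg.lstsq(A_cal, y[cal], rcond=None)
        rmsec.append(np.sqrt(np.mean((A_cal @ coef - y[cal]) ** 2)))
        rmsep.append(np.sqrt(np.mean((A_val @ coef - y[val]) ** 2)))
    return float(np.mean(rmsec)), float(np.mean(rmsep))

# Hypothetical data: 40 samples, 3 variables, small measurement noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.5 + 0.05 * rng.normal(size=40)
mean_rmsec, mean_rmsep = random_subset_errors(X, y, n_cal=30, n_iter=20)
```

Averaging over many random splits reduces the dependence of the reported error on any single choice of calibration samples.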
Selecting Samples For Calibrations
• D-Optimal
• Kennard-Stone
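A minimal sketch of Kennard-Stone selection, in its usual Euclidean-distance form, is given below. It starts from the two samples furthest apart, then repeatedly adds the sample that is furthest from its nearest already-selected neighbour. The function name and test points are illustrative, and at least two samples are always selected.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of samples chosen by the Kennard-Stone algorithm."""
    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Start with the two samples that are furthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its nearest selected sample.
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the sample whose nearest selected neighbour is furthest away.
        pick = remaining[int(np.argmax(d_min))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

# Hypothetical one-dimensional example laid out along a line.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [5.0, 0.0], [4.9, 0.0]])
chosen = kennard_stone(X, n_select=3)
```

Because the selection is deterministic for a given data set, it only needs to be run once, in contrast to the random subsets approach.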
[Figure: Error vs. Number of Latent Variables, with RMSEC and RMSEP curves; annotation: Ideal Number of Latent Variables for model.]
Model Maintenance
[Figure: two panels. Spectra (Signal Intensity vs. Wavenumber, cm-1); histogram (Number of Samples vs. Tablet API Concentration)]
The Problem
[Diagram: PCA → SCORES, LOADINGS]
NIR Data – Visual Inspection
[Figure: NIR spectra; Signal Intensity vs. Wavenumber (cm-1); annotations: VARIATION IN BASELINE, DETECTOR NOISE]
Pre-processing
[Figure: two panels of Signal Intensity vs. Wavenumber (cm-1); the second panel is titled Mean Centred NIR Spectra]
[Figure: scores plot; x-axis: Scores on PC 1 (81.38%)]
[…] all be the same […] theoretically should form one group.
• However 6 clusters formed.
• Further investigation found 6 different tablet presses had been used.
Mean Centered Loadings
[Figure: Variables/Loadings Plot; loadings vs. Variable]
• Loadings on PC1 show that the […] contribute little information or noise to the scores.
• Spectra truncated at variable 400, which is 1398 cm-1
Scatter Correction
[Diagram: PCA → SCORES, LOADINGS]
Scatter Correction
[Figure: two panels, UNCORRECTED SPECTRA and SCATTER CORRECTED SPECTRA; Signal Intensity vs. Wavenumber (cm-1), 600 to 1400]
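The slides do not state which scatter correction was applied. As one common option, the sketch below uses the standard normal variate (SNV) transform, which centres and scales each spectrum individually to remove additive offsets and multiplicative scaling between spectra. Treat the choice of SNV, the function name and the toy spectra as assumptions made for illustration.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) on its own."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

# Two hypothetical spectra with the same shape but different offset and scale.
base = np.array([1.0, 2.0, 4.0, 3.0])
spectra = np.vstack([base, 2.0 * base + 3.0])
corrected = snv(spectra)   # both rows become identical after correction
```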
New Scores
[Figure: scores plot; Scores on PC 1 (88.42%) vs. Scores on PC 2 (11.24%)]
[…] preprocessing the new scores (red triangles) have formed one tight cluster showing that variation not relating to the API concentration has been removed.
What Next?
[Diagram labels: INPUT SPECTRA, CALIBRATION SPECTRA, VALIDATION SPECTRA, PREPROCESSING, PLS CALIBRATION MODEL, CONCENTRATIONS, CALIBRATION CONCENTRATION, VALIDATION CONCENTRATION]
[Figure: spectra; Signal Intensity vs. Wavenumber (cm-1), 600 to 1400]
[Figure: API Concentration vs. Sample Number]
PLS Calibration Model
[…] fitting’.
[…] concentration.
PLS Modeling Strategy
• Stage Two: Validate the calibration model.
[Diagram labels: VALIDATION SPECTRA, PREPROCESSING, PLS CALIBRATION MODEL, VARY LATENT VARIABLES, VALIDATION CONCENTRATION, PREPROCESSING, RMSEC, RMSEP]
LV Model
[Figure: Error vs. Number of Latent Variables (0 to 16), with RMSEC and RMSEP curves]
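The RMSEC and RMSEP curves in this kind of plot come from refitting the model while varying the number of latent variables. The sketch below does this with a minimal PLS1 (NIPALS) fit on synthetic data and takes the minimum of RMSEP as one simple selection rule. The data, the split, the function names and the rule itself are assumptions made for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """Minimal PLS1 (NIPALS) fit; returns regression vector and block means."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        t = Xr @ w
        tt = t @ t
        p, qa = Xr.T @ t / tt, yr @ t / tt
        Xr, yr = Xr - np.outer(t, p), yr - qa * t   # deflate both blocks
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q), x_mean, y_mean

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic data driven by 3 underlying factors plus noise (hypothetical).
rng = np.random.default_rng(0)
T = rng.normal(size=(100, 3))
X = T @ rng.normal(size=(3, 40)) + 0.05 * rng.normal(size=(100, 40))
y = T @ np.array([1.0, -0.5, 0.25]) + 0.05 * rng.normal(size=100)
X_cal, y_cal, X_val, y_val = X[:60], y[:60], X[60:], y[60:]

lv_range = range(1, 9)
rmsec, rmsep = [], []
for n_lv in lv_range:
    b, x_mean, y_mean = pls1_fit(X_cal, y_cal, n_lv)
    rmsec.append(rmse(y_cal, (X_cal - x_mean) @ b + y_mean))  # calibration error
    rmsep.append(rmse(y_val, (X_val - x_mean) @ b + y_mean))  # validation error
best_n_lv = lv_range[int(np.argmin(rmsep))]
```

RMSEC can only decrease as latent variables are added, so it cannot be used alone to choose model size. RMSEP typically stops improving once the model starts fitting noise.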
[Diagram: spectra (Signal Intensity vs. Wavenumber, cm-1) → PLS CALIBRATION MODEL → CONCENTRATION (values shown: 4.75, 4.79, 4.9)]
Case Study Summary