Fundamentals of Chemometrics and Modeling
Dr. Tom Dearing
CPAC, University of Washington
Outline
• Fundamentals of Chemometrics
– Introduction to Chemometrics
– Measurements
– The Data Analysis Procedure
• Basic Modeling
– Principal Component Analysis
– Scores and Loadings
• Advanced Modeling
– Partial Least Squares
– Latent Variables
– Scores and Loadings
– Calibration and Validation
– Prediction
• Case Study
Section 1

Through the looking glass…..


Chemometrics

• Chemometrics is:
The science of extracting information from measurements made on chemical systems with the use of mathematical and statistical procedures.

• Keywords and phrases:
data analysis, data processing, univariate, multivariate, variance, modeling, scores, loadings, calibration and validation, predictions, real time decision making.
Measurements
• Measurements come in many different forms.
– Spectroscopic
• Near IR, Fluorescence, Raman.
– Chromatographic
• Gas Chromatography, HPLC.
– Physical
• Temperature, Pressure, Flow rate, Melting Points, Viscosity, Concentrations.
• All measurements yield data.
• NIR data set containing 255 spectra measured at 650 different wavenumbers has 165750 data points!!
[Figure: Near IR Tablet Data; Signal Intensity vs. Wavenumber (cm-1)]
[Figure: Intensity (counts) vs. Wavelength (nm)]
Two Types Of Data

• Univariate
– One variable to measure
– One variable to predict
– Typically select one wavelength and monitor change of absorbance over time.
– Wavelength must not have contributions or overlapping from other peaks.
• Multivariate
– Multiple variables
– Multiple predictions
– Typically use entire spectra.
– Allows investigation into the relationship between variables.
– Allows revealing of latent variation within a set of spectra.
Multivariate Analysis

• Analysis performed on multiple sets of measurements, wavelengths, samples and data sets.

• Analysis of variance and dependence between variables is crucial to multivariate analysis.
The Chemometrics Process
• All chemometrics begins with taking a measurement and collecting data.
• Mathematical and statistical methods are employed to extract relevant information from the data.
• The information is related to the chemical process to extract knowledge about a system.
• Finally, the knowledge provided allows comprehension and understanding of a system.
• Understanding facilitates decision making.
[Figure: pyramid, bottom to top: 1. Measurement, 2. Data, 3. Information, 4. Knowledge, 5. Understanding]
Converting Data to Information

• Advances in measurement science mean the rate of data collection is extremely fast.
• Large amounts of data are produced.
• Data rich, information poor.
• Chemometrics is used to remove redundant data, reduce variation not relating to the analytical signal and build models.
Data Analysis Flow Chart

INPUT → OUTLIER REMOVAL → PREPROCESSING → DATA ANALYSIS → OUTPUT
Input
• Most overlooked stage of data analysis.
• Most critical stage of all.

• Data must be converted or transferred into the analysis software.
• Proprietary collection software can make this task difficult.
• However, some analysis software has excellent data importing functionality.
Outliers – Problems and Removal

• Removing outliers is a delicate procedure.
• Grubbs' test can be used to detect outliers.
• Frequently requires knowledge about the process being examined.
• False outliers: samples at the extremes of the system that appear infrequently within the data.
– These are NOT REMOVED
• True outliers: samples or variables that are statistically different from the other samples.
– These ARE REMOVED
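The Grubbs' test mentioned above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the slides: it computes the test statistic G = max|x - mean| / s for a single variable and compares it against a critical value that the caller must look up for the chosen sample size and significance level. The function names and the critical value used in the usage note are assumptions for illustration only.

```python
import math


def grubbs_statistic(values):
    """Return (G, index) for the most extreme value in a univariate sample.

    G is the largest absolute deviation from the mean divided by the
    sample standard deviation (n - 1 in the denominator).
    """
    n = len(values)
    if n < 3:
        raise ValueError("Grubbs' test needs at least three values")
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if s == 0:
        raise ValueError("all values are identical")
    deviations = [abs(v - mean) for v in values]
    index = max(range(n), key=deviations.__getitem__)
    return deviations[index] / s, index


def flag_suspect(values, critical_value):
    """Flag the most extreme sample if G exceeds a user-supplied critical value.

    The critical value depends on n and the significance level; it is an
    input here (hypothetical in any example), not computed by this sketch.
    """
    g, index = grubbs_statistic(values)
    return g > critical_value, index
```

For example, `flag_suspect([4.9, 5.0, 5.1, 5.0, 4.9, 5.1, 7.0], critical_value=2.0)` flags the last sample. As the slide stresses, a flagged sample is only a candidate: process knowledge decides whether it is a true outlier or a legitimate extreme of the system.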
Preprocessing
• Preprocessing
– Main goal of the preprocessing stage is to remove variation within the data that does not pertain to the analytical information.
• Typical preprocessing methods
– Baseline Correction
– Mean Centering
– Normalization
– Orthogonal Signal Correction
– Multiplicative Scatter Correction
– Savitzky-Golay Derivatisation
[Figure: Near IR Tablet Data; Signal Intensity vs. Wavenumber (cm-1)]
[Figure: Mean Centred NIR Spectra (after mean centring); Mean Centred Signal Intensity vs. Wavenumber (cm-1)]
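Mean centering, the preprocessing step illustrated on this slide, can be sketched as follows. The sketch assumes the spectra are stored as a list of rows (one spectrum per row, one variable per column); the function name is hypothetical.

```python
def mean_center(spectra):
    """Subtract the mean spectrum (column means) from every spectrum (row).

    Returns the centered data and the column means, which are needed later
    to center new measurements in the same way.
    """
    n = len(spectra)
    col_means = [sum(col) / n for col in zip(*spectra)]
    centered = [[v - m for v, m in zip(row, col_means)] for row in spectra]
    return centered, col_means
```

After centering, every variable has a mean of zero across samples, so the analysis focuses on differences between spectra rather than on the average spectral shape.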
Data Analysis
• Many different methods for performing multivariate data analysis.
• Principal Component Analysis
– Section 2
• Partial Least Squares
– Section 3
• MCR
• Neural Networks
[Figure: Scores Plot; Scores on PC 2 (12.88%) vs. Scores on PC 1 (81.38%)]
[Figure: histogram; axis labels not recoverable]
Output

• Qualitative
– Classification models.
– Does a sample belong to a group or not??
– Calibration and Validations
– Classifications
– Classification error
– Number of samples classified correctly
• Quantitative
– Prediction models
– What is the concentration of the sample??
– Calibration and Validations
– Predictions
– Calibration and Prediction Errors
– RMSEC and RMSEP
Error
• Many different methods of calculating errors.
• Method used is critical as model quality is determined by the error.
• Procedure used can heavily influence model errors. (Discussed later in PCA section).

• The choice of error metric depends on many different factors
• Top Three
– What are you showing?
– What is the range of data?
– How many samples do you have?
Summary

• Chemometrics is a method of extracting relevant information from complex chemical data.
• Multivariate data allows robust investigation of overlapping signals.
• Multivariate analysis allows investigation of the relationship between variables.
• The chemometrics process yields understanding and comprehension of the process under investigation.
Summary

• Data analysis is a multistep procedure involving many algorithms and many different paths to go down.
• The end result of data analysis is commonly a model that can provide qualitative or quantitative information.
• MATLAB and PLS_Toolbox are software packages used to perform chemometrics analysis.
Section 2

Principal Component Analysis


P.C.A.
PCA

• Method of reducing a set of data into three new sets of variables
– Principal Components (PC’s)
– Scores
– Loadings
• Using these three new sets of variables, latent variation can be revealed and examined.
• Incredibly important for investigating the relationships between samples and variables.
PCA
• NIR spectra run through a PCA routine without any
form of preprocessing.
• Scores produced show apparent variation in
concentration.
• Loadings resemble the mean spectrum, suggesting that preprocessing should be used.
[Figure: SPECTRAL DATA; Near IR Tablet Data, Signal Intensity vs. Wavenumber (cm-1)]
[Figure: SCORES; Samples/Scores Plot, Scores on PC 2 (0.05%) vs. Scores on PC 1 (99.93%)]
[Figure: LOADINGS; Variables/Loadings Plot, Loadings on PC 1 (99.93%) vs. Variable]
Principal Components

• Each principal component calculated captures as much of the variation within the data as possible.
• This variation is removed and a new principal component is determined.
• The first PC describes the greatest source of variation within the data.
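The "capture, remove, repeat" description above matches a NIPALS-style calculation, sketched here in plain Python. This is one possible way to compute PCA under stated assumptions, not necessarily what PLS_Toolbox does internally: the data are mean centered first, each component is found by power iteration, and its contribution is subtracted (deflation) before the next component is extracted. All function names are hypothetical.

```python
import math


def first_component(X, n_iter=500, tol=1e-12):
    """Return (scores t, unit-length loadings p) for the dominant direction of X."""
    n, m = len(X), len(X[0])
    # Start from the column with the largest sum of squares.
    col_ss = [sum(row[j] ** 2 for row in X) for j in range(m)]
    start = max(range(m), key=col_ss.__getitem__)
    t = [row[start] for row in X]
    for _ in range(n_iter):
        tt = sum(v * v for v in t)
        # Loadings: projection of X onto the current scores, then normalised.
        p = [sum(t[i] * X[i][j] for i in range(n)) / tt for j in range(m)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: projection of each sample onto the loadings.
        t_new = [sum(row[j] * p[j] for j in range(m)) for row in X]
        change = sum((a - b) ** 2 for a, b in zip(t, t_new))
        t = t_new
        if change < tol:
            break
    return t, p


def pca(spectra, n_components):
    """Mean center, then extract components one at a time with deflation.

    n_components must not exceed the rank of the centered data, otherwise
    there is no variation left to extract and the iteration divides by zero.
    Returns (scores, loadings, explained), one entry per component.
    """
    n = len(spectra)
    means = [sum(col) / n for col in zip(*spectra)]
    X = [[v - mu for v, mu in zip(row, means)] for row in spectra]
    total = sum(v * v for row in X for v in row)
    scores, loadings, explained = [], [], []
    for _ in range(n_components):
        t, p = first_component(X)
        # Deflation: remove the variation captured by this component.
        X = [[x - ti * pj for x, pj in zip(row, p)] for row, ti in zip(X, t)]
        scores.append(t)
        loadings.append(p)
        explained.append(sum(v * v for v in t) / total)
    return scores, loadings, explained
```

The `explained` fractions correspond to the percentages shown on the axis labels of scores and loadings plots, and they decrease from the first component onwards.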
Scores

• The scores are organized in a column fashion.
• The first column denotes the scores relating to the variation captured on PC1.
• Inter-sample relationships can be observed by plotting the scores from PC1 against PC2.
• This can be expanded to the scores of the first three PC’s.
Scores
[Figure: Samples/Scores Plot of aldat; Scores on PC 1 (52.89%) vs. Sample]
[Figure: Samples/Scores Plot of aldat; Scores on PC 2 (29.86%) vs. Sample]
[Figure: Samples/Scores Plot of aldat; Scores on PC 3 (11.16%) vs. Sample]
[Figure: Samples/Scores Plot of aldat; Scores of PC1 vs. PC2]
[Figure: Samples/Scores Plot of aldat; Scores of PC1 vs. PC3]
[Figure: Samples/Scores Plot of aldat; Scores of PC1 vs. PC2 vs. PC3]
Loadings

• Illustrate the weight or importance of each variable within the original data.
• From loadings it is possible to see the most significant variables.
• Loadings can be used to track the progress of a reaction e.g. monitor reactant consumption.
• Deduce variables responsible for the clustering in the scores.
Loadings
[Figure: Variables/Loadings Plot, NO PREPROCESSING; Loadings on PC 1 (99.93%) vs. Variable]
[Figure: Variables/Loadings Plot, MEAN CENTRING; Loadings on PC 1 (81.38%) vs. Variable]
[Figure: Variables/Loadings Plot, AUTO SCALING; Loadings on PC 1 (62.61%) vs. Variable]
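Auto scaling, the third preprocessing option compared in these loadings plots, is commonly defined as mean centering followed by dividing each variable by its standard deviation. A minimal sketch under that assumption (the function name is hypothetical, and the sample standard deviation with n - 1 is an assumed convention):

```python
import math


def autoscale(spectra):
    """Mean center each variable (column) and scale it to unit standard deviation."""
    n = len(spectra)
    columns = list(zip(*spectra))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    if any(s == 0 for s in stds):
        raise ValueError("a constant variable cannot be auto scaled")
    return [
        [(v - m) / s for v, m, s in zip(row, means, stds)]
        for row in spectra
    ]
```

Because every variable ends up with the same variance, auto scaling gives low-intensity regions (including noisy ones) the same weight as strong peaks, which is one reason the loadings differ so much between the three panels.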
Outlier Removal

• PCA can be used in conjunction with confidence intervals to identify outliers within a set of data.
[Figure: Samples/Scores Plot, 95% Confidence Interval; Scores on PC 2 (12.88%) vs. Scores on PC 1 (81.38%)]
[Figure: Samples/Scores Plot, 99.9% Confidence Interval; Scores on PC 2 (12.88%) vs. Scores on PC 1 (81.38%)]
Summary

• PCA is used to decompose the data into scores and loadings.
• Scores reveal information about between-sample variation.
• Loadings tell us which variables from within the original data contribute most to the scores.
• PCA can also be used to analyze and investigate data to perform tasks such as outlier removal.
• PCA facilitates process understanding.
Section 3

Partial Least Squares


Inverse Calibration

• Calibration Equation:
y = Xb
y is concentration data, X is spectra and b is the produced model.

• Calibration requires only spectra and a calibration property, such as a concentration.
• Demanding strategy, as assumptions are made about the errors.
• Requires good lab data.
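As a worked illustration of the calibration equation, the ordinary least-squares solution can be written out. This is a sketch assuming a plain multiple linear regression estimate; the residual term e is a symbol added here and does not appear on the slide.

```latex
\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e},
\qquad
\hat{\mathbf{b}} = \left(\mathbf{X}^{\mathsf{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}
```

The inverse only exists when the columns of X are linearly independent. Spectra usually have more variables than samples and strongly correlated neighbouring variables, so this direct estimate is generally not usable, which is one motivation for latent-variable methods such as PLS.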
PLS

• Partial Least Squares (PLS) is an extension of the PCA method.
• PCA extracts PC’s describing the sources of variation within the data.
• PLS extracts Latent Variables (LV’s): directions of variation in the data that are correlated with the Y-Block information.
• Y-Block information is typically sample concentrations or physical properties.
• PLS is a quantitative procedure and can be used to model and predict Y-Block information for future samples.
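To make the idea of latent variables concrete, here is a minimal sketch of PLS for a single predicted property (often called PLS1), using the NIPALS algorithm in plain Python. It is an illustration under stated assumptions, not the PLS_Toolbox implementation: both blocks are mean centered, no other preprocessing is applied, and the function names and model dictionary layout are hypothetical.

```python
import math


def pls1_fit(X, y, n_lv):
    """Fit a PLS1 model with n_lv latent variables to spectra X and property y."""
    n, m = len(X), len(X[0])
    x_mean = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    E = [[v - mu for v, mu in zip(row, x_mean)] for row in X]  # centered X-Block
    f = [v - y_mean for v in y]                                # centered Y-Block
    W, P, Q = [], [], []
    for _ in range(n_lv):
        # Weights: direction in X with maximum covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # nothing left in y that X can explain
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]  # X loadings
        q = sum(f[i] * t[i] for i in range(n)) / tt                         # y loading
        # Deflate both blocks before extracting the next latent variable.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        W.append(w)
        P.append(p)
        Q.append(q)
    return {"x_mean": x_mean, "y_mean": y_mean, "W": W, "P": P, "Q": Q}


def pls1_predict(model, x):
    """Predict the property for one new spectrum x."""
    e = [v - mu for v, mu in zip(x, model["x_mean"])]
    y_hat = model["y_mean"]
    for w, p, q in zip(model["W"], model["P"], model["Q"]):
        t = sum(ev * wv for ev, wv in zip(e, w))
        y_hat += q * t
        e = [ev - t * pv for ev, pv in zip(e, p)]
    return y_hat
```

The only difference from PCA in the extraction step is the weight vector: it is chosen using the Y-Block, so each latent variable describes variation in the spectra that is relevant to the property being predicted.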
The X- and Y-Block

• PLS uses X-Block and Y-Block information.
• X-Block tends to refer to spectra.
• Y-Block relates to the information you want to predict, such as concentration or some physical property.
• Y-Block data is normally collected offline in a lab.
• The method used to obtain the Y-Block data is often referred to as the reference method.
PLS Data Analysis

INPUT SPECTRA (X-Block) → PREPROCESSING → PLS CALIBRATION MODEL
CONCENTRATIONS (Y-Block) → PREPROCESSING → PLS CALIBRATION MODEL
NEW MEASUREMENT DATA → PLS PREDICTION MODEL → CONCENTRATIONS FOR NEW MEASUREMENTS
Difference between PLS and PCA

• PCA
– Classification
– Exploratory analysis of data.
– PC’s extracted describe sources of variation in order of significance.
– Used for the removal of outliers
• PLS
– Quantification
– Prediction
– Modeling of current and future samples.
– Latent variables important factor in determining model performance.
Calibration

• Building a calibration model requires retaining as much relevant variation as possible,
• whilst removing as much irrelevant variation as possible.
• Selecting calibration data is VITAL to final predictions.
• Use Design of Experiments (DoE) to effectively map a data space or series of experiments.
• Quality of calibration is determined by calculating the Root Mean Square Error in Calibration (RMSEC)
Selecting Samples For Calibrations

• Design of Experiments
– Use optimal methods to effectively map the data
– Methods such as D-Optimal, E-Optimal and Kennard-Stone.
– These methods only need to be run once.
• Random Subsets
– Select a set of samples entirely at random.
– Perform analysis and calculate errors.
– Re-select a new random subset and repeat procedure
for a number of iterations
– Calculate average errors at the end.
Selecting Samples For Calibrations

• Visual depiction of data
[Figure: DATA SET]
Selecting Samples For Calibrations

• D-Optimal
• Samples selected according to D-Optimal criteria.
Selecting Samples For Calibrations

• Kennard-Stone
• Samples selected in an attempt to uniformly map the data.
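The Kennard-Stone selection can be sketched as follows. This is a minimal version of the usual max-min formulation, assuming Euclidean distance between samples (for spectra, the distance is often computed on scores rather than raw variables); the function name is hypothetical.

```python
import math


def kennard_stone(samples, n_select):
    """Return indices of n_select samples chosen to cover the data space evenly.

    Starts with the two samples that are furthest apart, then repeatedly adds
    the sample whose nearest already-selected neighbour is furthest away.
    """
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    first, second = max(pairs, key=lambda ij: dist[ij[0]][ij[1]])
    selected = [first, second]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select:
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected
```

The selected indices form the calibration set; the samples left over are natural candidates for a validation set. The result is deterministic, which is why methods of this kind only need to be run once.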
Validation

• Validation data is used to check the predictive performance of the model.
• Validation can be performed using subsets of the calibration data (Cross Validation).
• Separate validation sets of data can be collected (True Validation).
• Cross validation can lead to overly optimistic results.
• Quality of validation is calculated using the Root Mean Square Error in Prediction (RMSEP).
• Quality of predictions determines quality of model.
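RMSEC and RMSEP share the same basic formula and differ only in which samples are used: calibration samples for RMSEC, independent validation samples for RMSEP. A minimal sketch, assuming the plain root-mean-square definition without any degrees-of-freedom correction (some software applies one to RMSEC); the function name is hypothetical.

```python
import math


def rmse(reference, predicted):
    """Root mean square error between reference values and model predictions."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

Calling `rmse` on the calibration concentrations and their fitted values gives RMSEC; calling it on validation concentrations and their predictions gives RMSEP. Dividing either by the mean reference value expresses the error as a fraction of the mean concentration.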
Modeling

• The quality of calibrations and validations can vary significantly with the number of LV’s included in the model.
• Too few and the model will make poor predictions, as there is insufficient information in the calibration.
• Too many and the model becomes over-fitted: it describes too much of the variation in the calibration data and is not robust to small amounts of variation in new data.
Modeling

[Figure: Error vs. Number of Latent Variables, showing the RMSEC and RMSEP curves and marking the ideal number of latent variables for the model]
Model Maintenance

• We’ve built the model: so what next?
• Collect lab data weekly to re-validate the model.
– Are model results within acceptable error?
– If not, what do we do?
• Re-evaluate calibration samples
– Is the calibration model still relevant?
• Perform DoE to re-select more data.
• Check LV model to make sure an appropriate number of LV’s is being used.
• Continual improvement.
Summary

• PLS implements inverse calibration to incorporate concentration information into a model.
• Makes quantitative predictions of unseen samples.
• Requires calibration and validation.
• Latent variables have significant effect on model.
• Quality of model determined by prediction and the RMSEP.
Case Study

Model Building From Beginning to End
Case Study 1
• Near IR spectra of tablets collected over a
period of 4 years.
• GC analysis of tablets showed active
pharmaceutical ingredient within specification
for all samples.
[Figure: NIR tablet spectra; Signal Intensity vs. Wavenumber (cm-1)]
[Figure: histogram; Number of Samples vs. Tablet API Concentration]
The Problem

• The NIR calibration model produced has determined that 32% of samples are out of specification.

• The Plan: Use PCA to investigate and examine the spectra to improve the NIR calibration.
Data Analysis Plan

NIR TABLET DATA → OUTLIER REMOVAL → MEAN CENTRING → PCA → SCORES and LOADINGS
NIR Data – Visual Inspection
[Figure: NIR spectra; Signal Intensity vs. Wavenumber (cm-1), annotated with VARIATION IN BASELINE and DETECTOR NOISE]
Pre-processing
[Figure: NIR spectra; Signal Intensity vs. Wavenumber (cm-1)]
[Figure: Mean Centred NIR Spectra; Mean Centred Signal Intensity vs. Wavenumber (cm-1)]
• Data mean centered to reduce the magnitude of some variables.
• After mean centering, a large peak is visible between 1350 cm-1 and 1700 cm-1.
Mean Centered Scores
• Strange distribution of scores.
• Samples that should all be the same should theoretically form one group.
• However, 6 clusters formed.
• Further investigation found 6 different tablet presses had been used.
[Figure: Samples/Scores Plot; Scores on PC 2 (12.88%) vs. Scores on PC 1 (81.38%)]
Mean Centered Loadings

• Loadings on PC1 show that the variables after 400 contribute little information, or only noise, to the scores.
• Spectra truncated at variable 400, which is 1398 cm-1.
[Figure: Variables/Loadings Plot; Loadings on PC 1 (81.38%) vs. Variable]
Scatter Correction

• Investigation into the manufacturing procedure revealed that the tablets were made using different presses.
• This caused minor variations in the tablet depth.
• This altered the pathlength and scattering of the NIR radiation.
• Preprocessing must be applied to minimize the variation in the data due to the change in tablet depth.
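Multiplicative Scatter Correction, the preprocessing chosen in the next step of the case study, can be sketched as follows. This is a minimal version of the common formulation, assuming each spectrum is modeled as an offset plus a scaled copy of a reference spectrum (the mean spectrum by default); the function name is hypothetical.

```python
def msc(spectra, reference=None):
    """Multiplicative Scatter Correction.

    Each spectrum is regressed on the reference (spectrum ~ a + b * reference)
    and corrected as (spectrum - a) / b. Returns the corrected spectra and the
    reference, which must be reused when correcting new measurements.
    """
    n = len(spectra)
    if reference is None:
        reference = [sum(col) / n for col in zip(*spectra)]
    m = len(reference)
    ref_mean = sum(reference) / m
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)
    if ref_ss == 0:
        raise ValueError("reference spectrum is constant")
    corrected = []
    for spec in spectra:
        spec_mean = sum(spec) / m
        # Least-squares slope and intercept of this spectrum against the reference.
        b = sum((r - ref_mean) * (s - spec_mean) for r, s in zip(reference, spec)) / ref_ss
        a = spec_mean - b * ref_mean
        corrected.append([(s - a) / b for s in spec])
    return corrected, reference
```

Offsets and scale differences of the kind produced by pathlength and scattering changes are absorbed into a and b, so spectra that differ only in those respects collapse onto the reference after correction.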
Data Analysis Plan 2
NIR TABLET DATA → VARIABLE REMOVAL → MULTIPLICATIVE SCATTER CORRECTION → MEAN CENTRING → PCA → SCORES and LOADINGS
Scatter Correction
[Figure: UNCORRECTED SPECTRA; Signal Intensity vs. Wavenumber (cm-1)]
[Figure: SCATTER CORRECTED SPECTRA; Signal Intensity vs. Wavenumber (cm-1)]
New Scores

• After performing the new stages of preprocessing the new scores (red triangles) have formed one tight cluster showing that variation not relating to the API concentration has been removed.
[Figure: Samples/Scores of Original and Scatter Corrected Data; Scores on PC 2 (11.24%) vs. Scores on PC 1 (88.42%)]
What Next?

Partial Least Squares


PLS Modeling Strategy
• Stage One: Build calibration model
INPUT SPECTRA → split into CALIBRATION SPECTRA and VALIDATION SPECTRA
CALIBRATION SPECTRA → PREPROCESSING → PLS CALIBRATION MODEL
CONCENTRATIONS → split into CALIBRATION CONCENTRATION and VALIDATION CONCENTRATION
CALIBRATION CONCENTRATION → PREPROCESSING → PLS CALIBRATION MODEL
[Figure: input spectra; Signal Intensity vs. Wavenumber (cm-1)]
[Figure: concentrations; API Concentration vs. Sample Number]
PLS Calibration Model

• Large number of LV’s used to produce the best calibration model.
• Too many LV’s can cause ‘over-fitting’.
• RMSEC = 0.03539
• Error of 0.723% of the mean API concentration.
[Figure: Samples/Scores Plot; API Calibrated vs. API Measured]
PLS Modeling Strategy
• Stage Two: Validate calibration model.
VALIDATION SPECTRA → PREPROCESSING → PLS CALIBRATION MODEL
VALIDATION CONCENTRATION → PREPROCESSING → PLS CALIBRATION MODEL
PLS CALIBRATION MODEL → VARY LATENT VARIABLES → RMSEC and RMSEP
LV Model
[Figure: Error (RMSEC and RMSEP) vs. Number of Latent Variables]
• Varying the number of LV’s used in the model led to the conclusion that 7 LV’s will give the best predictions.
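Choosing the number of latent variables from a curve like this can be sketched as a simple search. The sketch assumes the selection criterion is the lowest RMSEP, with ties resolved in favour of the simpler model; the slide does not state the exact rule that was applied, and the function name is hypothetical.

```python
def best_number_of_lvs(rmsep_by_lv):
    """Return the LV count with the lowest RMSEP.

    rmsep_by_lv maps the number of latent variables to the RMSEP obtained
    with that many. On a tie, the smaller number of LV's is returned.
    """
    if not rmsep_by_lv:
        raise ValueError("no RMSEP values supplied")
    return min(sorted(rmsep_by_lv), key=rmsep_by_lv.__getitem__)
```

With hypothetical values such as `{1: 0.060, 2: 0.055, 3: 0.052, 4: 0.054}` the function returns 3. In practice the curve is also inspected by eye, since a slightly higher RMSEP with fewer LV's may give a more robust model.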
PLS Validation Model

• Using 7 LV’s, the validation data was applied to the calibration model to determine the RMSEP.
• Sacrifice calibration to ensure better predictions
• RMSEC = 0.050381
• RMSEP = 0.053719
• Prediction error 1.087% of the mean API concentration.
[Figure: Samples/Scores Plot of Predicted vs. Actual For API Concentration; API Predicted vs. API Measured]
PLS Future Modeling Strategy
• Stage Three: Predict new samples.
NEW SPECTRA MEASUREMENTS → PLS CALIBRATION MODEL → PREDICTED CONCENTRATIONS
[Figure: new spectra (Signal Intensity vs. Wavenumber (cm-1)) passed through the PLS calibration model, giving example concentrations of 4.75, 4.79 and 4.9]
Case Study Summary

• PCA used to explore variation within the spectra.
• Samples and variables selected for calibration.
• Scatter correction and mean centering used to preprocess data.
• PLS model built and validated using calibration and validation data.
• RMSEC and RMSEP calculated.
• Concentrations determined for new sample measurements.
Acknowledgements
