The Unscrambler® X v10.3
User Manual

Version 1.0

CAMO SOFTWARE AS
Nedre Vollgate 8, N-0158, Oslo, NORWAY
Tel: (47) 223 963 00
Fax: (47) 223 963 22
E-mail: info@camo.com | www.camo.com

Copyright

All intellectual property rights in this work belong to CAMO Software AS. The information contained in this work
must not be reproduced or distributed to others in any form or by any means, electronic or mechanical, for any
purpose, without the express written permission of CAMO Software AS. This document is provided on the
understanding that its use will be confined to the officers of the organization (whose name is stated on the front
cover of this document) who acquired it and that no part of its contents will be disclosed to third parties without
prior written consent of CAMO Software AS.
Copyright © 2014 CAMO Software AS. All Rights Reserved.
All other trademarks and copyrights mentioned in the document are acknowledged and belong to their respective
owners.

Disclaimer

This document has been reviewed and quality assured for accuracy of content. Succeeding versions of this
document are subject to change without notice and will reflect changes made in subsequent software versions.
It is the sole responsibility of the organization using this document to ensure that all tests meet the criteria
specified in the test scripts. CAMO Software takes no responsibility for the end use of the product, as this
requires suitable feasibility trials and performance qualification to ensure the software is fit for its intended use.

Table of Contents
1. Welcome to The Unscrambler® X

2. Support Resources
    2.1. Support resources on our website

3. Overview
    3.1. What is The Unscrambler® X?
        3.1.1 Multivariate analysis simplified
        3.1.2 Make well-designed experimental plans
        3.1.3 Reformat, transform and plot data
        3.1.4 Study variations among one group of variables
        3.1.5 Study relations between two groups of variables
        3.1.6 Validate multivariate models with uncertainty testing
        3.1.7 Estimate new, unknown response values
        3.1.8 Classify unknown samples
        3.1.9 Reveal groups of samples
    3.2. Principles of classification
        3.2.1 Purposes of classification
        3.2.2 Classification methods
        3.2.3 Steps in SIMCA classification
        3.2.4 Classifying new samples
        3.2.5 Outcomes of a classification
        3.2.6 Classification based on a regression model
    3.3. How to use help
        3.3.1 How to open the help documentation
        3.3.2 Browsing the contents
        3.3.3 Searching the contents
        3.3.4 Typographic cues
    3.4. Principles of regression
        3.4.1 What is regression?
        3.4.2 Multiple Linear Regression (MLR)
        3.4.3 Principal Component Regression (PCR)
        3.4.4 Partial Least Squares Regression (PLSR)
        3.4.5 L-PLS Regression
        3.4.6 Support Vector Machine Regression (SVMR)
        3.4.7 Calibration, validation and related samples
        3.4.8 Main results of regression
        3.4.9 Making the right choice with regression methods
        3.4.10 How to interpret regression results
        3.4.11 Guidelines for calibration of spectroscopic data
    3.5. Demonstration video

4. Application Framework
    4.1. User interface basics
    4.2. Getting to know the user interface
        4.2.1 Application window
        4.2.2 Workspace
        4.2.3 Project navigator
        4.2.4 Project information
        4.2.5 Page tab bar
        4.2.6 The menu bar
        4.2.7 The toolbar
        4.2.8 The status bar
        4.2.9 Dialogs
        4.2.10 Setting up the user environment
        4.2.11 Getting help
    4.3. Matrix editor basics
        4.3.1 What is a matrix?
        4.3.2 Adding data matrices
        4.3.3 Altering data tables
        4.3.4 Using ranges
        4.3.5 Data types
        4.3.6 Keeping versions of data
        4.3.7 Saving data
    4.4. Using the project navigator
        4.4.1 About the project navigator
        4.4.2 Create a project
        4.4.3 Items in a project
        4.4.4 Browse a project
        4.4.5 Managing items in a project
    4.5. Register pretreatment
    4.6. Save model for prediction, classification
    4.7. Set Alarms
        4.7.1 Prediction
        4.7.2 Classification
        4.7.3 Projection
        4.7.4 Input
    4.8. Set Components
    4.9. Set Bias and Slope
        4.9.1 Algorithm
        4.9.2 Menu option
        4.9.3 Usage
    4.10. Login
        4.10.1 Non-Compliance mode
        4.10.2 Compliance mode
    4.11. File
        4.11.1 File menu
        4.11.2 File – Print…
    4.12. Edit
        4.12.1 Edit menu
        4.12.2 Edit – Change data type – Category…
        4.12.3 Edit – Category Property…
        4.12.4 Edit – Fill
        4.12.5 Edit – Find and Replace
        4.12.6 Edit – Go To…
        4.12.7 Edit – Insert – Category Variable…
        4.12.8 Edit – Define Range…
        4.12.9 Edit – Reverse…
        4.12.10 Edit – Group rows…
        4.12.11 Edit – Sample grouping…
        4.12.12 Scalar and Vector
        4.12.13 Split Text Variable
    4.13. View
        4.13.1 View menu
    4.14. Insert
        4.14.1 Insert menu
        4.14.2 Insert – Duplicate Matrix…
        4.14.3 Insert – Data Matrix…
        4.14.4 Insert – Custom Layout…
        4.14.5 Insert – Data Compiler…
    4.15. Plot
        4.15.1 Plot menu
    4.16. Tasks
        4.16.1 Tasks menu
    4.17. Tools
        4.17.1 Tools menu
        4.17.2 Tools – Audit Trail…
        4.17.3 Tools – Matrix Calculator…
        4.17.4 Tools – Options…
        4.17.5 Tools – Report…
    4.18. Help
        4.18.1 Help menu
        4.18.2 Help – Modify License…
        4.18.3 Help – User Setup…

5. Import
    5.1. Importing data
        5.1.1 Supported data formats
        5.1.2 How to import data
    5.2. ASCII
        5.2.1 ASCII (CSV, text)
        5.2.2 About ASCII, CSV and tabular text files
        5.2.3 File – Import Data – ASCII…
    5.3. BRIMROSE
        5.3.1 Brimrose
        5.3.2 About Brimrose data files
        5.3.3 File – Import Data – Brimrose…
    5.4. Bruker
        5.4.1 OPUS from Bruker
        5.4.2 About Bruker (OPUS) instrument files
        5.4.3 File – Import Data – OPUS…
    5.5. DataBase
        5.5.1 Databases
        5.5.2 About supported database interfaces
        5.5.3 File – Import Data – Database…
    5.6. DeltaNu
        5.6.1 DeltaNu
        5.6.2 About DeltaNu data files
        5.6.3 File – Import Data – DeltaNu…
    5.7. Excel
        5.7.1 Microsoft Excel spreadsheets
        5.7.2 About Microsoft Excel spreadsheets
        5.7.3 File – Import Data – Excel…
    5.8. GRAMS
        5.8.1 GRAMS from Thermo Scientific
        5.8.2 About the GRAMS data format
        5.8.3 File – Import Data – GRAMS…
    5.9. GuidedWave
        5.9.1 CLASS-PA & SpectrOn from Guided Wave
        5.9.2 About Guided Wave CLASS-PA & SpectrOn data files
        5.9.3 File – Import Data – CLASS-PA & SpectrOn…
    5.10. Import Interpolate
        5.10.1 Interpolate functionality
    5.11. Indico
        5.11.1 Indico
        5.11.2 About ASD Inc. Indico data files
        5.11.3 File – Import Data – Indico…
    5.12. JcampDX
        5.12.1 JCAMP-DX
        5.12.2 About the JCAMP-DX file format
        5.12.3 File – Import Data – JCAMP-DX…
        5.12.4 JCAMP-DX file format reference
    5.13. Konica_Minolta
        5.13.1 Konica_Minolta
        5.13.2 About Konica_Minolta data files
        5.13.3 File – Import Data – Konica_Minolta…
    5.14. Matlab
        5.14.1 Matlab
        5.14.2 About Matlab data files
        5.14.3 File – Import Data – Matlab…
    5.15. MyInstrument
        5.15.1 MyInstrument
        5.15.2 About the MyInstrument standard
        5.15.3 File – Import Data – MyInstrument…
    5.16. NetCDF
        5.16.1 NetCDF
        5.16.2 About the NetCDF file format
        5.16.3 File – Import Data – NetCDF…
    5.17. NSAS
        5.17.1 NSAS
        5.17.2 About the NSAS file format
        5.17.3 File – Import Data – NSAS…
        5.17.4 NSAS file format reference
    5.18. Omnic
        5.18.1 OMNIC
        5.18.2 About Thermo OMNIC data files
        5.18.3 File – Import Data – OMNIC…
    5.19. OPC
        5.19.1 OPC protocol
        5.19.2 About the OPC protocol
        5.19.3 File – Import Data – OPC…
    5.20. OSISoftPI
        5.20.1 PI
        5.20.2 About supported interfaces
        5.20.3 File – Import Data – PI…
    5.21. PerkinElmer
        5.21.1 PerkinElmer
        5.21.2 About PerkinElmer instrument files
        5.21.3 File – Import Data – PerkinElmer…
    5.22. PertenDX
        5.22.1 Perten-DX
        5.22.2 About the Perten Instruments JCAMP-DX file format
        5.22.3 File – Import Data – Perten-DX…
        5.22.4 Perten-DX file format reference
    5.23. RapID
        5.23.1 RapID
        5.23.2 About RapID data files
        5.23.3 File – Import Data – rap-ID…
    5.24. U5Data
        5.24.1 U5 Data
        5.24.2 About Unscrambler® 5.0 data files
        5.24.3 File – Import Data – U5 Data…
    5.25. UnscFileReader
        5.25.1 The Unscrambler® 9.8
        5.25.2 About The Unscrambler® 9.8 file formats
        5.25.3 File – Import Data – Unscrambler…
        5.25.4 The Unscrambler® 9.x file format reference
    5.26. UnscramblerX
        5.26.1 The Unscrambler® X
        5.26.2 About The Unscrambler® X file format
        5.26.3 File – Import Data – Unscrambler X…
    5.27. Varian
        5.27.1 Varian
        5.27.2 About Varian data files
        5.27.3 File – Import Data – Varian…
    5.28. VisioTec
        5.28.1 VisioTec
        5.28.2 About VisioTec data files
        5.28.3 File – Import Data – VisioTec…

6. Export
    6.1. Exporting data
        6.1.1 Supported data formats
        6.1.2 How to export data
    6.2. AMO
        6.2.1 Export models to ASCII
        6.2.2 About the ASCII-MOD file format
        6.2.3 File – Export – ASCII-MOD…
        6.2.4 ASCII-MOD file format reference
    6.3. ASCII
        6.3.1 ASCII export
        6.3.2 File – Export – ASCII…
    6.4. DeltaNu
        6.4.1 DeltaNu
        6.4.2 File – Export – DeltaNu…
    6.5. JCampDX
        6.5.1 JCAMP-DX export
        6.5.2 File – Export – JCAMP-DX…
    6.6. Matlab
        6.6.1 Matlab export
        6.6.2 File – Export – Matlab…
    6.7. NetCDF
        6.7.1 NetCDF export
        6.7.2 File – Export – NetCDF…
    6.8. UnscFileWriter
        6.8.1 Export models to The Unscrambler® v9.8
        6.8.2 About The Unscrambler® file format
        6.8.3 File – Export – Unscrambler…

7. Plots
    7.1. Line plot
    7.2. Bar plot
    7.3. Scatter plot
    7.4. 3-D scatter plot
    7.5. Matrix plot
    7.6. Histogram plot
    7.7. Normal probability plot
    7.8. Multiple scatter plot
    7.9. Tabular summary plots
    7.10. Special plots
    7.11. Plotting results from several matrices
        7.11.1 Why is it useful?
        7.11.2 How to do it?
    7.12. Annotating plots
    7.13. Create Range Menu
    7.14. Plotting: The smart way to display numbers
        7.14.1 Various plots
        7.14.2 Customizing plots
        7.14.3 Actions on a plot
        7.14.4 Plots in analysis
    7.15. Kennard-Stone (KS) Sample Selection
    7.16. Marking
        7.16.1 How to mark samples/variables
        7.16.2 How to create a new range of samples or variables from the marked items
        7.16.3 Recalculate with modifications on marked samples and/or variables
    7.17. Point details
    7.18. Formatting of plots
    7.19. Formatting of 3D plots
    7.20. Plot – Response Surface…
    7.21. Saving and copying a plot
        7.21.1 Saving a plot
        7.21.2 Copying plots
    7.22. Scope: Select plot range
    7.23. Edit – Select Evenly Distributed Samples
    7.24. Zooming and Rescaling
        7.24.1 General options
        7.24.2 Special options
        7.24.3 Resize plots

8. Design of Experiments
    8.1. Experimental design
    8.2. Introduction to Design of Experiments (DoE)
        8.2.1 DoE basics
        8.2.2 Investigation stages and design objectives
        8.2.3 Available designs in The Unscrambler®
        8.2.4 Types of variables in experimental design
        8.2.5 Designs for unconstrained screening situations
        8.2.6 Designs for unconstrained optimization situations
        8.2.7 Designs for constrained situations
        8.2.8 Types of samples in experimental design
        8.2.9 Sample order in a design
        8.2.10 Blocking
        8.2.11 Extending a design
        8.2.12 Building an efficient experimental strategy
        8.2.13 Analyze results from designed experiments
        8.2.14 Advanced topics for unconstrained situations
        8.2.15 Advanced topics for constrained situations
    8.3. Insert – Create design…
        8.3.1 General buttons
        8.3.2 Start
        8.3.3 Define Variables
        8.3.4 Choose the Design
        8.3.5 Design Details
        8.3.6 Additional Experiments
        8.3.7 Randomization
        8.3.8 Summary
        8.3.9 Design Table
    8.4. Tools – Modify/Extend Design…
        8.4.1 To remember
    8.5. Tasks – Analyze – Analyze Design Matrix…
        8.5.1 Order of the runs
        8.5.2 Level values
    8.6. DoE analysis
    8.7. Analysis results
    8.8. Interpreting design analysis plots
        8.8.1 Accessing plots
        8.8.2 Available plots for Classical DoE Analysis (Scheffe and MLR)
        8.8.3 Available plots for Partial Least Squares Regression (DoE PLS)
    8.9. DOE method reference
    8.10. Bibliography

9. Validation
    9.1. Validation
    9.2. Introduction to validation
        9.2.1 Principles of model validation
        9.2.2 What is validation?
        9.2.3 Validation results
        9.2.4 When to use which validation method
        9.2.5 Uncertainty testing with cross validation
        9.2.6 More details about the uncertainty test
        9.2.7 Model validation check list
    9.3. Validation tab
        9.3.1 Analysis and validation procedures
        9.3.2 Validation methods
        9.3.3 How to display validation results
        9.3.4 How to display uncertainty test results
    9.4. Validation tab – Cross validation setup…

10. Transform
    10.1. Transformations
    10.2. Baseline Correction
        10.2.1 Baseline correction
        10.2.2 About baseline corrections
        10.2.3 Tasks – Transform – Baseline
    10.3. Center and Scale
        10.3.1 Center_and_scale
        10.3.2 About centering
        10.3.3 Tasks – Transform – Center and Scale
    10.4. Compute General
        10.4.1 Compute general
        10.4.2 About compute general
        10.4.3 Tasks – Transform – Compute_General…
    10.5. COW
        10.5.1 Correlation Optimized Warping (COW)
        10.5.2 About correlation optimized warping
        10.5.3 Tasks – Transform – Correlation Optimized Warping…
    10.6. Deresolv
        10.6.1 Deresolve
        10.6.2 About deresolve
        10.6.3 Tasks – Transform – Deresolve
    10.7. Derivatives
        10.7.1 Derivatives
        10.7.2 About derivative methods and applications
        10.7.3 Gap Derivatives
        10.7.4 Gap Segment
        10.7.5 Savitzky Golay
    10.8. Detrend
        10.8.1 Detrending
        10.8.2 About detrending
        10.8.3 Tasks – Transform – Detrending
    10.9. EMSC
        10.9.1 MSC/EMSC
        10.9.2 About multiplicative scatter correction
        10.9.3 Tasks – Transform – MSC/EMSC
    10.10. Interaction and Square Effects
        10.10.1 Interaction_and_Square_Effects
        10.10.2 About interactions and square effects
        10.10.3 Tasks – Transform – Interactions and Square Effects
    10.11. Interpolate
        10.11.1 Interpolation
        10.11.2 About interpolation
        10.11.3 Tasks – Transform – Interpolate
    10.12. Missing Value Imputation
        10.12.1 Fill missing values
10.12.2 About fill missing values ...................................................................................................... 455


10.12.3 Tasks – Transform – Fill Missing… ........................................................................................ 456
10.13. Noise ............................................................................................................ 457
10.13.1 Noise .................................................................................................................................... 457
10.13.2 About adding noise .............................................................................................................. 457
10.13.3 Tasks – Transform – Noise ................................................................................................... 457
10.14. Normalize ..................................................................................................... 459
10.14.1 Normalization ...................................................................................................................... 459
10.14.2 About normalization ............................................................................................................ 460
10.14.3 Tasks – Transform – Normalize ............................................................................................ 462
10.15. OSC ............................................................................................................... 466
10.15.1 Orthogonal Signal Correction (OSC) ..................................................................................... 466
10.15.2 About Orthogonal Signal Correction (OSC) .......................................................................... 466
10.15.3 Tasks – Transform – OSC… ................................................................................................... 467
10.16. Quantile Normalize ...................................................................................... 470
10.16.1 Quantile Normalization ........................................................................................................ 470
10.16.2 About quantile normalization .............................................................................................. 470
10.16.3 Tasks – Transform – Quantile Normalize ............................................................................ 471
10.17. Reduce Average ........................................................................................... 472
10.17.1 Reduce (Average) ................................................................................................................. 472
10.17.2 About averaging ................................................................................................................... 473
10.17.3 Tasks – Transform – Reduce (Average)… ............................................................................. 473
10.18. Smoothing .................................................................................................... 474
10.18.1 Smoothing methods ............................................................................................................. 474
10.18.2 Comparison of moving average and Gaussian filters ........................................................... 474
10.18.3 Gaussian Filter ..................................................................................................................... 475
10.18.4 Median Filter........................................................................................................................ 476
10.18.5 Moving Average ................................................................................................................... 478
10.18.6 Robust LOWESS.................................................................................................................... 479
10.18.7 Savitzky Golay ...................................................................................................................... 481
10.19. Spectroscopic Transformations ................................................................... 483
10.19.1 Spectroscopic transformations ............................................................................................ 483
10.19.2 About spectroscopic transformations.................................................................................. 484
10.19.3 Tasks – Transform – Spectroscopic… ................................................................................... 484
10.20. Standard Normal Variate ............................................................................. 486
10.20.1 Standard Normal Variate (SNV) ......................................................................................... 486
10.20.2 About Standard Normal Variate (SNV) .............................................................................. 487
10.20.3 Tasks – Transform – SNV ...................................................................................................... 487
10.21. Transpose..................................................................................................... 488
10.21.1 Transposition ....................................................................................................................... 488
10.21.2 Tasks – Transform – Transpose ............................................................................................ 488
10.22. Weighted Direct Standardization ................................................................ 489
10.22.1 Weighted Direct Standardization (WDS) ........................................................................... 489
10.22.2 About Weighted Direct Standardization ............................................................................ 489
10.22.3 Tasks – Transform – Weighted Direct Standardization ...................................................... 489
10.23. Weights ........................................................................................................ 489
10.23.1 Weights ................................................................................................................................ 489
10.23.2 About weighting and scaling ................................................................................................ 490
10.23.3 Tasks – Transform – Weights… ............................................................................................ 492

11. Univariate Statistics .................................................................................................. 497

11.1. Descriptive statistics ............................................................................................. 497
11.2. Introduction to descriptive statistics .................................................................... 497
11.2.1 Purposes .............................................................................................................................. 497
11.2.2 The normal distribution ....................................................................................................... 498
11.2.3 Measures of central tendency ............................................................................................. 499
11.2.4 Measures of dispersion ........................................................................................................ 499
11.3. Tasks – Analyze – Descriptive Statistics… ............................................................. 501
11.3.1 Data input ............................................................................................................................ 501
11.3.2 Some important tips regarding the data input dialog .......................................................... 501
11.4. Interpreting descriptive statistics plots ................................................................ 502
11.4.1 Predefined descriptive statistics plots ................................................................................. 502
11.4.2 Plots accessible from the Statistics plot menu ..................................................................... 504
11.5. Descriptive statistics method reference ............................................................... 508
11.6. Bibliography .......................................................................................................... 508

12. Basic Statistical Tests ................................................................................................ 509

12.1. Statistical tests ...................................................................................................... 509
12.2. Introduction to statistical tests ............................................................................. 509
12.2.1 What are inferential statistics? ............................................................................................ 510
12.2.2 Hypothesis testing ............................................................................................................... 510
12.2.3 Tests for normality of data................................................................................................... 512
12.2.4 Tests for the equivalence of variances ................................................................................. 513
12.2.5 Tests for the comparison of means ..................................................................................... 515
12.2.6 Comparison of categorical data ........................................................................................... 517
12.3. Tasks – Analyze – Statistical Tests… ...................................................................... 518
12.4. Interpreting plots for statistical tests ................................................................... 523
12.4.1 Predefined plots for statistical tests .................................................................................... 524
12.5. Statistical tests method reference ........................................................................ 526
12.6. Bibliography .......................................................................................................... 526

13. Principal Components Analysis ................................................................................ 527

13.1. Principal Component Analysis (PCA) ..................................................................... 527
13.2. Introduction to Principal Component Analysis (PCA) ........................................... 527
13.2.1 Exploratory data analysis ..................................................................................................... 528
13.2.2 What is PCA? ........................................................................................................................ 528
13.2.3 Purposes of PCA ................................................................................................................... 528
13.2.4 How PCA works in short ....................................................................................................... 529
13.2.5 Main result outputs of PCA .................................................................................................. 533
13.2.6 How to interpret PCA results ............................................................................................... 536
13.2.7 PCA rotation ......................................................................................................................... 539
13.2.8 PCA algorithm options ......................................................................................................... 542
13.3. Tasks – Analyze – Principal Component Analysis… ............................................... 542
13.3.1 Model Inputs tab ................................................................................................................. 543
13.3.2 Weights tab .......................................................................................................................... 544
13.3.3 Validation tab....................................................................................................................... 546
13.3.4 Rotation tab ......................................................................................................................... 547
13.3.5 Algorithm tab ....................................................................................................................... 548
13.3.6 Autopretreatment tab ......................................................................................................... 550
13.3.7 Set Alarms tab ...................................................................................................................... 551
13.3.8 Warning Limits tab ............................................................................................................... 551
13.4. Interpreting PCA plots........................................................................................... 553
13.4.1 Predefined PCA plots ........................................................................................................... 554
13.4.2 Plots accessible from the PCA plot menu ............................................................................ 571
13.5. PCA method reference .......................................................................................... 582
13.6. Bibliography .......................................................................................................... 582

14. Multiple Linear Regression ....................................................................................... 583

14.1. Multiple Linear Regression ................................................................................... 583
14.2. Introduction to Multiple Linear Regression (MLR) ............................................... 583
14.2.1 Basics ................................................................................................................................... 583
14.2.2 Principles behind Multiple Linear Regression (MLR)............................................................ 585
14.2.3 Interpreting the results of MLR ............................................................................................ 586
14.2.4 More details about regression methods .............................................................................. 589
14.3. Tasks – Analyze – Multiple Linear Regression ...................................................... 589
14.3.1 Model Inputs tab ................................................................................................................. 589
14.3.2 Validation tab....................................................................................................................... 591
14.3.3 Autopretreatments tab ........................................................................................................ 594
14.3.4 Set Alarms tab ...................................................................................................................... 594
14.3.5 Warning Limits tab ............................................................................................................... 595
14.3.6 Variable weighting in MLR ................................................................................................... 596
14.4. Interpreting MLR plots .......................................................................................... 597
14.4.1 Predefined MLR plots........................................................................................................... 598
14.4.2 Plots accessible from the MLR Plot menu ............................................................................ 610
14.5. MLR method reference ......................................................................................... 616
14.6. Bibliography .......................................................................................................... 616

15. Principal Components Regression ............................................................................ 617

15.1. Principal Component Regression .......................................................................... 617
15.2. Introduction to Principal Component Regression (PCR) ....................................... 617
15.2.1 Basics ................................................................................................................................... 617
15.2.2 Interpreting the results of a Principal Component Regression (PCR) .................................. 618
15.2.3 Some more theory of PCR .................................................................................................... 620
15.2.4 PCR algorithm options ......................................................................................................... 620
15.3. Tasks – Analyze – Principal Component Regression ............................................. 621
15.3.1 Model Inputs tab ................................................................................................................. 621
15.3.2 Weights tabs ........................................................................................................................ 623
15.3.3 Validation tab....................................................................................................................... 625
15.3.4 Algorithm tab ....................................................................................................................... 626
15.3.5 Autopretreatment tab ......................................................................................................... 628
15.3.6 Set Alarms tab ...................................................................................................................... 629
15.3.7 Warning Limits tab ............................................................................................................... 629
15.4. Interpreting PCR plots ........................................................................................... 631
15.4.1 Predefined PCR plots ........................................................................................................... 634
15.4.2 Plots accessible from the PCR plot menu ............................................................................. 658
15.5. PCR method reference .......................................................................................... 673
15.6. Bibliography .......................................................................................................... 673

16. Partial Least Squares ................................................................................................ 675

16.1. Partial Least Squares regression ........................................................................... 675
16.2. Introduction to Partial Least Squares Regression (PLSR) ...................................... 675
16.2.1 Basics ................................................................................................................................... 675
16.2.2 Interpreting the results of a PLS regression ......................................................................... 676
16.2.3 Scores and loadings (in general) .......................................................................................... 677
16.2.4 More details about regression methods .............................................................................. 680
16.2.5 PLSR algorithm options ........................................................................................................ 681
16.3. Tasks – Analyze – Partial Least Squares Regression ............................................. 682
16.3.1 Model Inputs tab ................................................................................................................. 682
16.3.2 Weights tabs ........................................................................................................................ 684
16.3.3 Validation tab....................................................................................................................... 686
16.3.4 Algorithm tab ....................................................................................................................... 687
16.3.5 Autopretreatments tab ........................................................................................................ 689
16.3.6 Set Alarms tab ...................................................................................................................... 690
16.3.7 Warning Limits tab ............................................................................................................... 690
16.4. Interpreting PLS plots............................................................................................ 692
16.4.1 Predefined PLS plots ............................................................................................................ 695
16.4.2 Plots accessible from the PLS plot menu ............................................................................. 726
16.5. PLS method reference........................................................................................... 742
16.6. Bibliography .......................................................................................................... 742

17. L-PLS ........................................................................................................................................ 743

17.1. L-PLS regression .................................................................................................... 743
17.2. Introduction to L-PLS ............................................................................................ 743
17.2.1 Basics ................................................................................................................................... 743
17.2.2 The L-PLS model ................................................................................................................... 744
17.2.3 L-PLS by example ................................................................................................................. 745
17.3. Tasks – Analyze – L-PLS Regression ...................................................................... 746
17.3.1 Model inputs ........................................................................................................................ 746
17.3.2 X weights .............................................................................................................................. 748
17.3.3 Y weights .............................................................................................................................. 750
17.3.4 Z weights .............................................................................................................................. 750
17.4. Interpreting L-PLS plots......................................................................................... 751
17.4.1 Predefined L-PLS plots ......................................................................................................... 751
17.4.2 Plots accessible from the L-PLS menu .................................................................................. 758
17.5. L-PLS method reference ........................................................................................ 758
17.6. Bibliography .......................................................................................................... 758

18. Support Vector Machine Regression ........................................................................ 759

18.1. Support Vector Machine Regression (SVMR) ....................................................... 759
18.2. Introduction to Support Vector Machine (SVM) Regression (SVMR) ................... 759
18.2.1 Principles of Support Vector Machine (SVM) regression ..................................................... 759
18.2.2 What is SVM regression? ..................................................................................................... 760
18.2.3 Data suitable for SVM Regression ........................................................................................ 761
18.2.4 Main results of SVM regression ........................................................................................... 762
18.2.5 More details about SVM Regression .................................................................................... 763
18.3. Tasks – Analyze – Support Vector Machine Regression… ..................................... 763
18.3.1 Model input ......................................................................................................................... 763
18.3.2 Options ................................................................................................................................ 765
18.3.3 Grid Search........................................................................................................................... 768
18.3.4 Weights ................................................................................................................................ 768
18.3.5 Validation ............................................................................................................................. 770
18.4. Tasks – Predict – SVR Prediction… ........................................................................ 772
18.5. Interpreting SVM Regression results .................................................................... 773
18.5.1 Support vectors.................................................................................................................... 774
18.5.2 Parameters........................................................................................................................... 774
18.5.3 Probabilities ......................................................................................................................... 774
18.5.4 Diagnostics ........................................................................................................................... 775
18.5.5 Prediction ............................................................................................................................. 775
18.5.6 Prediction plot ..................................................................................................................... 775
18.5.7 Predicted values after applying the SVM model to new samples ........................................ 776
18.6. SVM method reference ......................................................................................... 776
18.7. Bibliography .......................................................................................................... 777

19. Multivariate Curve Resolution.................................................................................. 779

19.1. Multivariate Curve Resolution (MCR) ................................................................... 779
19.2. Introduction to Multivariate Curve Resolution (MCR).......................................... 779
19.2.1 MCR basics ........................................................................................................................... 780
19.2.2 Ambiguities and constraints in MCR .................................................................................... 782
19.2.3 MCR and 3-D data ................................................................................................................ 785
19.2.4 Algorithm implemented in The Unscrambler®: Alternating Least Squares (MCR-ALS) ........ 786
19.2.5 Main results of MCR ............................................................................................................ 788
19.2.6 Quality check in MCR ........................................................................................................... 789
19.2.7 MCR application examples ................................................................................................... 790
19.3. Tasks – Analyze – Multivariate Curve Resolution… .............................................. 791
19.3.1 Model Inputs ........................................................................................................................ 791
19.3.2 Options ................................................................................................................................ 792
19.4. Interpreting MCR plots ......................................................................................... 793
19.4.1 Predefined MCR plots .......................................................................................................... 794
19.5. MCR method reference ........................................................................................ 797
19.6. Bibliography .......................................................................................................... 797

20. Hierarchical Modeling .............................................................................................. 799

20.1. Hierarchical Modeling ........................................................................................... 799
20.2. Introduction to Hierarchical Modeling ................................................................. 799
20.2.1 Overall workflow.................................................................................................................. 799
20.2.2 Setup .................................................................................................................................... 800
20.2.3 Expected Scenarios .............................................................................................................. 800
20.3. Tasks – Analyze – Hierarchical Modeling .............................................................. 804
20.3.1 Defining actions ................................................................................................................... 805
20.3.2 Setting up a hierarchical model ........................................................................................... 811
20.3.3 Modifying an existing hierarchical model ............................................................................ 819
20.4. Prediction with Hierarchical Model ...................................................................... 819
20.5. Interpretation of results........................................................................................ 820

21. Segmented Correlation Outlier Analysis................................................................... 823

21.1. Segmented Correlation Outlier Analysis (SCA) ..................................................... 823
21.2. Introduction to Segmented Correlation Outlier Analysis (SCA) ............................ 823
21.3. Tasks – Analyze – Segmented Correlation Outlier Analysis… ............................... 826
21.4. Tasks – Predict – Conformity… ............................................................................... 829
21.5. SCA Conformity Prediction Plots........................................................................... 830
21.5.1 Predefined prediction plots ................................................................................................. 830
21.6. Save model for SCA Conformity Prediction .......................................................... 832
21.7. Interpreting SCA plots ........................................................................................... 833
21.7.1 Predefined SCA plots ........................................................................................................... 834
21.8. SCA method reference .......................................................................................... 843

22. Instrument Diagnostics............................................................................................. 845

22.1. Instrument Diagnostics ......................................................................................... 845
22.2. Introduction to Instrument Diagnostics................................................................ 845
22.2.1 RMS Noise ............................................................................................................................ 845
22.2.2 Peak Height/Peak Area (Peak Model) .................................................................................. 846
22.2.3 Peak Position........................................................................................................................ 846
22.2.4 Loss of Intensity ................................................................................................................... 847
22.2.5 PCA Projection ..................................................................................................................... 847
22.3. Tasks – Analyze – Instrument Diagnostics ............................................................ 847
22.3.1 Main Dialog .......................................................................................................................... 847
22.3.2 Add Model ........................................................................................................................... 848
22.3.3 RMS Noise ............................................................................................................................ 849
22.3.4 Peak Model .......................................................................................................................... 851
22.3.5 Peak Position........................................................................................................................ 854
22.3.6 Single Loss of Intensity Model ............................................................................................. 857
22.3.7 Principal Component Analysis Models ................................................................................. 858
22.4. Prediction with Instrument Diagnostics Model .................................................... 861

23. Spectral Diagnostics.................................................................................................. 865

23.1. Spectral Diagnostics .............................................................................................. 865
23.2. Introduction to Spectral Diagnostics .................................................................... 865
23.2.1 RMS Noise ............................................................................................................................ 865
23.2.2 Peak Height/Peak Area (Peak Model) .................................................................................. 866
23.2.3 Peak Position........................................................................................................................ 866
23.2.4 Loss of Intensity ................................................................................................................... 867
23.2.5 PCA Projection ..................................................................................................................... 867
23.3. Tasks – Analyze – Spectral Diagnostics ................................................................. 867
23.3.1 Main Dialog .......................................................................................................................... 867
23.3.2 Add Model ........................................................................................................................... 868
23.3.3 RMS Noise ............................................................................................................................ 869
23.3.4 Peak Model .......................................................................................................................... 871
23.3.5 Peak Position........................................................................................................................ 874
23.3.6 Single Loss of Intensity Model ............................................................................................. 876
23.3.7 Principal Component Analysis Models ................................................................................. 878
23.4. Prediction with Spectral Diagnostics Model ......................................................... 880

24. Cluster Analysis ........................................................................................................ 883

24.1. Cluster analysis ..................................................................................................... 883
24.2. Introduction to cluster analysis ............................................................................ 883
24.2.1 Basics ................................................................................................................................... 883
24.2.2 Principles of cluster analysis ................................................................................................ 884
24.2.3 Nonhierarchical clustering ................................................................................................... 884
24.2.4 Hierarchical clustering ......................................................................................................... 884
24.2.5 Quality of the clustering ...................................................................................................... 887
24.2.6 Main results of cluster analysis ............................................................................................ 888
24.3. Tasks – Analyze – Cluster Analysis… ..................................................................... 888
24.3.1 Inputs ................................................................................................................................... 889
24.3.2 Options for K-means/K-median clustering ........................................................................... 889
24.3.3 Results.................................................................................................................................. 891
24.4. Interpreting cluster analysis plots......................................................................... 892
24.4.1 Dendrogram ......................................................................................................................... 892
24.5. Cluster analysis method reference ....................................................................... 893

25. Projection ................................................................................................................. 895

25.1. Projection .............................................................................................................. 895
25.2. Introduction to projection of samples .................................................................. 895
25.2.1 Basics of projection .............................................................................................................. 895
25.2.2 How to interpret projected samples .................................................................................... 896
25.3. Tasks – Predict – Projection… ............................................................................... 898
25.3.1 Access the Projection functionality ...................................................................................... 898
25.4. Interpreting projection plots ................................................................................ 900
25.4.1 Predefined projection plots ................................................................................................. 901
25.4.2 Plots accessible from the Projection menu .......................................................................... 906
25.5. Projection method reference................................................................................ 913

26. SIMCA....................................................................................................................... 915

26.1. SIMCA classification .............................................................................................. 915
26.2. Introduction to SIMCA classification..................................................................... 915
26.2.1 Making a SIMCA model ........................................................................................................ 915
26.2.2 Classifying new samples....................................................................................................... 916
26.2.3 Main results of classification................................................................................................ 916
26.2.4 Outcomes of a classification ................................................................................................ 918
26.3. Tasks – Predict – Classification – SIMCA… ............................................................ 918
26.4. Interpreting SIMCA plots ...................................................................................... 921
26.4.1 Predefined SIMCA plots ....................................................................................................... 921
26.5. SIMCA method reference ..................................................................................... 926

27. Linear Discriminant Analysis..................................................................................... 927

27.1. Linear Discriminant Analysis ................................................................................. 927
27.2. Introduction to Linear Discriminant Analysis (LDA) classification ........................ 927
27.2.1 Basics ................................................................................................................................... 927
27.2.2 Data suitable for LDA ........................................................................................................... 928
27.2.3 Purposes of LDA ................................................................................................................... 928
27.2.4 Main results of LDA .............................................................................................................. 929
27.2.5 LDA application examples .................................................................................................... 929
27.2.6 How to interpret LDA results ............................................................................................... 929
27.2.7 Using an LDA model for classification of unknowns ............................................................ 930
27.3. Tasks – Analyze – Linear Discriminant Analysis .................................................... 930
27.3.1 Inputs ................................................................................................................................... 930
27.3.2 Weights ................................................................................................................................ 931
27.3.3 Options ................................................................................................................................ 932
27.3.4 Autopretreatment ............................................................................................................... 933
27.4. Tasks – Predict – Classification – LDA… ................................................................ 934
27.5. Interpreting LDA results ........................................................................................ 934
27.5.1 Prediction ............................................................................................................................. 935
27.5.2 Confusion matrix .................................................................................................................. 935
27.5.3 Loadings matrix .................................................................................................................... 936
27.5.4 Grand mean matrix .............................................................................................................. 936
27.5.5 Discrimination Plot............................................................................................................... 936
27.6. LDA method reference .......................................................................................... 936
27.7. Bibliography .......................................................................................................... 937

28. Support Vector Machine Classification..................................................................... 939

28.1. Support Vector Machine Classification (SVMC) .................................................... 939
28.2. Introduction to Support Vector Machine (SVM) classification ............................. 939
28.2.1 Principles of Support Vector Machine (SVM) classification ................................................. 939
28.2.2 What is SVM classification? ................................................................................................. 939
28.2.3 Data suitable for SVM classification ..................................................................................... 941
28.2.4 Main results of SVM classification ....................................................................................... 941
28.2.5 More details about SVM Classification ................................................................................ 942
28.2.6 SVM classification application examples ............................................................................. 942
28.3. Tasks – Analyze – Support Vector Machine classification .................................... 942
28.3.1 Model input ......................................................................................................................... 942
28.3.2 Options ................................................................................................................................ 943
28.3.3 Grid Search........................................................................................................................... 946
28.3.4 Weights ................................................................................................................................ 947
28.3.5 Validation ............................................................................................................................. 948
28.4. Tasks – Predict – Classification – SVM… ............................................................... 950
28.5. Interpreting SVM Classification results ................................................................. 951
28.5.1 Support vectors.................................................................................................................... 951
28.5.2 Confusion matrix .................................................................................................................. 951
28.5.3 Parameters........................................................................................................................... 952
28.5.4 Probabilities ......................................................................................................................... 952
28.5.5 Prediction ............................................................................................................................. 953
28.5.6 Accuracy ............................................................................................................................... 953
28.5.7 Plot of classification results ................................................................................................. 954
28.5.8 Classified range .................................................................................................................... 954
28.6. SVM method reference ......................................................................................... 955
28.7. Bibliography .......................................................................................................... 955

29. Batch Modeling ........................................................................................................ 957

29.1. Batch Modeling (BM) ............................................................................................ 957
29.2. Introduction to Batch Modeling (BM)................................................................... 957
29.2.1 What is Batch Modeling? ..................................................................................................... 957
29.3. Tasks – Analyze – Batch Modeling… ..................................................................... 957
29.3.1 Model Inputs tab ................................................................................................................. 957
29.3.2 Weights tab .......................................................................................................................... 959
29.3.3 Validation tab....................................................................................................................... 961
29.3.4 Warning Limits tab ............................................................................................................... 962
29.4. Interpreting BM plots............................................................................................ 964
29.4.1 Predefined BM plots ............................................................................................................ 965
29.5. BM method reference........................................................................................... 965

30. Moving Block ............................................................................................................ 967

30.1. Moving Block......................................................................................................... 967
30.2. Introduction to Moving Block .............................................................................. 967
30.2.1 Block Definitions .................................................................................................................. 967
30.2.2 Individual Block Mean (IBM) ................................................................................................ 968
30.2.3 Individual Block Standard Deviation (IBSD).......................................................................... 969
30.2.4 Moving Block Mean (MBM) ................................................................................................. 969
30.2.5 Moving Block Standard Deviation (MBSD) ........................................................................... 969
30.2.6 Percent Relative Standard Deviation (%RSD) ....................................................................... 970
30.3. Tasks – Analyze – Moving Block Methods ............................................................ 971
30.3.1 Input data pane.................................................................................................................... 971
30.3.2 Region .................................................................................................................................. 971
30.4. Interpreting moving block plots............................................................................ 972
30.4.1 Predefined moving block plots ............................................................................................ 973
30.5. Tasks – Predict – Moving Block Statistics.............................................................. 975
30.6. Set Moving Block Limits ........................................................................................ 976

31. Orthogonal Projections to Latent Structures ............................................................ 977

31.1. Orthogonal Projection to Latent Structures ......................................................... 977
31.2. Introduction to Orthogonal Projection to Latent Structures (OPLS) .................... 977
31.2.1 Predictive scores and predictive loading weights ................................................................ 978
31.2.2 Y-loadings............................................................................................................................. 978
31.2.3 Orthogonal scores and orthogonal loading weights and loadings ....................................... 978
31.3. Tasks – Analyze – Orthogonal Projection to Latent Structures ............................ 979
31.3.1 Model Inputs tab ................................................................................................................. 979
31.3.2 Weights tabs ........................................................................................................................ 980
31.3.3 Validation tab....................................................................................................................... 983
31.3.4 Autopretreatments .............................................................................................................. 984
31.4. Interpreting OPLS plots ......................................................................................... 985
31.4.1 Predefined OPLS plots.......................................................................................................... 985
31.5. OPLS method reference ........................................................................................ 994
31.6. Bibliography .......................................................................................................... 994

32. Prediction ................................................................................................................. 995

32.1. Prediction .............................................................................................................. 995
32.2. Introduction to prediction from regression models ............................................. 995
32.2.1 When can prediction be used? ............................................................................................ 995
32.2.2 How does prediction work? ................................................................................................. 996
32.2.3 Short prediction modes for MLR, PLSR and PCR .................................................................. 996
32.2.4 Full prediction by projection onto a PCR or PLSR model ..................................................... 996
32.2.5 Main results of prediction .................................................................................................... 997
32.3. Tasks – Predict – Regression… .............................................................................. 999
32.3.1 Access the Prediction functionality ...................................................................................... 999
32.4. Interpreting prediction plots............................................................................... 1003
32.4.1 Predefined prediction plots ............................................................................................... 1003
32.4.2 Plots accessible from the Prediction menu ........................................................................ 1004
32.5. Prediction method reference.............................................................................. 1008

33. Batch Prediction ..................................................................................................... 1009

33.1. Batch Prediction .................................................................................................. 1009
33.2. Tasks – Predict – Batch Predict ............................................................................ 1009
33.2.1 Inputs and outputs ............................................................................................................. 1009
33.2.2 Display................................................................................................................................ 1010
33.2.3 Options .............................................................................................................................. 1010
33.2.4 Outputs .............................................................................................................................. 1011

34. Multiple Model Comparison .................................................................................. 1013

34.1. Multiple Model Comparison ............................................................................... 1013
34.2. Multiple comparison of y-residuals .................................................................... 1013
34.3. Tasks – Predict – Multiple Model Comparison ................................................... 1013
34.4. Interpreting prediction plots............................................................................... 1015
34.4.1 Predefined prediction plots ............................................................................................... 1015
34.5. Method reference ............................................................................................... 1015

35. Tutorials.................................................................................................................. 1017

35.1. Tutorials .............................................................................................................. 1017


35.1.1 Content of the tutorials ..................................................................................................... 1017
35.1.2 How to use the tutorials .................................................................................................... 1017
35.1.3 Where to find the tutorial data files .................................................................................. 1017
35.2. Complete ............................................................................................................. 1018
35.2.1 Complete cases .................................................................................................................. 1018
35.2.2 Tutorial A: A simple example of calibration ....................................................................... 1019
35.2.3 Tutorial B: Quality analysis with PCA and PLS .................................................................... 1036
35.2.4 Tutorial C: Spectroscopy and interference problems ........................................................ 1069
35.2.5 Tutorial D1: Screening design ............................................................................................ 1092
35.2.6 Tutorial D2: Optimization design ....................................................................................... 1107
35.2.7 Tutorial E: SIMCA classification .......................................................................................... 1120
35.2.8 Tutorial F: Interacting with other programs ...................................................................... 1133
35.2.9 Tutorial G: Mixture design ................................................................................................. 1148
35.2.10 Tutorial H: PLS Discriminant Analysis (PLS-DA) .................................................................. 1164
35.2.11 Tutorial I: Multivariate curve resolution (MCR) of dye mixtures ....................................... 1177
35.2.12 Tutorial J: MCR constraint settings .................................................................................... 1189
35.2.13 Tutorial K: Clustering.......................................................................................................... 1202
35.2.14 Tutorial L: L-PLS Regression ............................................................................................... 1215
35.2.15 Tutorial M: Variable selection and model stability ............................................................ 1231
35.3. Quick ................................................................................................................... 1240
35.3.1 Quick start tutorials ........................................................................................................... 1240
35.3.2 Projection quick start ......................................................................................................... 1241
35.3.3 SIMCA quick start ............................................................................................................... 1243
35.3.4 MLR quick start .................................................................................................................. 1244
35.3.5 PCR quick start ................................................................................................................... 1247
35.3.6 PLS quick start .................................................................................................................... 1254


35.3.7 Prediction quick start ......................................................................................................... 1261


35.3.8 Cluster quick start .............................................................................................................. 1263
35.3.9 MCR quick start .................................................................................................................. 1265
35.3.10 LDA quick start ................................................................................................................... 1268
35.3.11 LDA classification quick start.............................................................................................. 1272
35.3.12 SVM quick start .................................................................................................................. 1273
35.3.13 SVM classification quick start ............................................................................................ 1277
35.3.14 PCA quick start ................................................................................................................... 1278

36. Data Integrity and Compliance ............................................................................... 1283

36.1. Data Integrity ...................................................................................................... 1283


36.2. Statement of Compliance ................................................................................... 1283
36.2.1 Introduction ....................................................................................................................... 1283
36.2.2 Overview ............................................................................................................................ 1283
36.2.3 Other software applications .............................................................................................. 1283
36.2.4 Statement of 21 CFR Part 11 Compliance .......................................................................... 1283
36.3. Compliance mode in The Unscrambler® X .......................................................... 1284
36.3.1 Main features of the compliance mode ............................................................................. 1284
36.3.2 A comprehensive approach to security and data integrity ................................................ 1285
36.4. Digital Signatures ................................................................................................ 1285
36.4.1 Digital Signature implementation in The Unscrambler® X ............................... 1285
36.4.2 How to assign a digital signature to a project .................................................................... 1286
36.4.3 How to tell if a project has been signed ............................................................................. 1287
36.4.4 Digital signatures and 21 CFR Part 11 ................................................................................ 1288
36.5. References .......................................................................................................... 1288

37. References.............................................................................................................. 1289

37.1. Reference documentation .................................................................................. 1289


37.2. Glossary of terms ................................................................................................ 1289
37.3. Method reference ............................................................................................... 1320
37.4. Keyboard shortcuts ............................................................................................. 1320
37.5. Smarter, simpler multivariate data analysis: The Unscrambler® X..................... 1321
37.5.1 Workflow oriented main screen ........................................................................................ 1322
37.5.2 A new look for a new generation ....................................................................................... 1322
37.5.3 New analysis methods ....................................................................................................... 1325
37.5.4 General improvements and inclusions summary ............................................................... 1327
37.6. What’s new in The Unscrambler® X version 10.3 ............................................... 1328
37.7. What’s new in The Unscrambler® X ver 10.2 ...................................................... 1329
37.8. Applicability......................................................................................................... 1329
37.9. Design of Experiments ........................................................................................ 1330


37.10. Overall Enhancements ............................................................................... 1330


37.11. Known Limitations in The Unscrambler® X ver 10.2 .................................. 1332
37.12. What’s new in The Unscrambler® X ver 10.1............................................. 1332
37.13. Data Import ................................................................................................ 1332
37.14. Data Export ................................................................................................ 1332
37.15. Applicability ............................................................................................... 1333
37.16. Design of Experiments ............................................................................... 1333
37.17. Overall Enhancements ............................................................................... 1333
37.18. Known Limitations in The Unscrambler® X ver 10.1 .................................. 1334
37.19. What’s new in The Unscrambler® X ver 10.0.1.......................................... 1334
37.20. Data Import ................................................................................................ 1334
37.21. Tutorials ..................................................................................................... 1334
37.22. Applicability ............................................................................................... 1335
37.23. Design of Experiments ............................................................................... 1335
37.24. Known Limitations in The Unscrambler® X ver 10.0.1 ............................... 1335
37.25. What’s new in The Unscrambler® X........................................................... 1336
37.26. System Requirements ................................................................................ 1337
37.27. Installation ................................................................................................. 1337

38. Bibliography ........................................................................................................... 1339

38.1. Bibliography ........................................................................................................ 1339


38.1.1 Statistics and multivariate data analysis ............................................................................ 1339
38.1.2 Basic statistical tests .......................................................................................................... 1341
38.1.3 Design of experiments ....................................................................................................... 1341
38.1.4 Multivariate curve resolution ............................................................................................ 1342
38.1.5 Classification methods ....................................................................................................... 1342
38.1.6 Data transformations and pretreatments .......................................................................... 1343
38.1.7 L-shaped PLS ...................................................................................................................... 1344
38.1.8 Martens’ uncertainty test .................................................................................................. 1344
38.1.9 Data formats ...................................................................................................................... 1344

1. Welcome to The Unscrambler® X
The Unscrambler® is a complete multivariate data analysis and experimental design software
solution, equipped with powerful methods including PCA, PLS, clustering and classification.

• Getting to know The Unscrambler®
• Video demonstration of the new user interface
• Migrating from earlier versions
• Tutorials
• Keyboard shortcuts
• How to use the help documentation

See the release notes for a list of fixes, new features and known limitations.

2. Support Resources
2.1. Support resources on our website
Our web site is filled with resources, case studies, recorded webinars as well as information
about our products and commercial offerings, including courses and professional services.

• Support
• Webinars
• Training courses
• Consulting

3. Overview
3.1. What is The Unscrambler® X?
A brief review of the tasks that can be carried out using The Unscrambler® X.

• Multivariate analysis simplified
• Make well-designed experimental plans
• Reformat, transform and plot data
• Study variations among one group of variables
• Study relations between two groups of variables
• Validate multivariate models with uncertainty testing
• Estimate new, unknown response values
• Classify unknown samples
• Reveal groups of samples

3.1.1 Multivariate analysis simplified


The main strength of The Unscrambler® X is to provide simple-to-use tools for the analysis of
any sort of multivariate data. This involves finding variations, co-variations and other internal
relationships in data matrices (tables). One can also use The Unscrambler® X to set up an
experimental design to achieve the maximum information as efficiently as possible.
The following are the basic types of problems that can be solved using The Unscrambler® X:

• Set up experiments, analyze effects and find optima using the Design of Experiments
  (DoE) module;
• Reformat and preprocess data to enhance future analyses;
• Find relevant variation in one data matrix (X);
• Find relationships between two data matrices (X and Y);
• Validate multivariate models with Uncertainty Testing;
• Resolve unknown mixtures by finding the number of pure components and
  estimating their concentration profiles and spectra;
• Predict the unknown values of a response variable;
• Classify unknown samples into various possible categories.

One should always remember, however, that there is no point in trying to analyze data if
they do not contain any meaningful information. Experimental design is a valuable tool for
building data tables which give such meaningful information. The Unscrambler® can help to
do this in an elegant way.
The Unscrambler® satisfies the US FDA’s requirements for 21 CFR Part 11 compliance.

3.1.2 Make well-designed experimental plans


Choosing samples carefully increases the chance of extracting useful information from data.
Furthermore, being able to actively experiment with the variables also increases the chance
of extracting relationships. The critical part is deciding which variables to change, which
intervals to use for this variation, and the pattern of the experimental points.


The purpose of experimental design is to generate experimental data that enable one to
determine which design variables (X) have an influence on the response variables (Y), in
order to understand the interactions between the design variables and thus determine the
optimum conditions. Of course, it is equally important to do this with a minimum number of
experiments to reduce costs. An experimental design program should offer appropriate
design methods and encourage good experimental practice, i.e. allow one to perform few
but useful experiments which span the important variations.
Screening designs (e.g. fractional, full factorial and Plackett-Burman) are used to find out
which design variables have an effect on the responses and are suitable for collection of data
spanning all important variations.
Optimization designs (e.g. central composite, Box-Behnken) aim to find the optimum
conditions for a process and generate nonlinear (quadratic) models. They generate data
tables that describe relationships in more detail, and are usually used to refine a model, i.e.
after the initial screening has been performed.
Whether the purpose of designed experiments is screening or optimization, there may be
multilinear constraints among some of the design variables. In such a case a D-optimal
design may be required.
Another special case is that of mixture designs, where the main design variables are the
components of a mixture. The Unscrambler® provides the classical types of mixture designs,
with or without additional constraints.
There are several methods for analysis of experimental designs. The Unscrambler® uses
Multiple Linear Regression (MLR) as its default method for orthogonal designs. For non-
orthogonal designs, or when the levels of a design cannot be reached, The Unscrambler®
allows the use of other methods, such as PCR or PLS, for this purpose.

3.1.3 Reformat, transform and plot data


Raw data may have a distribution that is not optimal for analysis. Background effects,
measurements in different units, different variances in variables etc. may make it difficult for
the methods to extract meaningful information. Preprocessing or transformations help in
reducing the “noise” introduced by such effects.
Before applying transforms, it is important to look at the data from a slightly different point
of view. Sorting samples or variables and transposing the data table are examples of such
reformatting operations.
Whether the data have been reformatted and transformed or not, a quick plot may reveal
more about the data than is to be seen with the naked eye on a mere collection of numbers.
Various types of plots are available in The Unscrambler®. They facilitate visual checks of
individual variable distributions, allow one to study the correlation between two variables, or
examine samples as, for example, a 3-D swarm of points or a 3-D landscape.

3.1.4 Study variations among one group of variables


A common problem is to determine which variables actually contribute to the variation seen
in a given data matrix; i.e. to find answers to questions such as

• “Which variables are necessary to describe the samples adequately?”
• “Which samples are similar to each other?”
• “Are there groups of samples in a particular data set?”
• “What is the meaning of these sample patterns?”


The Unscrambler® finds this information by decomposing the data matrix into a structured
part and a noise part, using a technique called Principal Component Analysis (PCA).

Other methods to describe one group of variables


Classical descriptive statistics are also available in The Unscrambler®. Mean, standard
deviation, minimum, maximum, median and quartiles provide an overview of the univariate
distributions of variables, allowing for their comparison. In addition, the correlation matrix
provides a summary of the covariations among variables.
In the case of instrumental measurements (such as spectra or voltammograms) performed
on samples representing mixtures of a few pure components at varying concentrations or at
different stages of a process (such as chromatography), The Unscrambler® offers a method
for recovering the unknown concentrations, called Multivariate Curve Resolution (MCR).

3.1.5 Study relations between two groups of variables


Another common problem is establishing a regression model between two data matrices.
For example, one may have a set of many inexpensive measurements (X) of properties of a
set of different solutions, and want to relate these measurements to the
concentration of a particular compound (Y) in the solution. The concentrations of the
particular compound are usually found using a reliable reference method.
In order to do this, it is necessary to find the relationship between the two data matrices.
This task varies somewhat depending on whether the data have been generated using
statistical experimental design or have simply been collected, more or less at random, from
a given population (i.e. non-designed data).

How to analyze designed data matrices


The variables in designed data tables (excluding mixture or D-optimal designs) are
orthogonal. Traditional statistical methods such as ANOVA and MLR are well suited to make
a regression model from orthogonal data tables.

How to analyze non-designed data matrices


The variables in non-designed data matrices are seldom orthogonal, but rather more or less
collinear with each other. MLR will most likely fail in such circumstances, so the use of
projection techniques such as PCR or PLS is recommended.

3.1.6 Validate multivariate models with uncertainty testing


Whatever the purpose in multivariate modeling – explore, describe precisely, build a
predictive model – validation is an important issue. Only a proper validation can ensure that
the model results are not too highly dependent on some extreme samples, and that the
predictive power of the regression model meets the experimental objectives.
With the help of Martens’ Uncertainty Test, the power of cross validation is further
increased and allows one to:
• Study the influence of individual samples in a model with powerful, simple to
  interpret graphical representations;
• Test the significance of the predictor variables and remove unimportant predictors
  from a PLS or PCR model.


3.1.7 Estimate new, unknown response values


A regression model can be used to predict new, i.e. unknown, Y-values. Prediction is a useful
technique as it can replace costly and time-consuming measurements. A typical example is
the prediction of concentrations from absorbance spectra instead of direct measurement
by, for example, titration.

3.1.8 Classify unknown samples


Classification simply means to find out whether new samples are similar to classes of
samples that have been used to make models in the past. If a new sample fits a particular
model well, it is said to be a member of that class. Classification can be done using several
different techniques including SIMCA, LDA, SVM classification and PLS-DA.
Many analytical tasks fall into this category. For example, raw materials may be sorted into
“good” and “bad” quality, finished products classified into grades “A”, “B”, “C”, etc.

3.1.9 Reveal groups of samples


Clustering attempts to group samples into ‘k’ clusters based on specific distance
measurements.
In The Unscrambler®, clustering can be applied to a data set using the K-Means algorithm, as
well as using hierarchical clustering (HCA). Seven different types of distance measurements
are provided (including Chebyshev and Bray-Curtis) along with popular algorithms, including
Ward’s method.
Overall, The Unscrambler® is a complete, all-in-one multivariate data analysis and design of
experiments package, which can be used to investigate data tables ranging from simple
through to extremely large and complex, for most applications. It provides the analytical
tools most commonly used and requested by data analysts. The plug-in architecture allows for
the inclusion of new transforms and methods as they become available, and software validation
has been greatly simplified as a result. The Unscrambler® meets the data security
requirements for regulated industries.

Related topics:
• User interface basics
• Principles of regression
• Principles of classification

3.2. Principles of classification


Multivariate classification is split into two equally important areas: cluster analysis and
discriminant analysis.
Cluster analysis methods can be used to find groups in the data without any predefined class
structure and are referred to as unsupervised learning. Cluster analysis is highly exploratory,
but can sometimes, especially at an early stage of an investigation, be very useful.
Discriminant analysis is a supervised classification method, as it is used to build classification
rules for a number of prespecified classes. These rules (model) are later used for allocating
new and unknown samples to the most probable class. Another important application of
discriminant analysis is to help in interpreting differences between groups of samples.


• Purposes of classification
• Classification methods
• SIMCA classification
• Linear Discriminant Analysis
• Support Vector Machines classification
• PLS Discriminant Analysis
• Steps in SIMCA classification
• Classifying new samples
• Outcomes of a classification
• Classification based on a regression model

3.2.1 Purposes of classification


The main goal of classification is to reliably assign new samples to existing classes (in a given
population). Note that classification is not the same as clustering.
One can also use classification results as a diagnostic tool:
• to distinguish among the most important variables to keep in a model (variables that
  “characterize” the population);
• or to find outliers (samples that are not typical of the population).
It follows that, contrary to regression, which predicts the values of one or several
quantitative variables, classification is useful when the response is a category variable that
can be interpreted in terms of several classes to which a sample may belong.
Examples of such situations are:
• Predicting whether a product meets quality requirements, where the result is simply
  “Yes” or “No” (i.e. a binary response).
• Modeling various close species of plants or animals according to their easily
  observable characteristics, so as to be able to decide whether new individuals
  belong to one of the modeled species.
• Modeling various diseases according to a set of easily observable symptoms, clinical
  signs or biological parameters, so as to help future diagnosis of those diseases.

3.2.2 Classification methods


This chapter presents the purpose of sample classification, and provides a brief overview of
the classification methods available in The Unscrambler®:

• Soft Independent Modeling of Class Analogy (SIMCA)
• Linear Discriminant Analysis (LDA)
• Support Vector Machine (SVM) Classification

Unsupervised classification methods:

• Cluster analysis
• Projection

Discriminant analysis is a kind of qualitative calibration, where the quantity to be calibrated
for is a category (group) variable, and not a continuous measurement as would be the case for
a quantitative calibration (regression).

Cluster analysis grew out of work by biologists working on numerical taxonomy, and is a
valuable visualization tool in data mining. One can perform clustering using either
partitional methods (K-means or K-medians clustering) or hierarchical agglomerative
clustering with different linkage measures (single-linkage, complete-linkage,
average-linkage, median-linkage, etc.). Agglomerative methods begin by treating each sample
as a single cluster, and merge the most similar clusters step by step until one large
cluster is formed.
The main categories of cluster analysis in The Unscrambler® are nonhierarchical clustering
(K-means, K-medians) and hierarchical cluster analysis (HCA).
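To make the distinction concrete, the following minimal Python sketch (an illustration only,
using scikit-learn and SciPy rather than The Unscrambler® itself) runs both a partitional
K-means clustering and a hierarchical agglomerative clustering with Ward's method on the
same synthetic data:

    # Illustrative sketch of the two clustering families described above.
    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    # Two synthetic groups of 20 samples, each described by 5 variables
    X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(4, 1, (20, 5))])

    # Nonhierarchical (partitional) clustering: K-means with k = 2
    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Hierarchical agglomerative clustering (Ward's method): each sample
    # starts as its own cluster, and the most similar clusters are merged
    # until one large cluster remains; here the tree is cut at 2 clusters.
    Z = linkage(X, method="ward")
    hca_labels = fcluster(Z, t=2, criterion="maxclust")

    print(km_labels)
    print(hca_labels)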
SIMCA classification
Soft Independent Modeling of Class Analogy (SIMCA) is based on making a PCA model for
each class in the training set. Unknown samples are then compared to the class models and
assigned to classes according to their analogy to the training samples.
Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is the simplest of all possible classification methods that
are based on Bayes’ formula. The objective of LDA is to determine the best fit parameters
for classification of samples by a developed model. The model can then be used to classify
unknown samples. It is based on the normal distribution assumption and the assumption
that the covariance matrices of the two (or more) groups are identical.
Support Vector Machines classification
Support Vector Machines (SVM) is a classification method based on statistical learning.
Sometimes, a linear function is not able to model complex separations, so SVM employs
kernel functions to map from the original space to the feature space. The function can be of
many forms, thus providing the ability to handle nonlinear classification cases. The kernels
can be viewed as a mapping of nonlinear data to a higher dimensional feature space, while
providing a computation shortcut by allowing linear algorithms to work with higher
dimensional feature space.
PLS Discriminant Analysis
The discriminant analysis approach differs from the SIMCA approach in that it assumes that
a sample has to be a member of one of the classes included in the analysis. The most
common case is that of a binary discriminant variable: a question with a Yes / No answer.
Binary discriminant analysis is performed using regression, with the discriminant variable
coded 0 / 1 (Yes = 1, No = 0) as the Y-variable in the model.
With PLS, this can easily be extended to the case of more than two classes. Each class is
represented by an indicator variable, i.e. a binary variable with value 1 for members of that
class, 0 for non-members. By building a PLS model with all indicator variables as Y, one can
directly predict class membership from the X-variables describing the samples. The model is
interpreted by viewing the Predicted vs. Reference plot for each class indicator Y-variable:

• Ypred > 0.5 means “roughly 1”, that is to say, a member;
• Ypred < 0.5 means “roughly 0”, that is to say, a non-member.

Once the PLS model has been checked and validated (see the chapter about multivariate
regression for more details on diagnosing and validating a model), one can run a Prediction
in order to classify new samples. The prediction results are interpreted by viewing the plot
Predicted with Deviations for each class indicator Y-variable:


• Samples with Ypred > 0.5 and a deviation that does not cross the 0.5 line are
  predicted members;
• Samples with Ypred < 0.5 and a deviation that does not cross the 0.5 line are
  predicted nonmembers;
• Samples with a deviation that crosses the 0.5 line cannot be safely classified.
See Chapter Prediction for more details on how to run a prediction and interpret results. A
tutorial explaining PLS-DA in practice is also available: PLS Discriminant Analysis.
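The indicator-coding idea behind PLS-DA can also be sketched in a few lines of Python. The
example below uses scikit-learn's PLSRegression as an assumed stand-in for the PLS engine
(in The Unscrambler® the model is built and applied through the menus, not through code):

    # Illustrative PLS-DA sketch: code a binary class as 0/1, fit PLS,
    # then classify new samples by the 0.5 threshold described above.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(40, 10))               # 40 training samples, 10 X-variables
    y = (X[:, 0] + X[:, 1] > 0).astype(float)   # indicator variable: 1 = member

    pls = PLSRegression(n_components=2).fit(X, y)

    X_new = rng.normal(size=(5, 10))            # "unknown" samples
    y_pred = pls.predict(X_new).ravel()
    membership = y_pred > 0.5                   # Ypred > 0.5 -> predicted member
    print(np.round(y_pred, 2), membership)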

3.2.3 Steps in SIMCA classification


Solving a classification problem requires two steps:
• Modeling: Build one separate model for each class;
• Classifying new samples: Fit each sample to each model and decide whether the
  sample belongs to the corresponding class.
The modeling stage implies that enough samples have been identified as members of each
class to be able to build a reliable model. It also requires enough variables to describe the
samples accurately.
The actual classification stage uses significance tests, where the decisions are based on
statistical tests performed on the object-to-model distances.

3.2.4 Classifying new samples


Once each class has been modeled, and provided that the classes do not overlap too much,
new samples can be fitted to (projected onto) each model. This means that for each sample,
new values for all variables are computed using the scores and loadings of the model, and
compared to the actual values.
The residuals are then combined into a measure of the object-to-model distance.
The scores are also used to build up a measure of the distance of the sample to the model
center, called leverage.
Finally, both object-to-model distance and leverage are taken into account to decide which
class(es) the sample belongs to.
The classification decision rule is based on a classical statistical approach. If a sample belongs
to a class, it should have a small distance to the class model (the ideal situation being
“distance=0”). Given a new sample, one needs to compare its distance to the model to a
class membership limit reflecting the probability distribution of object-to-model distances
around zero.
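As a simplified illustration of an object-to-model distance (a sketch in the spirit of
SIMCA, not The Unscrambler's implementation, and omitting the leverage and significance-test
parts), one can fit a PCA model to a single class and measure the residual left after
projecting a new sample onto that model:

    # Simplified object-to-model distance via a class PCA model.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)
    class_A = rng.normal(0, 1, (30, 6))      # training samples for one class

    pca = PCA(n_components=2).fit(class_A)

    def distance_to_model(sample):
        scores = pca.transform(sample.reshape(1, -1))   # project onto model
        fitted = pca.inverse_transform(scores)          # back-projection
        return np.linalg.norm(sample - fitted.ravel())  # residual distance

    member = class_A[0] + rng.normal(0, 0.1, 6)   # close to the class
    outsider = rng.normal(5, 1, 6)                # far from the class
    print(distance_to_model(member), distance_to_model(outsider))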

3.2.5 Outcomes of a classification


There are three possible outcomes of a classification:

• Unknown sample belongs to one class;
• Unknown sample belongs to several classes;
• Unknown sample belongs to none of the classes.

The first case is the easiest to interpret.


If the classes have been modeled with enough precision, the second case should not occur
(no overlap). If it does occur, this means that the class models might need improvement, i.e.
more calibration samples and/or additional variables should be included.


The last case is not necessarily a problem. It may be a quite interpretable outcome,
especially in a one-class problem. A typical example is product quality prediction, which can
be done by modeling the single class of acceptable products. If a new sample belongs to the
modeled class, it is accepted; otherwise, it is rejected.

3.2.6 Classification based on a regression model


Throughout this chapter, SIMCA classification is described as a method involving disjoint PCA
modeling. Instead of PCA models, one can also use PCR or PLS models. In those cases, only
the X-part of the model will be used. The results will be interpreted in exactly the same way.
SIMCA classification based on the X-part of a regression model is a nice way to detect
whether new samples are suitable for prediction. If the samples are recognized as members
of the class formed by the calibration sample set, the predictions for those samples should
be reliable. Conversely, one should avoid using any model for extrapolation, i.e. making
predictions on samples which are rejected by the classification.
Besides, classification may also be achieved with Linear Discriminant Analysis (LDA),
which is an alternative to SIMCA.

3.3. How to use help


The help system has been implemented to provide help and advice to those working with
The Unscrambler®. Help covers use of the dialogs and methods, and interpretation of plots.
For best viewing of the contents, users are recommended to use Internet Explorer 7.0 or
higher.

• How to open the help documentation
• Browsing the contents
• Searching the contents
• Typographic cues

3.3.1 How to open the help documentation


Press the F1 key or click on the ? help button near the top right corner of the active dialog
window to read help for the appropriate topic.
The help documentation can also be opened for browsing by selecting Help - Contents from
the menu, or pressing the Help button in the toolbar.
Several levels of help are available. Click on underlined words to follow built-in hypertext
links to related topics.

3.3.2 Browsing the contents


The Help documentation can be read as a book by clicking through the chapters and
sections, accessing chapters from the table of contents displayed to the left.
The left-most window consists of two tabs for switching between a Contents hierarchical
view, and the Search utility.

3.3.3 Searching the contents


The search engine allows one to search for occurrences of one or several words. Select a
page from the result list to read it.


Use Find in page to search for a phrase within the current page.

3.3.4 Typographic cues


The help documentation text itself provides typographic cues to the reader:
• Emphasized text (italic) indicates important concepts, or variables.
• Strong emphasis (bold) indicates actions, e.g. a menu entry or button.
• Dotted underline indicates abbreviations. Hover the mouse pointer over such text for
  a tooltip explanation of the acronym.
• Computer code text indicates file name selectors like *.unsb, and command input
  such as X=sqrt(X).
• A globe icon indicates that the hypertext link will open external content in the
  system default web browser, such as http://www.camo.com/
• A table grid icon indicates that the hypertext link will open, import or download a
  data set, like this: Import the tutorial A data
• Hovering the mouse pointer over figures will display the caption as a tooltip.
Useful tips are put in text boxes like this.
Caution notes are put in text boxes like this.

3.4. Principles of regression


Regression is used to find out how well some predictor variables (X) explain the
variations in some response variables (Y), using methods such as MLR, PCR, PLSR and L-PLSR.

• What is regression?
• General notation and definitions
• The whys and hows of regression modeling
• What is a good regression model?
• Regression methods in The Unscrambler®
• Multiple Linear Regression (MLR)
• Principal Component Regression (PCR)
• Partial Least Squares Regression (PLSR)
• L-PLS Regression
• Support Vector Machine Regression (SVMR)
• Calibration, validation and related samples
• Main results of regression
• Making the right choice with regression methods
• How to interpret regression results
• How to detect nonlinearities (lack of fit)
• What are outliers and how are they detected?
• Guidelines for calibration of spectroscopic data

3.4.1 What is regression?


Regression is a generic term used for all methods that attempt to model and analyze several
variables with the purpose of building a relationship between two groups of variables,
namely the independent and dependent variables. The fitted model may then be used to
either just describe the relationship between the two groups of variables, or to predict new
values.


General notation and definitions


The two data matrices involved in regression are usually denoted X (independent,
predictors) and Y (dependent, responses), and the purpose of regression is to build a model
Y = f(X). Such a model is used to explain, or predict, the variations in the Y-variable(s)
from the variations in the X-variable(s). The link between X and Y is achieved through a
common set of samples for which both X- and Y-values have been collected.
Names for X and Y
The X- and Y-variables can be denoted with a variety of terms, according to the particular
context (or culture). The most common ones are listed in the table below:
Usual names for X- and Y-variables

Context          X                          Y
---------------  -------------------------  -------------------
General          Predictors                 Responses
MLR              Independent Variables      Dependent Variables
Designed Data    Factors, Design Variables  Responses
Spectroscopy     Spectra                    Constituents
Chromatography   Chromatograms              Concentrations

Univariate vs. multivariate regression


Univariate regression uses a single predictor to define a relationship with a response. The
classical example in chemistry is the Beer-Lambert law for spectroscopy, where a straight
line model is established to relate concentration to absorbance. In this case, physical sample
preparation is required to “clean the signal” to ensure that the relationship between
absorbance and concentration holds. However, in most practical applications a single
predictor is not sufficient to model a property precisely. The form of the model is described
by

    y = b0 + b1x + e

where b0 is an intercept term and b1 is a regression coefficient; in this case, the slope of the
straight line.
Multivariate regression takes into account several predictor variables, thus modeling the
property of interest with more accuracy. The form of the model is

    y = b0 + b1x1 + b2x2 + … + bpxp + e

where the terms in the equation are defined as usual. This chapter focuses on the general
principles of multivariate regression.
The whys and hows of regression modeling
Building a regression model involves collecting the predictors and the corresponding
response values for a set of samples, and then finding the optimal parameters in a
predefined mathematical relationship to the collected data. A commonly used measure of
optimality is the minimization of the sum of squares of the deviations between the
measured and predicted responses.
For example, in analytical chemistry, spectroscopic measurements are made on solutions
with known concentrations of a component of interest. Regression is then used to relate the
concentration of the component of interest to the spectrum.


Once a regression model has been built, it can be used to predict the unknown
concentration for new samples, using the spectroscopic measurements as predictors. The
advantage is obvious if the concentration is difficult or expensive to measure directly.
Replacement with the spectroscopic method is less expensive and in some cases, requires
minimal to no sample preparation. It also allows for development of spectroscopic
measurements for real-time process monitoring.
The most common motivations for developing regression models as predictive tools may
include:
 Replacement of expensive or time-consuming analysis methods, with cheap, rapid,
easy-to-perform measurements (e.g. NIR spectroscopy, mass spectrometry for gas
analysis).
 When one wants to build a response surface model from the results of some
experimental design, i.e. describe precisely the response levels according to the
values of a few controlled factors.
What is a good regression model?
The purpose of a regression model is to extract all the information relevant for the
prediction of the response from the available data.
Unfortunately, observed data usually contains some amount of noise and in some cases,
irrelevant information.
Noise can be random variation in the response due to experimental error, or it can be
random variation in the data values due to measurement error. It may also be some amount
of response variation due to factors which are not included in the model.
Irrelevant information is carried by predictors which have little or nothing to do with the
modeled phenomenon. For instance, when developing a model to predict the concentration of
a compound in solution, NIR absorbance spectra may carry some information relative to the
solvent and not only to the compound of interest.
A good regression model should be able to:
• Model only relevant information, by highly weighting relevant sources of information
  and downweighting any irrelevant variation.
• Avoid overfitting, i.e. distinguish between variation in the response that can be
  explained by variation in the predictors, and variation caused by mere noise.
Regression methods in The Unscrambler®
The Unscrambler® provides five regression method choices:

• Multiple Linear Regression (MLR)
• Principal Component Regression (PCR)
• Partial Least Squares Regression (PLSR)
• L-PLS Regression (L-PLSR)
• Support Vector Machine Regression (SVMR)

3.4.2 Multiple Linear Regression (MLR)


MLR is a well-known statistical method based on ordinary least squares regression. It
estimates the model coefficients by the equation:

    B = (X'X)^-1 X'Y

where X' denotes the transpose of X. This operation involves a matrix inversion, which can
be numerically unstable when there is collinearity, that is, when the variables are not
linearly independent. Incidentally, this is the reason why the predictors are called
independent variables in MLR; the ability to vary independently of each other is a crucial
requirement for variables used as predictors with this method. MLR requires more samples
than predictors, since a system with more variables than samples does not have a unique
solution.
The Unscrambler® uses the QR decomposition to find the MLR solution. No missing values
are accepted.
More details about MLR regression can be found in the section Multiple Linear Regression
(MLR)
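As a numerical illustration (a sketch only, using NumPy; this is not The Unscrambler's
source code), the explicit normal-equation solution and the more stable QR-based solution
can be compared directly:

    # Solving B = (X'X)^-1 X'Y two ways: explicit normal equations,
    # and the numerically more stable QR decomposition.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 4))            # 30 samples, 4 independent variables
    true_b = np.array([1.0, -2.0, 0.5, 3.0])
    y = X @ true_b + rng.normal(0, 0.1, 30)

    # Normal equations (the matrix inversion discussed above)
    b_normal = np.linalg.inv(X.T @ X) @ X.T @ y

    # QR decomposition: X = QR, then solve R b = Q'y
    Q, R = np.linalg.qr(X)
    b_qr = np.linalg.solve(R, Q.T @ y)

    print(np.allclose(b_normal, b_qr))      # True: same solution, stabler route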

3.4.3 Principal Component Regression (PCR)


PCR is a two-step procedure which first decomposes the X-matrix by PCA, then fits an MLR
model, using the PCs instead of the original X-variables as predictors.
[Figure: PCR procedure]

More about PCR can be found in the help section Principal Component Regression (PCR)
More information about the PCR algorithm can be found in Method References.
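The two-step logic can be sketched with scikit-learn (an assumed stand-in for illustration,
not the actual implementation): PCA first compresses X into a few component scores, then an
ordinary least squares model regresses y on those scores:

    # PCR sketch: PCA on X, then MLR on the principal component scores.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 20))                  # spectra-like data
    X[:, 1] = X[:, 0] + rng.normal(0, 0.01, 50)    # introduce collinearity
    y = 2 * X[:, 0] - X[:, 2] + rng.normal(0, 0.1, 50)

    pcr = make_pipeline(PCA(n_components=5), LinearRegression())
    pcr.fit(X, y)                                  # PCA step, then MLR step
    print(pcr.predict(X[:3]))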

3.4.4 Partial Least Squares Regression (PLSR)


Partial Least Squares regression (PLSR, sometimes referred to as Projection to Latent
Structures or simply PLS) models both the X- and Y-matrices simultaneously to find the
latent variables in X that will best predict the latent variables in Y. These PLSR components
are similar to principal components; however, they are referred to as factors.
[Figure: PLSR procedure]

More about PLS regression can be found in the help section Partial Least Squares Regression
(PLSR)
More details regarding the PLSR algorithm are given in the Method References.
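For comparison with the PCR sketch above, an equivalent PLSR illustration (again assuming
scikit-learn as a stand-in):

    # PLSR sketch: factors are extracted using both X and Y, so fewer
    # components are usually needed than in PCR.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 20))
    y = 2 * X[:, 0] - X[:, 2] + rng.normal(0, 0.1, 50)

    pls = PLSRegression(n_components=3).fit(X, y)
    print(pls.predict(X[:3]).ravel())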

3.4.5 L-PLS Regression


Traditionally, science demanded that a one-to-one relationship between a cause and effect
existed; however, this tradition can hinder the study of more complex systems. Such systems
may be characterized by many-to-many relationships, which are often hidden in large tables
of data.
In some cases, the Y data may have descriptors of its columns, organized in a third table Z
(containing the same number of columns as in Y).
The three matrices X, Y and Z can together be visualized in the form of an L-shaped
arrangement. Such data analysis has potential widespread use in areas such as consumer
preference studies, medical diagnosis and spectroscopic applications.


More about L-PLS regression can be found in the help section L-PLS Regression
More details regarding the L-PLSR algorithm are given in the Method References.

3.4.6 Support Vector Machine Regression (SVMR)


Unlike the bilinear methods of PCR/PLSR, Support Vector Machine Regression (SVMR) uses
kernels to transform nonlinear systems into linear systems before the application of
regression. This is done by selecting an appropriate kernel and fine-tuning its parameters
to achieve an acceptable result (if such a result exists).
A simple diagrammatic representation of SVMR is provided below.
[Figure: How SVMR works]

More about SVMR can be found in the help section Support Vector Machine Regression
(SVMR)
More details regarding the SVMR algorithm are given in the Method References.
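A minimal SVMR sketch (illustrative only, using scikit-learn's SVR; the RBF kernel and the
parameter values are assumptions chosen for this example, not recommendations):

    # SVMR sketch: an RBF kernel maps a nonlinear relationship into a
    # space where a linear, epsilon-insensitive regression can be fitted.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(-3, 3, (60, 1)), axis=0)
    y = np.sin(X).ravel() + rng.normal(0, 0.1, 60)   # nonlinear system

    svr = SVR(kernel="rbf", C=10.0, epsilon=0.05, gamma="scale").fit(X, y)
    print(svr.predict(X[:5]))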

3.4.7 Calibration, validation and related samples


All regression modeling must include some form of validation (i.e. testing) to make sure that
the results obtained can be applied to new data. This requires two separate steps in the
computation of a model, whether it be PCA, MLR, PCR, PLSR, etc.
Calibration
Modeling the relevant information in a set of data used as a training set.


Validation
Checking whether the model is capable of performing its task on a separate test set
of data.
Calibration is the fitting stage in the regression modeling process. The main data set,
containing only the calibration sample set, is used to compute the model parameters (PCs,
regression coefficients).
It is essential to validate models to get an idea of how well a regression model will perform
when it is used to predict new, unknown samples. A test set consisting of samples with
known response values is used. Only the X-values are fed into the model, from which
response values are predicted and compared to the known, actual response values. The
model is validated if the prediction residuals are low and there is no evidence of lack of fit in
the model.
Each of the two steps described above requires its own set of samples; thus, the following
terms are used interchangeably: calibration samples = training samples, and validation
samples = test samples.
A more detailed description of validation techniques and their interpretation is to be found
in the chapter Validate a Model.
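The calibration/validation workflow can be illustrated generically as follows (a sketch, not
The Unscrambler's own validation code), with RMSEP computed as the root mean square of the
prediction residuals on the test set:

    # Calibration on a training set, validation on a separate test set.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(80, 15))
    y = X[:, 0] - 0.5 * X[:, 3] + rng.normal(0, 0.1, 80)

    X_cal, X_test, y_cal, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
    model = PLSRegression(n_components=3).fit(X_cal, y_cal)   # calibration
    y_pred = model.predict(X_test).ravel()                    # validation
    rmsep = np.sqrt(np.mean((y_test - y_pred) ** 2))
    print(f"RMSEP = {rmsep:.3f}")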

3.4.8 Main results of regression


The main results of a regression analysis vary depending on the method used. They may be
roughly divided into two categories:
Diagnosis
results that are used to check the validity and quality of the model;
Interpretation
results that provide mechanistic insights into the relationship between X and Y, as
well as (for projection methods only) sample properties.
Note: Some results, e.g. scores, may be considered as belonging to both categories
(scores can help in the detection of outliers, and they also give information about
differences or similarities among samples).
The table below lists the various types of regression results computed in The Unscrambler®,
their application area (diagnosis or interpretation) and the regression method(s) for which
they are available.
Regression results available for each method

Result                Appl.   MLR   PCR   PLSR
--------------------  ------  ----  ----  ----
B-coefficients        I       X     X     X
Predicted Y-values    I,D     X     X     X
Residuals             D       X     X     X
Error Measures        D       X     X     X
ANOVA                 D       X
Scores and Loadings   I,D           X     X
Loading weights       I,D                 X


In short, all three regression methods give a model with an equation expressed by the
regression coefficients (b-coefficients), from which predicted Y-values are computed. For all
methods, residuals can be computed as the difference between predicted (fitted) values and
actual (observed) values; these residuals can then be combined into error measures that tell
how well a model performs.
PCR and PLSR, in addition to those standard results, provide powerful interpretation and
diagnostic tools linked to projection: more elaborate error measures, as well as scores and
loadings.
The simplicity of MLR, on the other hand, allows for simple significance testing of the model
with ANOVA and of the b-coefficients with a Student’s t-test (ANOVA will not be presented
hereafter; read more about it in the ANOVA section from Chapter “Analyze Results from
Designed Experiments”.) However, significance testing is also possible in PCR and PLSR, using
Martens’ Uncertainty Test.

B-coefficients
The regression model can be written

    y = b0 + b1x1 + b2x2 + … + bpxp + e

meaning that the observed response values (Y) are approximated by a linear combination of
the values of the predictors (X). The coefficients of that combination are called regression
coefficients or B-coefficients.
Several diagnostic statistics are associated with the regression coefficients (available only for
MLR):
Standard error is a measure of the precision of the estimation of a coefficient;
From that, a Student's t-value can be computed;
Comparing the t-value to a reference t-distribution will then yield a significance level or p-
value, which indicates whether a regression coefficient is significantly different
from 0. If the t-value is found to be nonsignificant, this means that the regression coefficient
cannot be distinguished from 0.
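These are the statistics reported by any standard OLS summary; for instance (a generic
illustration using the statsmodels library, not The Unscrambler's output):

    # Standard error, t-value and p-value for each MLR coefficient.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 3))
    y = 2.0 * X[:, 0] + rng.normal(0, 0.5, 40)   # only the first predictor matters

    ols = sm.OLS(y, sm.add_constant(X)).fit()
    print(ols.bse)       # standard errors of the coefficients
    print(ols.tvalues)   # Student's t-values
    print(ols.pvalues)   # significance levels (p-values)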

Predicted Y-values
Predicted Y-values are computed for each sample by applying the model equation (i.e. the B-
coefficients) to new (or existing) observed X-values.
For PCR or PLSR models, the predicted Y-values can also be computed using projection along
the successive components of the model. This has the advantage of diagnosing samples
which are badly represented by the model, and therefore have high prediction uncertainty.
This is discussed more fully in the chapter Predictions.

Residuals
For each sample, the residual is the difference between the observed Y-value and the
predicted Y-value. It appears as the term e in the model equation.
More generally, residuals may also be computed for each fitting operation in a projection
model: thus the samples have X- and Y-residuals along each PC (factor) in PCR and PLSR
models. Read more about how sample and variable residuals are computed in the chapter
More Details About the Theory of PCA.


Scores and loadings (in general)


In PCR and PLSR models, scores and loadings express how the samples and variables are
projected along the model components.
PCR uses the same scores and loadings as PCA, since PCA is used in the decomposition of X. Y
is then projected onto the “plane” defined by the MLR equation, and no extra scores or
loadings are required to express this operation.
Read more about PCA scores and loadings in Chapters PCA and How to Interpret PCA Scores
and Loadings. PCR and PLSR scores and loadings are presented in the relevant sections for
these topics.
L-PLSR is further described in the method section on this topic. L-PLSR

3.4.9 Making the right choice with regression methods


It may be somewhat confusing to have a choice between several different methods that
apparently solve the same problem, i.e. fit a model in order to approximate Y as a
function of X.
The sections that follow provide a comparison of these methods and may aid in selecting
the one which is best suited to specific analysis objectives.

MLR vs. PCR vs. PLSR vs. SVMR


MLR has the following properties and behavior:
• The number of X-variables must be smaller than the number of samples;
• In case of collinearity among X-variables, the b-coefficients are not reliable and the
  model may be unstable;
• MLR tends to overfit when noisy data are used.
PCR and PLSR are projection methods, like PCA.
Model components are extracted in such a way that the first PC/factor explains the largest
amount of variation, followed by the second PC/factor, etc. At a certain point, the variation
modeled by any new PC/factor is mostly noise. The optimal number of PCs/factors -
modeling useful information, but avoiding overfitting - is determined with the help of the
residual variances.
PCR uses MLR in the regression step; a PCR model using all PCs gives the same solution as
MLR (as does a PLSR model using all factors).
If one were to run MLR, PCR and PLSR on the same data, their performance could be
compared by checking validation errors (Predicted vs. Measured Y-values for validation
samples, RMSEP).
It should also be noted that both MLR and PCR can only model one Y-variable at a time.
The difference between PCR and PLSR lies in the algorithm. PLSR uses the information lying
in both X and Y to fit the model, switching between X and Y iteratively to find the relevant
factors. So PLSR often needs fewer factors to reach the optimal solution because the focus is
on the prediction of the Y-variables (not on achieving the best projection of X as in PCA).
SVMR is a special class of regression that is very distinct from all of the methods described
above. SVMR uses kernels to map variable space to feature space in order to minimize
particular errors associated with the calibration development. This is done by:

• Selecting a specific kernel function that is capable of mapping the variable space.
• Fine-tuning the parameters of the chosen function such that the best calibration and
  prediction statistics are achieved.


SVMR provides the least graphical output and diagnostic statistics of all the regression
methods implemented in The Unscrambler®, and developing robust models with it can often
be a difficult task for the user. However, when they work, SVMR models are much better able
to handle nonlinearities than MLR/PCR/PLSR models and can provide an alternative method to
Artificial Neural Networks (ANN).

How to select a regression method


If there is more than one Y-variable, PLSR is usually the best method if the objective is to
interpret all variables simultaneously. It is often argued that PLSR or PCR gives better
prediction ability. This is usually true if there are strong nonlinearities in the data, in which
case modeling each Y-variable separately according to its own nonlinear features might
perform better than trying to build a common model for all Ys. On the other hand, if the Y-
variables are somewhat noisy, but strongly correlated, PLSR is the best way to model the
whole information and minimize the influence of noise.
The difference between PLSR and PCR in prediction error is usually quite small, but PLSR will
usually give results comparable to PCR results using fewer components.
MLR should only be used if the number of X-variables is low (around 20 or less) and there
are only small correlations among them.
Formal tests of significance for the regression coefficients are well-known and accepted for
MLR. If using PCR or PLSR, one can check the stability of the results and the significance of
the regression coefficients with Martens’ Uncertainty Test.
SVMR should be considered when it is known a priori that non-linearity will affect the
system and attempts should be made to find a kernel function that best handles this.

3.4.10 How to interpret regression results


Once a regression model is built, one needs to diagnose it, i.e. assess its quality, before
interpreting the relationship between X and Y. Finally, the model will be ready for use for
prediction once it has been thoroughly checked and refined.
The various types of results from MLR, PCR and PLS regression models and more information
about the interpretation of projection results (scores and loadings) and variance curves for
PCR and PLSR can be found in the corresponding chapters covering each method.
How to detect nonlinearities (lack of fit)
Different types of residual plots can be used to detect nonlinearities or lack of fit. If the
model is good, the residuals should be randomly distributed, and these plots should be free
from systematic trends.
The most useful residual plots are the Y-residuals vs. predicted Y and Y-residuals vs. scores
plots. Variable residuals and Normal Probability Plots can also be useful.
The PLSR X-Y Relation Outliers plot is also a powerful tool to detect nonlinearities, since it
shows the shape of the relationship between X and Y along one specific model factor.
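The sketch below (a hypothetical illustration, assuming NumPy and matplotlib) shows how a Y-residuals vs. predicted Y plot reveals lack of fit: a curved band of points indicates non-linearity, while a random cloud around the zero line indicates a good fit.

    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(2)
    y_pred = np.linspace(0, 10, 100)                    # predicted Y-values
    # a systematic (quadratic) trend plus noise, mimicking lack of fit
    residuals = 0.05 * (y_pred - 5) ** 2 + rng.normal(scale=0.2, size=100)

    plt.scatter(y_pred, residuals)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("Predicted Y")
    plt.ylabel("Y-residual")
    plt.title("Y-residuals vs. predicted Y")
    plt.show()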
What are outliers and how are they detected?
An outlier is an object which deviates from the other objects in a model; it may not belong
to the same population as the majority and can therefore disturb the model.
The cause of outliers could be one or more of the following:

 Measurement error
 Wrong labeling
 Deviating products / processes
 Noise
 Extreme / interesting sample

For projection methods like PCA, PCR and PLSR, outliers can be detected using scores plots,
residuals, leverages and influence plots.
Outliers in regression
In regression, there are many ways for a sample to be classified as an outlier. It may be
outlying according to the X-variables only, or to the Y-variables only, or to both. It may also
not be an outlier for either separate set of variables, but become an outlier when one
considers the (X,Y) relationship. In the latter case, the X-Y Relation Outliers plot (only
available for PLSR) is a very powerful tool showing the (X,Y) relationship and how well the
data points fit into it.
Use of residuals to detect outliers
One can use the residuals in several ways. For instance, first use the residual variance per
sample plot to indicate samples with outlying variables, then use a variable residual plot for
a detailed study of each of these samples. In both cases, points located far from the zero
line indicate outlying samples or variables.
Use of leverages to detect outliers
The leverages are usually plotted vs. sample number. Samples showing a much larger
leverage than the rest of the samples may be outliers and may have had a strong influence
on the model, which should be avoided.
For calibration samples, it is also natural to use an influence plot. This is a plot of squared
residuals (either X or Y) vs. leverages. Samples with both large residuals and large leverage
can then be detected. These are the samples with the strongest influence on the model, and
may disturb (influence) the model towards themselves.
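As a rough illustration of how leverage and residuals combine in an influence plot, the sketch below (assuming NumPy; the 2x-mean cutoffs are arbitrary examples, not The Unscrambler®'s limits) computes sample leverages from a score matrix T using the standard projection-method formula h_i = 1/n + sum over factors a of t_ia^2 / (t_a' t_a), and flags samples that have both high leverage and large squared residuals.

    import numpy as np

    def leverages(T):
        # h_i = 1/n + sum over factors a of t_ia^2 / (t_a' t_a)
        n = T.shape[0]
        return 1.0 / n + np.sum(T ** 2 / np.sum(T ** 2, axis=0), axis=1)

    rng = np.random.default_rng(3)
    T = rng.normal(size=(30, 3))        # scores: 30 samples, 3 factors
    T[0] *= 5.0                         # make sample 0 extreme
    res2 = rng.normal(size=30) ** 2     # squared residuals (placeholder values)

    h = leverages(T)
    flagged = (h > 2 * h.mean()) & (res2 > 2 * res2.mean())
    print("high-leverage samples:", np.where(h > 2 * h.mean())[0])
    print("influential samples  :", np.where(flagged)[0])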
The features of the two plots can be combined by plotting influence and Y-residuals vs. predicted Y
together. Some example plots are shown below:
Scores plot showing a gross outlier

Y-Residual vs. Y-Predicted showing the presence of a potential Outlier


Leverage plot showing the presence of a potential outlier

3.4.11 Guidelines for calibration of spectroscopic data


The information described in this chapter so far has presented the basics of calibration. The
following steps and useful functions may be used as a guideline for the development of
spectroscopic calibration models.

Preparing data for analysis


Read data
File - Open or File - Import Data. Data can be imported from many vendor
instrument formats — directly or via e.g. JCAMP-DX, GRAMS SPC or ASCII.
See full details on compatible formats in the chapter on Importing data
View and prepare data
View data as a spreadsheet in the Editor, define sets using the Define Range option.
Select some samples and Plot - Line or Matrix to get an overview of the spectra
(data plot). Histograms of Y-variables are also useful to assess the spread of the data
for calibration. 3-D scatter plots can be used as an initial assessment of any
covariance between numerous constituents, if there are several present in the
analysis. All of these plots can be helpful in detecting outliers, or possible errors in
the data.
Note: It is advisable to aim for a boxcar distribution of Y-values, as this provides the
most even coverage of the region of interest.
Preprocess (transform the data)
Tasks - Transform… allows for spectroscopic transformations, derivatives,
smoothing, etc. Tasks - Transform - Reduce (Average) may also be useful when
replicates have been measured, or variable reduction is required. The Preview Result
option in the transform dialog provides a graphical preview of spectral data as
transform parameters are changed, presented to the user in real time.
Statistics
Tasks - Analyze - Descriptive Statistics… may be used to reveal scatter effects and
for visually detecting large changes in specific wavelength regions. Use the Scatter
option to reveal potential scatter effects before the application of transforms such
as Multiplicative Scatter Correction (MSC).
Select samples
The Edit - Mark option is useful for selecting a more balanced data set from a large
data set, based on PCA, PCR or PLSR scores. This can be applied to either the spectra or
the constituents (if more than one component is being analyzed). Mark samples that
span all the important components (samples far away from the origin, including the
extremes when selecting calibration samples). Use the Create Range option to
extract marked samples as a new row set in the project navigator.
Reduce spectra
Use the Tasks - Transform - Reduce (Average)… option to reduce spectra with high
data point density (being careful not to lose resolution) to fewer data points, or
average out replicate spectra in a data set.

Calibration and fine-tuning of models


Make a first calibration model and look for outliers
Tasks - Analyze - Partial Least Squares Regression… with more than one response
variable (Y) gives a simple overview for several constituents. Otherwise run PLSR
with a single response, or PCR or MLR, which can only handle a single response. View the
results, especially Variance plots, Scores and Predicted vs. Reference plots. Use Edit -
Mark (also available as right mouse button option) to mark suspicious samples in the
scores plots. Use Plot - Sample outliers and XY Relation outliers to investigate
potential outliers.
Refine the Model
After marking samples one can go to the analysis (e.g. PLSR) node in the project
navigator and right click to select Recalculate - Without Marked, which allows the
calculation of a new model with the marked samples removed. Compare results, and
look for additional outliers. Repeat this process if necessary.
Study the model in detail
Plot the results including Variances and RMSEP - RMSE, Important variables,
Predicted vs. reference, loadings as these are useful tools for assessing model
quality. View the regression lines and statistics in the predicted vs. reference plot, as
these are helpful for assessing the model fit. Highlight samples in scores plots by
groups using the Sample grouping available as a right mouse button option, for
investigating interesting patterns in the data. View the loadings as line plots and see
if the variables of importance coincide with the spectral regions related to the
property being measured.
Delete variables (wavelengths).
From the Important variables plot the Edit - Mark option can be used to define
ranges in the spectra that are not important (potentially due to noise). Use the
Recalculate - Without Marked option to generate a new model based on fewer
wavelengths. Apply the Uncertainty test during PLS regression to aid in the
identification of important variables for modeling.
Validation
It is essential to ensure that a developed model is properly validated using a suitable
validation method (cross validation or test set validation). Cross validation can be set
up to look at the effect of removing an entire set of replicates from an analysis or
single replicates can be removed to test the predictive ability of the model for single
replicates.

Deploying models in real world applications


Access to results
All of the models that have been created in a project are stored as analysis nodes in
that project and can be accessed from the project navigator. The Save Model option
can be accessed by right clicking on an analysis node, allowing one to save the model
as an independent file from the project. This allows just the models to be shared
with others, rather than the entire project. The models can be used in real time via The
Unscrambler® Process Pulse, and with The Unscrambler® Predictor/Classifier
(OLUP/OLUC). It is also the way The Unscrambler® Online Predictor/Classifier will
use models for online and 3rd party applications. More on this is discussed in the
Instrument Compatibility section below.
Detailed information about the model is stored in the results and validation folders under a
particular analysis node. A summary is available in the Info box in the lower left part of the
display, when the model name is highlighted.
Predict new samples
Tasks - Predict - Regression… is used to predict Y-values for new unknown samples
from spectra. If new samples have known reference values available, these can be used
in the Predict option to assess the quality of new predictions during the validation
stage of model development. The prediction also provides the uncertainty of the
measurements and additional statistics to show the similarity of the prediction
samples to the calibration samples. Reproducibility can also be assessed in terms of
samples measured on different instruments, or from different manufacturing sites,
etc., by applying a model developed on one spectrometer to spectra scanned on
another instrument. Remember to preprocess new samples in the same way as the
original calibration samples used to develop the model (which can readily be done
using Autopretreatments).
Check the robustness of calibration models
By using Tasks - Transform - Noise, various amounts of additive or multiplicative
noise can be added to new samples to see how sensitive the model is to small
changes. In the project navigator, under the Validation folder, the Prediction
Diagnostics matrix is available for regression methods. Assess the numerical values
of all results, checking that bias is close to 0 and slope is close to 1. Otherwise there
may be a need to slope and bias adjust the predicted Y-values (e.g. the spectra may
exhibit slight differences on one instrument compared to another, or there may be
systematic differences in the reference values from another laboratory). SEPcorr
provides a bias corrected SEP value, i.e. the expected prediction error in the absence
of systematic bias.
Audit Trails
The Tools-Audit Trail… option provides a non-editable record of all imports, analyses
and manipulations made to a project. It is especially useful in regulated
environments requiring compliance with 21 CFR Part 11. All saves and project entries
are also recorded in the audit trail.
When predictive models have been optimized to meet certain desirable criteria, i.e. the
predictive ability on new samples is satisfactory, these models may be used in third party or
The Unscrambler® based applications, such as The Unscrambler® Online Predictor/ Classifier
and The Unscrambler® X Process Pulse.

Instrument compatibility
Some instrument vendors (for example Perten, Brimrose, Guided Wave, Foss NIRSystems,
Thermo, etc.) make use of The Unscrambler® Online Predictor/ Classifier software available
for integration of The Unscrambler® models into third party systems. These packages are
DLL-based programs that are incorporated into the instrument software, allowing the use of
The Unscrambler® predictive or classification models on the data, providing the model
results to the instrument interface for either graphical or numerical display when a new
(spectral) measurement is made. Visit http://www.camo.com/ for more information on
these applications.
The Unscrambler® X uses the Save Model option to save predictive, or classification models
as separate files from a project. The Unscrambler® Generation X family of online software
uses these model files directly for applications. The Unscrambler® X is backward compatible
for use in previous versions of The Unscrambler® Online Predictor and Classifier (back to
version 9.2). Use the File-Export-Unscrambler option to export model files for use in these
previous versions. This option will allow users to save data or models for backward
compatibility. Contact CAMO for this plug-in option.
Some instrument software can read the B vector (regression coefficients). Use File - Export -
ASCII… or JCAMP-DX. Alternatively, use File - Export - ASCII MOD…, which is a simple file format
containing all information necessary to make predictions, either using full PLSR or PCR
models, or just the B vector. It can be used with user-defined conversion routines.
Use The Unscrambler® to develop models for instruments that do not support The
Unscrambler® Online Predictor/Classifier
If an instrument vendor's software does not support models developed in The
Unscrambler®, import the instrument data in a common format, e.g. ASCII, Excel, JCAMP,
etc., and develop a model using the powerful diagnostic and algorithmic capabilities. Use
this model to select appropriate calibration and validation samples, determine the
optimal PCs/factors to use and match the preprocessing to the options available in
the vendor software. Redevelop the model in the vendor's software and compare
the two results. This will provide added assurance that the developed model is
robust and performs as required.

Finally, note the following differences between the methods:

 The various residuals and error measures are available for each PC in PCR and PLSR,
while for MLR there is only one of each type


 There are two types of scores and loadings in PLSR, only one in PCR

3.5. Demonstration video


Watch this video to become familiar with the new user interface in The Unscrambler® X.
The video provides a guided tour of some of the basic operations in the software
application. This will show the project-based structure of The Unscrambler®, how to import,
view and analyze data. The video gives an overview of using the project navigator, which
incorporates raw data, transformed data, and all the results of analysis within a given project.
Note: This video was created using The Unscrambler® X version 10.0. The current
version of the software has a slightly different look and feel and even more
functionality.
An Internet connection and Adobe Flash Player are required to play the above video.

4. Application Framework
4.1. User interface basics
The purpose of this chapter is to give the user an overall introduction to the principles used
in The Unscrambler®. A short overview of The Unscrambler® user interface and workplace is
provided in this section, covering the various menu options, and the data organization
environment:

 Getting to know the user interface


 Matrix editor
 Project navigator

Menu walk-through:

 File
 Edit
 View
 Insert
 Plot
 Tasks
 Tools
 Help

General dialogs usage, by menu:


File

 Import data >


 Export >
 Print…

Edit

 Find and replace


 Go to…
 Change data type – Category…
 Define range…
 Group rows…
 Sample grouping…

Insert

 Data matrix…
 Duplicate matrix…
 Custom layout…

Tools


 Matrix calculator…
 Report generator…
 Audit trail…
 Options…

Help

 Modify license…
 User setup…

4.2. Getting to know the user interface


This section introduces terminology related to the user interface in The Unscrambler®. It is
assumed that the user is already familiar with the operating system of their computer.

 Application window
 Workspace
 Editor
 Viewer
 Project navigator
 Project information
 Page tab bar
 The menu bar
 The toolbar
 The status bar
 Dialogs
 Setting up the user environment
 Getting help

4.2.1 Application window


The application window layout is designed to give an overview of the work currently being
done.
The below screenshot shows the application with its menu bar, toolbar, the project
navigator and project information panes on the left, the workspace in editor mode
(there is also a viewer mode), and the page tab bar below it. The status bar at the bottom
shows a summary of the selected content and status while The Unscrambler® is calculating.
The Unscrambler® main window


4.2.2 Workspace
The Workspace occupies the largest area of the application window, containing either a
table view of a data set, called the Editor, or a Viewer which displays results either
graphically as plots or numerically as tables.
Editor
The Editor presents a data table that may or may not be modified depending on its
protection status:
If a table can be edited, it is possible to:

 Type in values.
 Change the column and row headers.
 Create ranges.

More info on organizing data.


Viewer
In the Viewer, data and results are visualized graphically in an interactive manner.
Whenever data are plotted, the plot appears in a Viewer. Every time the Viewer is
mentioned throughout this manual and help system, it refers to a window where a plot is
displayed.
The information in the viewer can come from:

 Plotting raw data from the editor: either for a data matrix or a matrix from a result.
 Displaying predefined plots.


 Custom layout.

To learn more about working in this mode, please refer to the chapter on plotting data.

4.2.3 Project navigator


The project navigator is a tree-like structure consisting of data matrices and analysis results
along with plots.
All raw and modified data sets along with different analysis results and plots can be stored
as a single project. One can toggle between different data sets and analysis results just by
selection.

4.2.4 Project information


The Project information pane, found in the lower left corner of the display has two tabs:
Info and Notes.
Info
Includes details about the currently selected item in the project navigator, such as
the matrix or model name, matrix shape, creation time and type of input,
parameters used for output matrices, plots and results.
Notes
Annotations are saved in notes.
More information about a project is found in the audit trail.

4.2.5 Page tab bar


At the bottom of the Workspace there is a list of recent views. It acts as a “breadcrumb trail”
of what has been viewed recently.
When reopening a file, only the most recently active view will be available.
By right clicking on a tab and selecting Pop out, the item becomes a separate window that
can be moved around and placed as a side-by-side view.
It is also possible to close the current tab, all other tabs or all tabs via this menu.

4.2.6 The menu bar


All operations in The Unscrambler® are performed with the help of the menus and options
available in the menus.
Available menu actions will change depending on context; Editor or Viewer mode, or the
currently selected plots. Some submenus and options may be invalid in a given context;
these are grayed out.
Context-sensitive menus
The Unscrambler® also features so-called context-sensitive menus. These can be accessed by
clicking the right mouse button while the cursor rests on the area on which an operation is
performed. The context-sensitive menus are a kind of shortcut, as they contain only the
options which are valid for the selected area, saving the user the work of having to
click through all the menus on the Menu bar.

4.2.7 The toolbar


The Toolbar buttons provide shortcuts to the most frequently used commands. When the
mouse cursor rests on a toolbar button, a short tooltip explanation of its function
appears.

4.2.8 The status bar


The Status bar at the bottom of the screen displays concise information including:

 Computations currently in progress.


 Short explanation of the current menu option.

On the right-hand side, additional information is displayed, such as

 the value of the currently selected entry, and


 the size of the data table.

4.2.9 Dialogs
The Unscrambler® aims to aid the user through dialogs, which are used to provide detailed
instructions to the application.
When working in The Unscrambler® the user will often have to enter information or make
choices in order to be able to complete an analysis. This includes activities such as specifying
the names of data matrices/files to work with, the data sets to analyze, how many PCs to
compute, or the type of validation methods to choose. This is done in dialogs, which will
normally look something like the one pictured below.
The Unscrambler® dialog


This particular dialog is the one associated with running a Principal Component Analysis on
data. Items that are predefined, such as rows/samples, columns/variables, etc. are selected
from a drop-down list. Options which are mutually exclusive are selected via radio buttons.
The settings for many of the analysis dialogs will be remembered from the last time the
dialog was open.
Any dialog can also be canceled by pressing the Esc (escape) key on the keyboard. Ongoing
calculations can also be aborted by pressing Esc.

4.2.10 Setting up the user environment


The Unscrambler® provides user authentication to offer traceability required by regulations.
See the documentation for the Login dialog for how to make use of this facility, and set up a
user.
The look and feel of the workspace can be customized. See the documentation for the Tools
– Options… dialog for more information.

4.2.11 Getting help


Documentation for currently open dialogs can be accessed by pressing F1, or by using the ?
button near the top right corner of the active dialog window.
See How to use help and the Help menu for more details.

4.3. Matrix editor basics


This is an introduction to the matrix editor.


 What is a matrix?
 Matrix structure
 Samples and variables
 Adding data matrices
 Manually
 Drag and drop from other applications
 Altering data tables
 Using ranges
 Create ranges to organize subsets
 Superimposed ranges
 Storing data as separate matrices
 Data types
 Possible data types
 Converting data types
 Keeping versions of data
 Saving data

4.3.1 What is a matrix?


A matrix is a rectangular table of numbers.
The horizontal lines in a matrix are called rows and the vertical lines are called columns. A
matrix with m rows and n columns is called an m-by-n matrix (or m×n matrix) and m and n
are called its dimensions.
The places in the matrix where the numbers are located are called entries. The entry of a
matrix A that lies in row i and column j is called the i,j entry of A. This is written
as Ai,j or aij.
Matrix structure
The matrix A with M rows and N columns is defined as A(M,N) and can be represented as
shown below.
A11 A12 A13 … A1N
A21 A22 A23 … A2N
A31 A32 A33 … A3N
… … … … …
AM1 AM2 AM3 … AMN


Matrices consisting of only one column or row are called vectors, while higher-dimensional,
e.g. three-dimensional, arrays of numbers are called tensors. Matrices can be added and
subtracted entry wise, and multiplied according to a rule corresponding to composition of
linear transformations. For more details on the operations possible using matrices, look into
the Matrix calculator.
Samples and variables
A matrix represents the values associated with samples and variables. An entry corresponds
to the value of a specific sample for a specific variable. The general way of presenting data in
a matrix is to place the samples in rows and the variables in columns.
Variable 1 Variable 2 Variable 3 … Variable N
Sample 1 A11 A12 A13 … A1N
Sample 2 A21 A22 A23 … A2N
Sample 3 A31 A32 A33 … A3N
… … … … … …
Sample M AM1 AM2 AM3 … AMN
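The sketch below (assuming NumPy; purely illustrative data) shows this layout in code: rows are samples, columns are variables, and A[i, j] addresses the i,j entry (0-based in Python).

    import numpy as np

    A = np.array([[1.2, 3.4, 5.6],
                  [2.1, 4.3, 6.5]])     # 2 samples (rows) x 3 variables (columns)

    m, n = A.shape                      # dimensions: m rows, n columns
    print(m, n)                         # 2 3
    print(A[0, 2])                      # the 1,3 entry in 1-based notation: 5.6
    print(A[1, :])                      # all variable values for sample 2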

4.3.2 Adding data matrices


To create a data table in The Unscrambler®, there are three options:

 Create a data matrix


 Create a design table
 Import data

See insert matrix dialog box for more information on how to create a blank table, fill it with
data and rename it.
Manually
Enter data manually into a matrix by simply typing while an entry is focused, double clicking
on a specific entry, or pressing F2 and entering the value. This operation can be done for the
data table as well as the sample and variable names.
Category entries have a drop-down list, allowing the user to select one of the levels already
used. A value can also be typed in, and typing a new value adds a new level.
Date-time entries have a calendar pop-out, allowing the user to pick a date from it.
Drag and drop from other applications
Data can be copied from any application, e.g. Microsoft Excel, to The Unscrambler® by either
drag and drop, or by copy and paste.
Files can also be dragged from the file manager onto The Unscrambler® application window.
The window title bar is a good drop target.

4.3.3 Altering data tables


It is possible to move focus between entries using the arrow keys. Hold shift to select a
range of entries.
Press Del to delete the contents of an entry.
Use Ctrl or Shift when clicking on row or column index numbers to select more than one row
or column: Ctrl+click will add the clicked index to the selection, while Shift+click will add all
rows and columns up to the clicked index.
Columns and rows can be moved by selecting them and grabbing the selection border. Drag
and release the mouse button on the target column or row where it will be moved.
Hold the Ctrl key while doing this to make a copy of the selected column or row.


4.3.4 Using ranges


When collecting data, one may gather information on a sample from different sources, for
example a spectrum and some chemical measurements, or some process data and some
quality measurements.
In the same way one may have several types of samples: the ones that will be used for
model calibration and the ones to be used to validate the model.
There are different options to store the data in The Unscrambler®: either collect the
information in the same data table or use different matrices within the same project.
Create ranges to organize subsets
It is often useful to create subsets of either samples or variables to make them easily
accessible from the different plotting and analysis dialogs. This is done by defining ranges. A
quick way to start is to select a part of a data table and right click to select the option Create
Range.
The created range will be displayed in the project navigator and can be renamed to allow for
easier identification later. The color box next to the range node connects the range visually
to the corresponding entries in the matrix editor.
Each subset of the matrix will be displayed separately in the matrix editor by selecting a
range in the project navigator.
More sophisticated options for working with ranges are available in the Define Range or
Scalar and Vector dialogs.
When ranges have been created in a matrix, they can be copied to another matrix of the
same dimensions. Right click on the matrix node in the project navigator and select Range -
Copy Range. The right-click option Range - Paste Range can be used to apply the same
ranges to a new matrix of the same dimensions (rows or columns).
Superimposed ranges
A region comprises a row range and a column range, thus selecting entries spanning multiple
rows and columns will result in two ranges, one for each axis.
These ranges are independent of each other and can be used in conjunction with any other
range.

The above case is typical when creating two sets of variables: X (predictors) and Y (responses),
and two sets of samples for calibration and validation.
Storing data as separate matrices
In The Unscrambler® one can use different matrices in the analysis as long as they are
compatible in size and stored in the same project.
Hence one can store data in several matrices that will appear in the project navigator as
illustrated below:


4.3.5 Data types


Possible data types
Variables (columns) can have one of four available data types:
Numerical
A numerical variable is one that has numbers as values.
Category
A category variable is one that has two or more category levels. There is no intrinsic
ordering required and no distinction between nominal (e.g. male or female) and
ordinal (e.g. high or low) categories.
It is recommended to use words to label category levels to give each level meaning,
such as “High” or “Low”.
Categories are stored as text, and each level is assigned an index. Use View – Level Indices
to display the integer value assigned to each level.
Category variables are kept out of calculations.
Text
Each value is a text string.
International characters are supported. The encoding used internally is UTF-8.
Maximum text length is 256 characters.
Text columns are kept out of calculations.
Date-time
Each entry is a date and/or time.
The displayed date format can be customized, see Tools – Options… menu.
Date-time variables are kept out of calculations.
In the matrix editor these are given colors to make it easy to identify different types of
variables.
Visualization of data types in the matrix editor

Explanation of default colors for data types


Data type Background Color Alignment
Numerical Right
Category Orange Left
Date-time Left
Text Blue Left
Missing data Gray
Selection Blue White

Converting data types


The data type of one or several variables can be changed by selecting them and using the
option Change data type in the Edit menu. Select one of the available data types from the
menu.

4.3.6 Keeping versions of data


When working with data, it is advisable to always keep the raw data unaltered; for
traceability and verification it is a requirement. Keep in mind that when a transform is applied
to a data matrix, a new matrix is created in the project, maintaining the original data matrix. At
appropriate steps in a workflow, use the option Insert – Duplicate Matrix… to take a
snapshot by replicating the matrix.
For more information see the duplicate matrix documentation.

4.3.7 Saving data


By default, all the project data, results, models and plots will be saved as a proprietary
binary format with the .unsb file name extension.
It is also possible to save just a matrix from a project, by selecting the matrix, right clicking,
and choosing Save Matrix. The given matrix is then saved as a file with extension .unsb and
can be opened as a separate project.
Other options are to use File – Export to export a selected data set in file formats that can
be opened with for instance Matlab or Microsoft Excel.
The default binary format will load and save faster, whereas the XML based format makes it
easy to create software for reading data saved by The Unscrambler®.
The Unscrambler® file formats supported:
Version File name extensions (1) Compatibility
X .unsb, .unsx (2) Read, Write
X-9.0 .AMO Write
9.8–9.7 .??M Read
9.8–9.0 .??[DLPTW] Read, Write (3)

(1) The file names are given in glob notation: “*” means any number of characters, “?”
any character, “[ABC]” any of A, B or C.
(2) Support for XML is available via a separately installed export plug-in.
(3) Available via a separately installed export plug-in.


4.4. Using the project navigator


This is a guide to the project navigator.

 About the project navigator


 Create a project
 Items in a project
 Browse a project
 Managing items in a project
 Actions common to all item types
 Actions for data table nodes
 Actions for results nodes

4.4.1 About the project navigator


The top node in the project navigator represents the project. Only a single project can
be opened at one time. The project contains all of the data for a particular analysis, any
transformed (preprocessed) data, any models developed, and predictions or classifications
performed.
Models such as PCA or PLSR, or predictions using these, have their own special node icons
for better recognition of the types of analysis that have been performed.
When a user adds column or row sets to an imported data matrix, a new subnode is
displayed. This provides the user greater visualization of the structure present in a data
matrix and allows better tracking of modifications. This data organization also creates
subsets of the data that can be chosen for analysis and/or plotting.
When a user transforms the data in an imported or generated matrix, The Unscrambler®
keeps the original data intact during transformation, and provides a new data matrix node in
the project navigator containing the transformed data.

4.4.2 Create a project


When The Unscrambler® is launched, it will display an empty project, ready to add data.
The Unscrambler® cannot have more than one project open at a time, but each project can
contain many data sets and results.
To start a new project with another project opened, use the File – New menu. A prompt will
ask if the user would like to save the current project.
The first thing to do is to get data or a model into the project. Do that by:

 Adding a data matrix.


 Creating a design matrix.
 Importing data.
 Importing models.


4.4.3 Items in a project

In the project there are three types of items:

 Matrices
 Plots
 Results: Each analysis will create a new node containing model or prediction details

The items are organized as nodes that create a tree.


Generic icons used for the project navigator nodes
Node symbol Description

Project top node

Data set

Plot

Data set range shown with its respective color

Outlier warnings list

4.4.4 Browse a project


The project navigator is a useful way to navigate, browse and access data sets, result
matrices, plots and visual presentations of results.
Note: It is possible to collapse (-) and expand (+) the folders to hide or show their
content.
To select an item click on it. It will be displayed in the workspace.

4.4.5 Managing items in a project


There are different right-click menu options available for the different item types in the
project navigator. These are described in the following.


Actions common to all item types


Plot node menu

Rename
Rename the node
Delete
Delete the node. This operation cannot be undone, so use with caution. This action
has to be confirmed in a pop-up dialog in order for the node to be deleted.
Actions for data table nodes
Data table node menu

Transform
Shortcut to all the pretreatments available in the Tasks – Transform menu.
Plot
Shortcut to all the plots available in the Plot menu.
Export
Export the data using one of the supported external data formats.
Range
The Range option allows the following actions to be performed
Define Range allows the definition of row and column ranges and special intervals in
a data set. For more information see the Define Range dialog.
Copy Range Copy the selected ranges (rows or columns) to another matrix of the
same dimensions
Paste Range Paste copied ranges into the same or another matrix of the same
dimensions
Duplicate Matrix
This will create a new copy of the data matrix in the project
navigator. It is a shortcut to the Insert - Duplicate Matrix
(Insert – Duplicate Matrix…) option.
Spectra
Define a selected columnset to hold spectral data, in order to change the default
view of certain model result plots (e.g. PLS regression coefficients plotted as line in
Regression Overview, or X-loadings plotted as line in PCA Overview).
Save Matrix


Save the selected data or result node to a new project file.


Scalar and Vector
Open the Scalar and Vector dialog in order to add scalar/vector tags to column-sets,
along with units and range information. This is useful for quality control in an online
process.
Actions for results nodes
Result node menu

Recalculate
Rebuild the model with the following changes

 With Marked… (samples or variables)


 Without Marked… (samples or/and variables)
 With Marked Downweighted… (variables only)
 With UnMarked Downweighted… (variables only)
 With New Data… (samples only)

See more details about recalculate options here


Register Pretreatment
When a model has been built using transformed data, all the transformations will be
selected for automatic pretreatment in case the model will be used for prediction of
new samples. In some cases the new data may have been pre-processed manually
before prediction. Use this dialog to define which transformations to be applied on
future prediction samples.
Hide/Show plots
Hide/Show the model folder containing the predefined result plots.
Save Model
Save the selected model in a new project file, as described here.
Set Components
Change the default number of components to use for prediction, as described here.
Set Alarms
Open the Set Alarms dialog to set warning and alarm limits for input or output data
of individual models. Can be applied in CAMO’s online engines for prediction,
projection and classification. This is useful for quality control in an online process.
Set Bias and Slope
Bias and slope correction is used as a post-processing step to achieve an offset (bias)
of 0 and slope of 1. This option will be available only for MLR, PCR and PLS
regression models.


4.5. Register pretreatment


Use this dialog to store a given set of transformations applied when building a model for
reuse in prediction.

Registered transformations will be automatically applied to input data before running a
prediction, projection or classification.
Normally, the preference is to keep all transformations applied to the training data set
selected, so that prediction data are given the exact same treatment. If not, the model may
be invalid, as input data will not be in the shape expected by the model.

4.6. Save model for prediction, classification


This option allows one to save the model (results) as a separate project (smaller file). There
are several options for the results file. Depending on what option is used, the file size can be
reduced so that it is best suited for usage in prediction and/or classification. These
models can also be used with the Unscrambler Prediction Engine, Classification Engine, and
Unscrambler Process Pulse. Select a model in the project navigator and right click to select
Save Result.


In the dialog, one has the option to save several different types of model files. These smaller
model files do not support the plots, and do not include the raw data and some of the
validation matrices that are present in the entire model. The prediction (or classification)
results that can be computed depends on the type of model that is saved.
Entire model
this saves all the results and supports all visualizations that are available when a
model is developed in The Unscrambler® X. This option also permits recalculation of
the model by keeping out any selected data. This option is available for MLR, PLS,
PCR and PCA models.
Prediction
The prediction result option saves the model in smaller files, as the model result file
does not include many of the results matrices including the validation results and
other matrices used in the prediction visualizations.
 Full with support for inlier detection: The model result file does not include the
following matrices: Y scores, Beta coefficients (weighted), Variable leverage, X
Correlation loading, Y correlation loading, Square sums, and Rotation. Three of the
validation matrices are saved in this model format: X total residuals, X value
validation residuals, and Y value validation residuals. This model can be used for
prediction, giving all the results that The Unscrambler® computes on prediction,
including the deviation.
 Full: This model results file allows one to predict new values, and get the deviation
with that value, as well as to detect outliers (based on Hotelling’s T2 and Q
residuals). With this model, inliers cannot be computed during the prediction stage.
The Hotelling’s T2 and Q residual limits and X values are computed, but not plotted
during prediction with the Full model. Compared with the entire model, this version
saves 11 of the 20 validation matrices. It does not compute the Inlier limit and the
Sample inlier distance, nor the seven matrices that are saved with the Full (with
inlier detection) prediction result.
 Short: In the short model, only the raw beta coefficients are saved, at the optimal
(or user-defined) number of components. No validation matrices are saved. With a
short prediction model, one can get the predicted results for new data, but no other
distance measure, or deviation measure. No comparison between known and
predicted values can be made when using a short prediction. A short prediction
model is not recommended if one would like to have model and/or sample
diagnostics during the prediction step.
Classification
PCA, PCR and PLS models can be saved for use only for classification. These models
cannot however then be used for regression. This result option saves the
information from the model needed to apply this model for classification. It is a
smaller file, and contains only the results and validation matrices needed to perform
classification on new samples. The saved results matrices for a PLS classification
model are: X means, X weights, X loadings, scores, and Loading weights. The PCA
classification model does not include plots. The results matrices with the PCA
classification model are: X means, X weights, X loadings, and scores. The validation
matrices saved in this model format are: X Variable Residuals, X Variable Validation
Residuals, X Sample Residuals and X Sample Validation Residuals. A model of type
classification can be used with OLUC X.
Number of components
A model will be saved with all the components that have been computed for it,
unless specified otherwise (and for a short model, which will be saved for the
optimal number of components by default). The user can specify the number of
components to save with a given model. This can be more, or less than the optimal
number of components for a given model.

4.7. Set Alarms


The user can set alarms during model development; these can be useful during prediction,
classification and projection for new samples. Two warning limits (high and low) and two
alarm limits (high and low) can be set for the available results and validation matrices
calculated from PCA, MLR, PCR and PLSR. The values entered here serve as warning and
alarm thresholds. The alarm values can be entered in standard or scientific notation.

4.7.1 Prediction:
This will be enabled only for Regression techniques (MLR, PCR and PLSR). Low and high limits
can be set for Deviation and Scores matrices, and for each of the Y responses. Only high
limits can be set for Hotelling’s T², Sample Leverage, X Sample Q-Residuals and Validation
Residuals. For Explained X Sample Validation Variance, low limits can be set.
Set Alarm States for output matrix of Prediction


4.7.2 Classification:
Only high limits can be set for X Residuals, Si/S0 and Leverage matrices that will be used for
classifying new samples for models developed from PCA, PCR and PLSR.
Set Alarm States for output matrix of Classification

4.7.3 Projection:
The Scores matrix provides the option to set low and high limits. For Hotelling’s T², Sample
Leverage and X Sample Q-Residuals matrices only high limits can be set. For Explained X
Sample Validation Variance, low limits can be set. Projection for new samples is available
only for models developed from PCA, PCR and PLSR.
Set Alarm States for output matrix of Projection

4.7.4 Input:
This feature helps the user to understand whether the inputs are from one or different sources.
If the user has already defined the columnset matrices using the Scalar and Vector dialog, those will
be listed for selection. Alternatively, the Define button will open the Scalar and Vector
dialog for defining limits for columnset matrices.
Set Alarms for input matrix


4.8. Set Components


Use this option to set the number of components for a model to a value other than the
optimal recommended number. This number of components will then be used when the
model is used for prediction and/or classification.

4.9. Set Bias and Slope


Bias and slope correction is sometimes used as a post-processing step to achieve an offset
(bias) of 0 and slope of 1. This may be useful e.g. if samples measured on a different
instrument give consistently different predictions than samples measured on the same
instrument as the calibration data. If successful, this means that the same model can be
used to predict properties of samples measured on different instruments. Caution is
required however, as any bias and slope estimation will be associated with a risk of
overfitting, and there is no guarantee that the prediction error for future samples will
improve. Despite the risks, bias and slope correction has been proven useful in some
industries such as the agricultural sector.

4.9.1 Algorithm
Bias and slope correction is performed on the predicted values Yhat by subtracting the bias
and then dividing by the slope: Yhat_corrected = (Yhat – bias)/slope
The bias and slope estimates in the above equation can be taken directly from a test set
validated Predicted vs. Reference plot, or they can be input manually by the user. Default values
when not explicitly specified are bias=0 and slope=1.
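As a minimal sketch of this formula (assuming NumPy; the function and variable names are illustrative, not part of The Unscrambler®):

    import numpy as np

    def bias_slope_correct(y_hat, bias=0.0, slope=1.0):
        """Apply Yhat_corrected = (Yhat - bias) / slope."""
        return (np.asarray(y_hat) - bias) / slope

    y_hat = np.array([10.4, 12.1, 15.0])
    # bias and slope could be read from a test set validated
    # Predicted vs. Reference plot, or entered manually
    print(bias_slope_correct(y_hat, bias=0.4, slope=0.98))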

4.9.2 Menu option


The user can set bias and slope during model development; this can be useful during prediction
for new samples. Select a Regression (PCR, PLS, MLR) model in the project navigator and right
click to select Set Bias and Slope.

4.9.3 Usage

In the dialog, the user has the option to check Apply Bias and Slope correction. When
checked, model will perform bias and slope correction during prediction based on any of the
below selected options.
 Re-calculate from Prediction data: When selected, the bias and slope correction
factors will be the offset and slope, respectively, as taken from the ‘Predicted vs.
Reference’ plots for the new prediction data. The underlying assumption is that any
differences in bias and slope between the calibration and prediction data are due to
systematic and repeatable differences between the instruments used to collect the
two data sets. If used indiscriminately this may decrease the actual prediction
performance and the option should therefore be used with caution. When selected,
reference Y data are mandatory in prediction.
 Set or apply default correction factors: With this option default correction factors
based on the calibration model are suggested. For test-set validated models these
are the validation Offset and Slope values of the ‘Predicted vs. Reference’ plot,
under the assumption that the test set data are measured on a different instrument
that is representative also for future predictions. For leverage and cross-validated
models this assumption cannot be met and the default bias and slope is therefore 0
and 1, respectively. The user is free to manually change the default values, in which
case a message will be displayed that the values have been manually edited. A Reset
button will revert the bias and slope correction factors back to the default values.

4.10. Login
Two modes of operation are available in The Unscrambler®:

 Compliance Mode- This is the recommended installation procedure for companies


that need to comply with the regulations of 21 CFR Part 11 (electronic signatures).
 Non-Compliance Mode- Recommended for users and industries that do not require
electronic signature authentication and audit trailing.

The choice of installation procedure and internal program setup determines what level of
login is required by a user. This is described further in the following sections.

4.10.1 Non-Compliance mode


When The Unscrambler® is installed in Non-Compliance mode, the first time the program is
started, the Guest login screen is displayed,
Guest Login, Non-Compliance Mode

The Guest login requires no password or definition of a user group domain, so by clicking on
Login a user is entered into the program.
In Non-Compliance mode, a user name and login password can be setup from the Help -
User Setup menu.
If a user name and password have been set up, when a user attempts to login to the
program, a dialog similar to the one shown below is provided,
Login with defined User Name and Password, Non-Compliance mode


In this case a user called User 1 was set up. This time, a password is required to enter the
software. If a user forgets their password, the Forgot? option should be selected. This is
described further in the next section.
Password reminders
It is possible to click Forgot? next to the password entry for a password reminder question
that is configured during user setup.
Password recovery dialog

In this dialog, a user is required to enter the correct answer to the security question and is
then required to enter a new password (with confirmation).
If the wrong answer to the question is entered, the following warning will be provided,

Solution - Enter the correct answer to the security question to proceed.


If the new password has not been entered the same way in the confirmation box, the
following warning will be provided,
Incorrect password confirmation warning


Solution - Be sure to enter the new password twice correctly.

4.10.2 Compliance Mode


When The Unscrambler® is installed in Compliance mode, it uses the Windows
Authentication details of the user logged into the computer that is being used for the
analysis. There are two options available during the installation and setup of the program,

 Set up compliance mode with Login dialog shown each time the program is started
 Set up compliance mode with a hidden Login dialog

System enforced login


When the installation is performed such that a user is required to login to The
Unscrambler®, a dialog similar to the one shown below is provided.
Windows Authentication login

The user's Windows name is shown in the login screen. To enter the program, the user must
enter their Windows password.
Automatic entry
When the program is installed in Compliance mode with the Hide login screen option
chosen, a user starting The Unscrambler® is automatically logged into the
program and the Windows authentication details are used in the Audit Trail.
This authentication method takes advantage of centralized user management features used
in regulated network configurations, instead of redefining the user names.


For more information on how The Unscrambler® security features help a company to comply
with the requirements of 21 CFR Part 11, please have a look at the Statement of compliance.

4.11. File
4.11.1 File menu

File – New

or Ctrl+N
This option is used to create a new project.
A new, blank workspace is created with a single node entry in the project navigator named
“New Project”.
See organizing data to get started adding data to a project.

File – Open…

or Ctrl+O
This option opens an existing project, using a regular file selector dialog.

File – Close

or Ctrl+W
This option closes the current project file. If changes to the project have not been saved, The
Unscrambler® prompts the user to save the project before closing it.

File – Import Data

This option allows the import of data from an external data file. This may be data from
another project file, an earlier version of The Unscrambler® or one with a different format,
e.g. Excel, ASCII, or data files from instrument formats.
For more information see the importing data documentation.

File – Save

or Ctrl+S
Saves the currently open project file.

File – Save As…


Save the current project in a new location or with a different file name.
The Unscrambler® will save projects using a proprietary binary format with the .unsb file
name extension.

File - Save Matrix/Model


Depending on whether a user is in the Editor or Viewer mode, an option to save the matrix
or the model to a location separate from the project is available.


File – Export

This is a menu option which allows one to export all or selected parts of a data matrix to an
external file, in one of the available export formats.
For more information see the exporting data documentation.

File – Print…

or Ctrl+P
This will open the Print dialog, where the user selects settings to print the current document
to a printer or file.
For more information see the print dialog documentation.

File – Security
The Security function contains two options, Protect and Sign.
Protect

This command enables a user to protect a project with a password. Whenever this project is
accessed, the user will need to provide the password to open it. A project file can also be
Unprotected by using the command File-Unprotect, and entering the correct password.
Note: The password must be remembered! If it is lost, the project cannot be opened again.
Sign

For a more detailed description on how The Unscrambler® implements Digital Signatures,
click here
The Security feature is part of the overall data integrity and compliance capabilities of the
software, which also includes Windows Authentication and Audit Trails.
For more details on how The Unscrambler® meets the requirements of digital and electronic
signatures, please refer to the section on Data Integrity and Compliance

File – Recent

The list of recently opened projects is displayed. One can toggle different projects upon
selection.

File – Exit
This allows one to quit The Unscrambler®. If any project files have been changed since the
project was last saved, there is a prompt asking if changes are to be saved.

4.11.2 File – Print…


This will send the currently viewed plot or data table to a printer.


Plots are scaled to fit within the margins set for the designated paper size and will retain the
same aspect ratio as is seen on the screen.
Data tables will normally print with 50 rows and 6 columns per page, depending on the
numeric format and font settings. Row and variable names and numbers will be included on
each page.
Print options from The Unscrambler® work as in any Windows application, where the user
selects printer, paper size, orientation, margins, etc.:

What can be printed


One may print either the current plot, or all plots. Select Current Plot to print out only the
currently active plot on screen; select All Plots to print out all plots currently shown on
screen.
In the field Print range designate what to print by selecting the appropriate radio button.
The print range applies to the current window in the Workspace. Use Selection if a range in
the current window has been selected to print.
Note: There must be a file open (in the Editor or the Viewer) to have access to this
option.

Printing several plots


The Print dialog for plots offers the possibility to print either the Current plot, or All Plots.
Select the printer to use from the Printer drop-down list.
The properties of the printer can be viewed by pressing Properties. See the operating
system documentation or printer manual for information on setting up the printer.
Information can be printed to a file by clicking on the Print to file box.


Print preview
It is a good idea to preview a document before sending it to the printer. Print preview
provides a look at how the pages will look when they have been printed. The option is only
available if a file is currently open.

4.12. Edit
4.12.1 Edit menu
The Edit menu has three different modes, and the displayed options depend on which part
of the application window is active at any given time. There are separate modes for the
workspace editor and viewer as well as for the project navigator. Some menu items are
common for two or three modes.

 Common actions
 Edit – Undo
 Edit – Redo
 Edit – Cut
 Edit – Copy
 Edit – Paste
 Edit – Delete
 Navigator mode
 Edit – Rename
 Edit – Spectra
 Editor mode
 Edit – Copy with Headers
 Edit - Insert Copied Cells
 Edit - Append Copied Cells
 Edit - Reverse
 Edit - Convert
 Edit - Fill
 Edit – Find and Replace
 Edit – Go To…
 Edit – Select
 Edit – Sort
 Edit – Append
 Row(s)/Column(s)…
 Category Variable…
 Edit – Insert
 Row(s)/Column(s)…
 Category Variable…
 Edit – Split Text/Category Variable
 Edit – Change Data Type
 Edit – Scalar and Vector
 Edit – Define Range…
 Edit – Group rows…
 Edit – Make header
 Edit – Add Header
 Edit - Category Property


 Viewer mode
 Edit - Add Data
 Edit - Create Range
 Edit - Sample Grouping
 Edit - Copy all
 Edit – Draw
 Edit – Mark

The workspace editor Edit menu mode is activated by clicking anywhere in a data table.
The workspace editor Edit menu

The workspace viewer Edit menu mode is activated by clicking in a plot. The same menu will
be shown irrespective of whether it is a raw data plot or a model results plot, however some
menu items will be grayed out when not applicable to specific plots.
The workspace viewer Edit menu


The project navigator Edit menu is the simplest of the three.


The project navigator Edit menu

Common actions

Edit – Undo

or Ctrl+Z
This option reverses the last operation(s) performed on the data in the editor. This can be
used to Undo up to the last 10 operations. The size of the undo stack can be increased, see
Tools – Options… menu.
The following operations can be reversed with the undo operation:

 Cut, paste action in entry


 Cut, paste action with column, row, headers
 Change data type for column and headers
 Delete data action for entry (including headers)
 Delete row/column/headers action
 Drag and drop of entry/column/row/headers
 Move row, or column


 Move row to column headers


 Move column to row headers

Edit – Redo

or Ctrl+Y
It is possible to recover the result of editing operations that have just been undone with the help of the Redo command.
A selection can be recovered from the clipboard using the Paste command or Ctrl+V.

Edit – Cut

or Ctrl+X
This option removes the selected range, either data in the Editor or a plot in the Viewer, and
places it on the clipboard. Anything placed on the clipboard remains there until it is replaced
with a new item. Use the Paste command to copy the selection to a new location.

Edit – Copy

or Ctrl+C
With this option one can copy the selected range to the clipboard, overwriting its previous
contents. The selected range is not removed from its original place. Use the Paste command
to copy the selection to a new location.

Edit – Paste

or Ctrl+V
This command enables one to insert a copy of the clipboard contents at the insertion point. The command is not available if the clipboard is empty or the selected range cannot be replaced.

Edit – Delete

, Ctrl+D or Del
This option enables one to delete columns or rows. One can select one or more columns/variables or rows/samples, and delete the selected section(s).
Any previously-defined sets are adjusted for the deleted range.
Navigator mode

Edit – Rename
Rename the currently selected matrix.

Edit – Spectra
Ranges can be defined as being spectra; once this setting is ticked for a given range, loadings plots for these data ranges will display as line plots rather than 2D scatter plots.


Editor mode

Edit – Copy with Headers

or Ctrl+Shift+C
With this option one can copy the selected range to the clipboard, overwriting its previous
contents. The selected range is not removed from its original place. Use the Paste command
to copy the selection to a new location.

Edit - Insert Copied Cells


Inserts copied rows or columns at the selected position in the matrix.

Edit - Append Copied Cells


Appends copied rows or columns to the end of a data matrix.

Edit - Reverse
With this option one can reverse the sample order and/or variable order in a selected
matrix. For more information see the reverse documentation.

Edit - Convert
This command allows one to convert the units of a column headers for spectral data from
wavelength in nanometers (nm) to wavenumber (cm-1) and vice versa. This function is
active when the the column header of a matrix is selected.
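The two units are related reciprocally: since 1 cm = 10^7 nm, a wavelength of x nm corresponds to a wavenumber of 10^7/x cm-1. A minimal sketch of the arithmetic, for illustration only (not part of the software):

    # Wavelength (nm) <-> wavenumber (cm-1): both directions use 1e7 / x,
    # because 1 cm = 10^7 nm.
    def nm_to_wavenumber(nm):
        return 1e7 / nm

    def wavenumber_to_nm(cm1):
        return 1e7 / cm1

    print(nm_to_wavenumber(2500.0))  # 4000.0 cm-1
    print(wavenumber_to_nm(4000.0))  # 2500.0 nm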

Edit - Fill
This command allows a user to fill a highlighted row or column range with either numeric or
categorical data.
For more details see the Fill section.

Edit – Find and Replace


Ctrl+H
This command allows one to find entries containing a given value or sequence of characters, and replace the selected value with a new one. The Find search mode can be set to Text, Number or Date Time using the drop-down list. For more information see the find and replace dialog documentation.

Edit – Go To…
Allows the user to move focus to a specific entry in the data table.
For more information see the go to dialog documentation.

Edit – Select
Edit – Select has the following options:
Select Rows
To select the respective samples.


Select Columns
To select the respective variables.
Select Range
To select a range of samples and variables.
Select All (Ctrl+A)
To select the entire matrix.
In the first three cases, the user is asked to enter a range to select. It uses the same syntax as
the Define range dialog, e.g. 1,3-5,8-20.
Note: The Unscrambler® always works with either rows or columns. This also
applies when the whole matrix is selected. Look at the cursor shape or the
rows/columns numbers to see whether the selection is for a row or column mode.
Sample names will also be selected when operating on rows, and column headers
when operating on columns.
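Purely as an illustration of the range syntax, a hypothetical helper (not part of the software) that expands such a string into 1-based indices could look like this:

    # Hypothetical parser for range strings such as "1,3-5,8-20".
    def parse_range(spec):
        indices = []
        for part in spec.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                indices.extend(range(int(lo), int(hi) + 1))
            else:
                indices.append(int(part))
        return indices

    print(parse_range("1,3-5,8-20"))
    # [1, 3, 4, 5, 8, 9, 10, ..., 20]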

Edit – Sort
Sort samples according to their numerical values for the selected variable.
Sort has two options: Ascending and Descending.
Select one or more columns to sort. Headers can also be selected and used as sort keys.
This method uses the quick sort algorithm, which performs an unstable sort; that is,
if two elements are equal, their order might not be preserved. In contrast, a stable
sort preserves the order of elements that are equal.
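The distinction can be seen in NumPy, which exposes both kinds of sort; this is only to illustrate the concept, not how the software itself sorts:

    import numpy as np

    keys = np.array([2, 1, 2, 1])             # sort key containing ties
    # 'quicksort' gives no guarantee for the order of equal keys;
    # 'stable' keeps equal keys in their original row order.
    print(np.argsort(keys, kind="quicksort"))
    print(np.argsort(keys, kind="stable"))    # [1 3 0 2]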

Edit – Append
Row(s)/Column(s)…
This option can be used to append rows or columns, depending on which entries are selected in the data table.
A dialog is displayed allowing the user to enter the number of rows (or columns) that are to be appended at the end of the existing data matrix.
See Edit – Insert – Row(s)/Column(s)… below for details.
Category Variable…
Append a new category variable (column).
Details on how to specify a category variable can be found here.

Edit – Insert
Row(s)/Column(s)…
Insert new rows or columns.
Select a row or a column to insert either one or more rows or columns, respectively.
A dialog will pop-up to ask how many rows or columns to insert:


This command is also available by right click.


Category Variable…
Insert a new category variable (column).
Details on how to specify a category variable can be found here.

Edit – Split Text/Category Variable


Text: Converts a text variable into multiple new text or category variables as needed.
Category: Creates one new column for each level, with binary values (true/false). These will be inserted to the left of the selected column.

Edit – Change Data Type


One can change the data type of one or several variables by selecting them and using the
option Change Data Type in the Edit menu. The available data types are:

 Text
 Numeric
 Date-time
 Category

This command is also available by right click.

Edit – Scalar and Vector


This item opens a dialog where units can be assigned to previously defined or new column
ranges. Each column range can also be defined as a scalar (e.g. single process variable) or
vector (e.g. spectrum).
For more information see the Scalar and Vector documentation.

Edit – Define Range…

or Ctrl+E
Create and edit ranges for easy access to often-used selections.
For more information see the define range dialog documentation.

Edit – Group rows…


Create row ranges based on a category variable or a variable split linearly into value ranges.
For more information see the add row range from column dialog documentation.

Edit – Make header


Convert the selected column or row to a header.
This action can also be invoked by right clicking on a row or column number.
The existing row or column will be removed as a result of making it a header, and a header cannot be converted back to data.


Edit – Add Header


Insert an extra header.
A row or column header must be selected to add either a new row or column header,
respectively. Choose to insert the row header above or below, or the column header to the
left or right.
There can be up to five column and row headers.

Edit - Category Property

This option allows one to change the properties of category variables; more details can be found in the Property dialog documentation.
Viewer mode

Edit - Add Data


To be able to add data to an existing plot it is necessary to select Edit – Add Data….
The following dialog box opens.
Add Data… dialog box

It is necessary to locate the second set of data.


Matrix
Use the drop-down list if the data are in a data matrix and use the select result
matrix button if the data are in an analysis result.
Rows and Cols
Use the drop-down list if the subset is already defined and use the Define button if it
has to be defined.

Edit - Create Range


Once some samples / variables are selected in a plot it is possible to create a new range
including them. This can be done using the Edit - Create Range option or by right clicking on
the plot with the selected items and selecting the option Create Range.
The new range appears under the matrix that was plotted as a new row or column set.


Edit - Sample Grouping


For more information see the Sample grouping dialog documentation.

Edit - Copy all


This action will copy all plots in the current viewer to the clipboard and make it available for
pasting into documents, etc.

Edit – Draw
This option allows a user to add a drawing object to the plot. It is possible to draw with five different types of objects: line, arrow, rectangle, ellipse or text. This option can also be accessed by right clicking while in a plot and selecting Insert Draw Item.
For more information see the plot annotation documentation.

Edit – Mark
Mark objects (samples or variables) to bring focus to them in plots and interpretation. There
are options for automatic sample or variable selection based on modeled data, or for
manual marking using the one by one, rectangle or lasso tools.
The submenu for marking objects

For more information see the marking in plots documentation.


A typical use of this command is to mark extreme samples in a score plot in order to
investigate the behavior of those samples on other plots. Another is to mark ranges of the
spectra in the Important variables plot, to make a new model based on only important
wavelengths.
Note: If the Viewer contains more than one plot, marking is only possible from the
currently active subframe. For instance, if the currently active subframe contains a
scores plot, only samples can be selected. In order to mark variables, one must click
on the subframe containing a variable plot in order to mark any variables.
Once objects have been marked, they appear marked in all current and future plots, until
they are unmarked or when the Viewer is closed.

4.12.2 Edit – Change data type – Category…


Access the category converter
The Category converter is accessible from two menus:


Edit – Change data type – Category…


Select a variable. Go to the menu Edit and select the option Change Data Type and
from the four choices select Category….
Menu Edit – Change Data Type – Category…

Right click
Select a variable. Right click. Select the menu Change Data Type – Category….
Right click access to the Category Converter


Use the category converter


There are two ways of creating levels for category variables:

 Use individual values


 Use ranges of values

Convert to category dialog


New levels based upon individual values


If there are already values in the selected variable, each of them will be defined as a level. Click on OK if this corresponds to what is needed.
The variable background changes color to differentiate it from the numerical variables.
It is possible to add new values for new samples or to select one of the available ones by
using the drop-down list.
Choices of levels in the drop-down list

New levels based upon ranges of values


If the variable to be converted into a category variable is a continuous variable, it is
recommended to use ranges of values.
To do so select the second option available in the Category Converter: New levels based
upon ranges of values.


New levels based upon ranges of values

The preselected variable is in the field Select Variable. If the variable to be used is a different one, select it using the drop-down list.
The field Value based on selected Variable gives information on the selected variable, such as:

 The number of different values,


 The minimal and maximal values.

This information is displayed to guide the choice of the number of levels and the definition of the intermediate ranges.
Select the number of levels using the associated box.
Decide the method to be used to define the ranges from the two following options:
Divide total range of variation into intervals of equal width
If this is the selected option the ranges will be automatically defined when changing
the number of levels.
Specify each range manually
Double-click on the entry to define the ranges.


Note: It is not possible to have overlapping ranges. An error message will appear if the entered value is not correct.
When done, click on OK.
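Dividing a continuous variable into levels of equal width corresponds to standard equal-width binning. A sketch of the equivalent operation in pandas, shown for illustration only with dummy values:

    import pandas as pd

    values = pd.Series([1.2, 3.7, 5.1, 6.8, 9.9])
    # Divide the total range of variation into 3 intervals of equal width
    # and use the resulting intervals as category levels.
    levels = pd.cut(values, bins=3)
    print(levels.value_counts(sort=False))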

4.12.3 Edit – Category Property…


This option allows one to change the properties of category variables that have already been
defined. The name of the category column, as well as the name for any given category can
be changed. The order of categories can be changed, categories can be added, and already
defined categories can be deleted.

This is also available as a right click option. Highlight a column and right click; the following options will be displayed:


4.12.4 Edit – Fill


This option allows a user to select specified row or column ranges and fill them with either a
constant number for numerical columns, or text if the row or column is defined as text.
This option also allows selected rows to be filled with pre-defined categorical variables.
The dialog box for the Fill option is provided below.

To fill a column/row with a specified value, either highlight the entire row/column or select a
sub-section using the mouse and select Edit - Fill. Enter the specified value (or text) in the
Value box and click on OK. The selected region will be filled with this value.
Note: A block of rows and columns can also be selected using this option.
To fill rows/columns with a category variable, first define the categories using Edit - Change
Data Type - Category. Then select specified cells and use the Edit - Fill option, this time
selecting the desired category from the Level drop-down list. Click on OK and the cells will be
filled with this new category.


The Fill option is also available as a right click option from the Editor.

4.12.5 Edit – Find and Replace


This command allows a user to find entries containing a given numerical value or word, and
replace the selected value with a new one.
There are three search modes: text, number and date-time.
Edit – Find and Replace (Ctrl+F, or Ctrl+H) launches the Replace pane, where one can specify a value to search for, launch the search, and optionally define a replacement value and perform the replacement. When replacing a category value with a value that is not yet defined as a level, a warning will be displayed before a new category level is created.
Find and Replace:


Find option
By selecting the Options button, one is presented with Find Option choices which enable one to match case, replace entire entry contents, apply specific search criteria, and search in the indicated direction in the data matrix.

How to find a number, text string, date/time and category

 Select search type Numeric, Text or Date time from the Search mode drop-down
list.
 Type a word, a number, or a date to search for in the Find what field.
 Or tick Range to search within numeric or date limits. This option works only for Numeric and Date time variables.
 For replacing category values, select the variable and use the Find and Replace option.

Text mode will match category variables. A category level labeled "200" is still a text string. It is recommended to use words such as "High" or "Low" to label category levels, both to avoid confusion and to give each level meaning.
Click the Find Next button to locate a cell with the chosen value or sequence of characters.
If the search is successful, the entry is marked in the editor with a black frame (or a white
frame if the search is occurring in a selected area). If no match is found, the cursor does not
move from its original place.

Advanced search options


In addition, one can make a more specific search by clicking Options which will expand the
dialog with additional search parameters:
Match case
Make search case sensitive.


Replace entire cell contents


Find only entries which have the requested sequence of digits or characters as exact
contents.
Search criteria
Specify how text is matched.
Choose Contains, Equal, Starts with, or Ends with from the drop-down list.
Search direction
Set search order to traverse horizontally first (by row), or vertically first (by column).
Restricted to selection
Base search on preselected data only.

How to replace a value with another


Once a value has been specified in the Find what field, proceed with the replacement.
In the Replace with field, type in the new value or sequence of characters. Any combination
of digits and characters is allowed, e.g. A51-02.b.DSF24%. However, if the requested value
is not compatible with the current type of entry (e.g. “A51” in a numeric entry), an error
message will be displayed and no replacement will be made.
If the Find what value has already been located with the Find Next button, hit the Replace
button to replace the value in the current entry. In order to make the replacement in all
entries containing the Find what value, hit the Replace All button.

How to undo replace


The Undo button is available once a replacement has been performed. Clicking it reverses
the last replacement made.
If the Find and Replace dialog has already closed, use the Edit – Undo command (Ctrl+Z) to
revert the change.

4.12.6 Edit – Go To…


Use Edit – Go To… to move focus to a given data matrix location. This function is active
when the cursor is in an active matrix window.


Enter the desired destination row and column numbers.

Result after:

This function allows one to quickly move around to specific entries in a data matrix.

4.12.7 Edit – Insert – Category Variable…


This tool will insert a new column with a category variable, either by manually entering
levels, or deducing true/false levels based on one or more non-overlapping row sets.
Create category variable: Specify levels manually


Create category variable based on a row set

The resulting category column can look like this:


4.12.8 Edit – Define Range…

or Ctrl+E
Ranges define specific parts of the data table on which to perform analyses. When a set of
columns is defined, this is called a Column range and usually defines a specific set of
variables. These variable sets may define a single independent (X-data) range for methods
like PCA or two sets such as the X-data and the dependent Y-data for methods such as PLSR.
When a set of rows (or samples) is defined, this is known as a Row range and these are
useful when defining training and validation sets for any analysis method in The
Unscrambler®.
Combinations of row and column sets together define specific data regions to be used for
analysis purposes and the preparation of data can be performed using the Define Range
option.
Get information on:

 Accessing Define Range


 Define range dialog
 Create range from data editor
 Create range from scores plots
 Automatic keep outs

Accessing Define Range


The Define Range dialog can be accessed from:
Menu Edit – Define Range…


If a new range has to be defined during an analysis setup, most of the plotting and analysis dialogs in The Unscrambler® have the Define button available. An example from the PCR dialog is shown below.
Define buttons in the PCR dialog


The Define button is shown as follows


By selecting this option from either the Edit menu or from an analysis dialog, the Define
range dialog box described in the next section will appear.
Define range dialog
Dialog
The Define Range dialog is a multi-task, interactive window for easily defining specific row
and column sets prior to analysis.
Define range dialog


Tip: The F5 key toggles focus between viewer and editor.

Dialog Usage
Functions
The dialog box contains the following functions for easily defining sets within a selected data
table.
Row and Column Ranges
This section provides two lists of the available row and column sets available in a
table. To add a new row/column set, either interactively select the sets using the
data viewer with a mouse, or manually enter specific ranges into the text dialog
boxes. For example, if a new row set called training is to be defined, covering rows 1-10 of the current table, the dialog for Row ranges should be set up as follows:

To add the new row set to the list, click on the Create button. Use a similar procedure for
defining new column sets.


Updating an existing row or column set


If modifications have to be performed to an existing row or column set, simply
highlight the set from those available in the list, make the modifications using either
an interactive or manual change and click on the Update button. The set definition
will be updated accordingly in the list.
Inverting a selection
In some applications, the definition of training and test sets is an important step in
multivariate analysis. If a training set has been defined and the test set is to be
defined as the rest of the samples not in the training set, click on the Invert Selection button, and the reverse of the current selection will be selected. To
add the inverted selection to the list, provide the row or column set with a unique
name and click on Create. This will define a training and test set which is particularly
useful when using Test Matrix Validation.
Range deletion
To remove existing row or column sets from a list, simply highlight the sets and click on the Delete Range button.
Using all of the actions described above, when the OK button is selected to apply the
changes, all of the defined ranges (or deletions) will be shown in the data matrix node in the
project navigator.
Keep out
Use this option to define samples or variables to be kept out in the analysis from the
defined range(s).
Variables and samples satisfying given conditions are automatically added to these
lists. For more information on how this works see below.
Special intervals
The special intervals option can be selected for performing predefined actions on a data table when defining row or column sets. To access this functionality, click on the Special Intervals button.

This will open an expanded options section as shown below,


The functions in this section are described below.


Interval
Insert regularly spaced row or column indices using the drop-down list “Samples”
and “Variables” values. There are two parameters to enter:

 The frequency: the Every field refers to the frequency of sampling.
 The starting sample: set in the Starting from spin box.

Use this option to define evenly spaced calibration (or validation) samples and use the Invert
function described above to easily define such sets.
Random
Insert random row or column indices using the drop-down list “Samples” and “Variables” values and indicating the number to define in the manual entry box.
Category
Insert row indices based on a category variable. Select the category variable in the
drop-down list.
When the appropriate ranges have been selected, click OK to apply the changes.
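As an illustration of what the Interval and Random options produce, the corresponding 1-based sample indices could be computed as follows (dummy values; not part of the software):

    import numpy as np

    n = 20                                    # samples in the table
    # Interval: every 3rd sample, starting from sample 2
    interval = list(range(2, n + 1, 3))       # [2, 5, 8, 11, 14, 17, 20]
    # Random: draw 5 distinct sample indices
    rng = np.random.default_rng(0)
    random_pick = sorted(rng.choice(np.arange(1, n + 1), size=5, replace=False))
    print(interval, random_pick)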
Create range from data editor
Ranges can be created directly within the data set editor: Begin by selecting the part of the
table that will be included in the range and right click to select the option Create Range,
Create Row Range or Create Column Range as appropriate.
Create Row Range


Create range from scores plots


Sample sets can be created directly from the PCA/PCR/PLSR scores plots as well. Select some
samples using any of the Edit - Mark options and then right-click Create Range. In the dialog
that opens there is an option to use either the marked or unmarked samples (or both). The
selected samples will be added to a new or existing matrix in the project navigator.
See extract samples documentation for details.
Automatic keep outs
Variables and samples not applicable in calculations are automatically added to the lists of
Keep outs. Entries are excluded based on the following (method dependent) criteria:

 Samples with missing values¹.


 Columns with category, text or date-time variables.
 Entire columns or rows with constant values.
 Columns where all values are missing.

Keep out warning dialog


When working with data selectors that have keep-out samples/variables, a warning will be displayed allowing the user either to accept and proceed with the keep outs or to cancel the action. The Details option will display the list of keep outs.
To keep track of row and column exclusions, the data selectors provide a warning to users that exclusions have been defined. Click on the More details link to see what has been excluded.
More details

Automatic keep outs can only be removed manually. This means that in cases where a
category variable has been converted to a numeric column, or missing entries have been
filled in, the keep out lists must be edited to include given entries in further analyses.

¹ With the exception of NIPALS-based methods.


4.12.9 Edit – Reverse…


The order of samples and variables in the data matrix can be reversed by choosing the Edit -
Reverse option from the menu when the cursor is in a data matrix.
The Reverse option menu is shown below

4.12.10 Edit – Group rows…


Select a variable to be used for the definition of row ranges. This variable can be:

 Either a category variable


 Or a numeric variable.

Then access the option Group Rows from the menu Edit. A dialog box will open.
Add row ranges on a category variable
When the variable selected is a category variable, all levels will be used to define new ranges. Therefore the Number of groups field is disabled.
Add row ranges dialog from category variable

When clicking OK, new row ranges are defined being named in the same way as the levels.


Add row ranges on a numeric variable


When the variable selected is a numeric variable, the Number of groups has to be specified. The total range is divided linearly into intervals of equal width.
Add row ranges dialog from numeric variable

When clicking OK, new row ranges are defined being named range1, range2, etc.

4.12.11 Edit – Sample grouping…


The menu option Edit – Sample grouping… can be used to group samples in a plot. This can
also be accessed in any plot by a right mouse click.
This feature is available in the following general plots:

 2D or 3D scatter plots (including score plots)


 Line plots
 Bar plots

When clicking on the menu Edit – Sample grouping…, the dialog box Sample grouping &
marking opens.
Select the matrix to use for sample grouping in the Data frame. All available row sets will
appear in the dialog. They can be selected and moved to Marker settings by using the
arrows. The sample grouping will be based on the groups added to this box. Clear the
available row sets using the Clear button.
Alternatively the user can select a single column from the matrix to use for sample grouping.
If the selected column is a category variable, click Create Row Sets in order to make each
category level available for grouping. If the selected column is of numeric data type, Create
Row Sets will split the samples into a number of equally spaced ranges defined by the
Number of groups box. When created in this dialog, the ranges are created temporarily for
marking the samples. These ranges are not added to the data table in the project navigator.
To delete a selected group from Marker settings, mark the group and use the Remove
button. Alternatively use the Clear All button to remove all defined groups.
The user has the option to separate samples based on colors, symbols or both, and the group name can optionally be used as point labels. Use the Apply button to preview the plot settings, or click OK to apply the settings and close the dialog.
The user also has the option to label the samples by pre-defined values that may be available in a particular column of a data sheet. The appropriate matrix and the corresponding column need to be selected using the Data for labeling matrix. This will be enabled only when Value is selected from the Label option.


Sample grouping and marking dialog

4.12.12 Scalar and Vector


The Scalar and Vector dialog box allows the user to define additional properties of data. Data may be acquired from different sources, and these properties help identify the data during online processing.
Scalar and Vector Dialog


In the above dialog, the user can perform the following:

 Define new column sets and their properties


 A single variable column range is defined as a Scalar and the Units, Min and Max
values can be specified. For example a scalar Temperature can be specified within an
allowed range of 25 to 35 degrees Celsius by setting Units=C, Min=25 and Max=35
 A multi-variable column range is referred to as a Vector. This is usually a spectrum where the Start and End wavelength can be defined. For instance an NIR absorbance spectrum can have Units=nm and Start and End wavelengths of 1100 and 2500, respectively.
 The Min/Max values are disabled for Vectors and Start/End values are disabled for Scalars.

4.12.13 Split Text Variable


This is a text parser function that takes any text variable or row header and splits it into multiple text or category variables as desired. This function can be accessed from Edit – Split Text Variable or from the right-click menu after selecting a row header or variable of type ‘text’.
The split text function works with two options: separator and character position.
Separator:
This feature is similar to ASCII import, accommodating the commonly used separator types comma, space and semicolon, as well as custom values. Double quotes and consecutive separators can be handled efficiently.
Split by separator dialog


Character position:
This feature splits text variables into new variables based on the position of the characters only. The first split value indicates the character position at which the first split is made, and likewise for the second split. The default value for the first split is 0 and for the second split is 6.
Split by character position
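The two modes correspond to ordinary string splitting and slicing. A sketch with a hypothetical sample identifier, for illustration only:

    # Hypothetical identifier "ABC-2021-r1"
    s = "ABC-2021-r1"
    # By separator:
    print(s.split("-"))            # ['ABC', '2021', 'r1']
    # By character position (splits at positions 3 and 8):
    print([s[:3], s[3:8], s[8:]])  # ['ABC', '-2021', '-r1']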


Output options:
The following output options are available.

 To retain only one or a few of the new variables after the split, the range of columns can be defined numerically in ‘Insert Columns’ using commas and dashes. The selection can also be set using the mouse in the preview window.
 The output variables can either be converted to category type using the option ‘Convert to category’, or appended as text to the existing row headers using the option ‘Add headers’.

4.13. View
4.13.1 View menu
The View menu has two different modes, and the displayed options depend on which part
of the application window is active at any given time. There are separate modes for the
workspace editor and viewer.

 Editor mode


View – Navigator
View – Info
View – Level Indices
 Viewer mode
 View – Graphical
 View – Numerical
 View – Auto Scale
 View – Frame Scale
 View – Zoom In
 View – Zoom Out
 View – Legend
 View – Properties
 View – Full Screen
 Context dependent plot indicator lines
 View – Trend Lines – Target Line
 View – Trend Lines – Regression Line
 View – Uncertainty Limit

The workspace editor View menu mode is activated by clicking anywhere in a data table.
The workspace editor View menu

The workspace viewer View menu mode is activated by clicking in a plot. The same menu
will be shown irrespective of whether it is a raw data plot or a model results plot, however
some menu items will be grayed out when not applicable to specific plots.
The workspace viewer View menu

Editor mode

View – Navigator
Toggle project navigator pane on/off.

View – Info
Toggle information pane on/off.

View – Level Indices


Available when a data set has category variables. Toggle category variable view as level
integers on/off.
Viewer mode

View – Graphical
This lets the user view the selected data of a Viewer in a graphical mode. This is the default
view for The Unscrambler®.


View – Numerical
Through this option a user may display results plotted in a Viewer as a numerical table. One
can copy that data table to the Clipboard and paste it into an Editor.
Restore the plot using View – Graphical

View – Auto Scale

This option scales the plot so that all data points are shown within the Viewer window. This
command is useful after using Add Plot and Scaling.

View – Frame Scale

This option scales the plot in a selected frame. One can change the plot by scaling its axes to
fit the desired range. Select the desired area to zoom in a frame.
Use Auto Scale to display the plot as it was originally.

View – Zoom In

This option changes the plot scaling upwards in discrete steps, allowing one to view a
smaller part of the original plot at a larger scale. This can also be done by using the + key on
the graph.

View – Zoom Out

This option scales the plot down by zooming out on the middle of the plot, so that more of
the plot becomes evident, but at a smaller scale. This can also be done by using the - key on
the graph.

View – Legend

This option allows the user to add a legend to an existing plot.

View – Properties

This opens a dialog where a user can customize a plot. Here one can change plot
appearance, such as grid, axes, titles, fonts and colors.
See the formatting of plots documentation.

View – Full Screen

Make the plot fill the whole screen. Press Esc on the keyboard or right click to leave the full
screen mode.


Context dependent plot indicator lines


Trend lines are available to help interpret Predicted vs. reference plots.

View – Trend Lines – Target Line

Insert a target line in a 2-D scatter plot.


The target line is the line with slope = 1.0 and offset = 0.0 (or equation Y=X). In many cases
this line will be the optimal solution, e.g. in predicted vs. reference plots.

View – Trend Lines – Regression Line

A regression line is drawn between the data points of a 2-D scatter plot, using the least
squares algorithm.
Available for Predicted vs. reference plots.
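The regression line is the ordinary least-squares fit through the points, while the target line is fixed at Y = X. Purely as an illustration with dummy values:

    import numpy as np

    reference = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    predicted = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
    # Degree-1 least-squares fit: predicted ~ slope * reference + offset
    slope, offset = np.polyfit(reference, predicted, 1)
    print(slope, offset)  # close to the target line's slope=1.0, offset=0.0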

View – Uncertainty Limit


Uncertainty limits can be indicated using this option for regression coefficients line plots.
For more information, see Martens’ Uncertainty Test and how to plot regression
coefficients.

4.14. Insert
4.14.1 Insert menu
Use the Insert menu to add items to the project navigator.

Insert – Data Matrix…


Add a new data table, which may be empty, or filled with predefined values.
See the insert data matrix dialog documentation.

Insert – Create Design…


Create a designed experiment table to perform a DOE.
See the design experiment wizard documentation.

Insert – Duplicate Matrix…


Create a replicate of an existing data table.
See the duplicate matrix dialog documentation.

Insert – Custom Layout


Create custom layouts for plotting any data matrix or results in a two-plot or four-plot
viewer.
See the custom layout dialog documentation.


4.14.2 Insert – Duplicate Matrix…


When working with data, it is advisable to always maintain a copy of the raw data.
In addition, to use matrices generated while running an analysis for other purposes, it is
necessary to duplicate them. Select the matrix to be duplicated and use the menu option
Insert – Duplicate Matrix… to obtain a replicate of the data table.
This will create a second data matrix, bearing the same name with a replication number in
parentheses, for example “(1)” for the first replication. It is now possible to work on this
replicated matrix.
Duplicate matrix dialog

A window will open, so as to enable a specific selection of the matrix and ranges to
duplicate.
Duplicate matrix dialog

When hitting the OK button, a second data set will be created, bearing the same name with
a replication number in parentheses, for example “(1)” for the first replication.
The structure of the table (row and column ranges) will be maintained.
Duplicated matrix


4.14.3 Insert – Data Matrix…


In this section, information is given on how to create a new data table. This can be done
from the Insert menu, selecting Data Matrix….
When clicking on this option the Add Data Matrix dialogue appears where one can define
the size of the data matrix in terms of rows for the samples, and columns for the variables.
By default, the values are 10 both for the number of rows and columns. This can be edited
by using the arrows or by directly typing in the desired number.
The initial values for the matrix can be chosen from the following options in the drop-down
list in the Add Data Matrix Dialog:

 Blank
 Unit matrix (diagonal 1 rest 0)
 Random values (0-1)
 Random values (Gaussian)
 Constant
 Serial numbered rows
 Serial numbered columns
 Serial rows with shift

If Constant is chosen, this value should then be entered in the Constant value field.
The Include Headers option will automatically display the default header names for Rows
and Columns in the data matrix.
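For illustration only, the initial-value options correspond roughly to the following NumPy constructions (not the software's own code):

    import numpy as np

    rows, cols = 10, 10
    blank   = np.full((rows, cols), np.nan)  # Blank
    unit    = np.eye(rows, cols)             # Unit matrix (diagonal 1, rest 0)
    uniform = np.random.rand(rows, cols)     # Random values (0-1)
    gauss   = np.random.randn(rows, cols)    # Random values (Gaussian)
    const   = np.full((rows, cols), 7.5)     # Constant (here 7.5)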


After clicking on OK, a matrix will be created with the default name “Data Matrix”. It
contains no values if Initial values were set to Blank, otherwise the designated values are in
the entries. Data can be entered into the empty cells.
Fill a data table
Data may be entered into a blank data table in several ways.
Manually
Data can be entered manually by double clicking on the specific cell and entering the
value. This operation can be done for the data table as well as the sample and
variable name.
Copying data from a spreadsheet (Excel)
Data can be copied from Excel to The Unscrambler® by either drag and drop, or by
copying and pasting it. To drag and drop the data from Excel, it must be selected in
Excel and then dragged into the specific entry or to the beginning (top left corner) of
the area where the data are to be added. The same can be done for the sample and
variable names. Data can also be entered from Excel by using the copy and paste
functions.
Rename
The default name of the data table is “Data Matrix”, but this can be renamed with a more
descriptive name. Rename the data matrix by right clicking on the data matrix icon in the
project navigator and selecting the option Rename.
When this is done, the name will be updated in the project navigator as well as in the
visualization window and navigation bar.
Other functions are also available from this right click menu.
Other approaches to adding data matrices
There are two other options to generate a data table in The Unscrambler®:

 Importing data
 Create a design table

4.14.4 Insert – Custom Layout…


The Custom Layout tool is a way to display any two or four selected plots.
It can be very useful for example to display the results of two PCA analyses with two
different pretreatments as shown in the plot below for easier comparison.
Custom Layout of two PCA score and loadings plot with or without pretreatment


To access this option select the menu Insert – Custom Layout… and select the desired
layout:

 Four viewers,
 Two Horizontal…,
 Two Vertical….

Insert – Custom Layout… menu

This menu gives access to a dialog box divided into four parts corresponding to the four frames of the visualization window, all containing the same options:
Custom Layout Dialog


Choose Matrix
This button is used to select the data set and variables to be plotted. By clicking on
Matrix it is possible to select a data matrix from the navigator. Adjust the Rows and
Cols to display only what is appropriate.
Choose Matrix dialogue box

To select a matrix that was generated during an analysis, hit the select result matrix button. The following dialog box will appear. From here it is possible to select any matrix.
Choose Matrix - Analysis dialogue box


Type
This drop-down list presents the plot options:
Type drop-down list

 Scatter: Click to see information about Scatter plots.
 Bar: Click to see information about Bar plots.
 3D Scatter: Click to see information about 3-D Scatter plots.
 Line: Click to see information about Line plots.
 Matrix: Click to see information about Matrix plots.
 Histogram: Click to see information about Histogram plots.
 Normal Probability: Click to see information about Normal Probability plots.
 Multiple Scatter: Click to see information about Multiple Scatter plots.

Title
Type in the title to be displayed on the specific plot.
Once all the necessary plots have been defined, hit the OK button; this action will display the selected plots.
It is always possible to abort this action by clicking the Cancel button.
Once the plots are displayed they are editable using the Properties menu, accessible from a right click on the plot or from the menu shortcut.
Further information is available for the following options:
 Format a plot,
 Annotate a plot,
 Zoom and re-scale a plot,
 Save and copy a plot.


4.14.5 Insert – Data Compiler…


Data Compiler:
This section helps the user to process and filter bad and suspect spectra out of a large data set, based on the combination of a unique sample identifier and a sample replicate index. Sample identifiers or replicate scans are identified using a categorical/text variable; to split such a variable, use the ‘Split Text/Category Variable’ feature in the Edit menu.
When clicking on this option the Data Compiler dialog appears where one can
define the Input data, Filter settings and Output options.
Input data:
This tab provides the option to input numeric data (usually spectra) from any data
matrix in project navigator by defining the rows and columns. The sample index
allows the user to select a categorical variable; the number of samples should match
with the data selected. Non-category variable and multiple selection options will not
be allowed and all observations within one category level will be treated as
replicates of a single sample. The minimum number of replicates is used to specify
the minimum number of samples to include in average. The default value is 10 and
minimum value is 1.
Data Compiler - Input data

Filter settings:
The Filter settings tab provides options for primary and secondary filter settings. Filtering can be done based on the models available in the project navigator, and the compatible models are PCA, PCR, PLSR and SCA. Models with auto-pretreatments can also be defined by clicking the pretreatment button. Only full models are acceptable.
Data Compiler - Filter Setting

Upon selection of the model, the available filter type can be selected. For PLS, PCR and PCA the available filter matrices are:

 Influence (T2 vs. F)
 Influence (T2 vs. Q)
 Leverage
 Hotelling’s T2
 Q-residuals
 F-test residuals

SCA may have some or all of the above, in addition to some or all of:

 Conformity limit
 Spectral match value

The Components option allows selection of the number of components from the selected model. The default number of components is the user-defined ‘set components’. The user also has the option to select among six levels of significance, active for the filter types Influence, Hotelling’s T2, Q-residuals and F-residuals.
The Limit settings are active for the following filter types:


 Leverage: Positive floating point value. Default value 1


 Conformity limit: Positive floating point value. Default value 3
 Spectral match: Floating point value in range 0-1. Default value 0.99

For additional filtering, ‘Include Secondary Filter’ has to be selected; the secondary filter offers the same settings as the primary filter.
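As an illustration of the kind of statistic these filters use, a minimal sketch of a Hotelling's T2 check on PCA scores follows (dummy data; the limit here is purely illustrative, whereas the software derives it from the chosen significance level):

    import numpy as np

    def hotelling_t2(scores):
        # T2_i = sum_k t_ik^2 / var(t_k), computed over the score matrix
        var = scores.var(axis=0, ddof=1)
        return (scores ** 2 / var).sum(axis=1)

    T = np.random.default_rng(1).normal(size=(50, 3))   # dummy PCA scores
    t2 = hotelling_t2(T)
    limit = 12.0                                        # illustrative only
    status = np.where(t2 > limit, "Outlier", "Good")
    print(status[:10])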
Output options:
The following output options are available.
Data Compiler - Output Options

Add Statistics: When selected, the tested model statistics from the filter models (primary and secondary) will be added as new column(s) to the original data table.
Add status: When selected, the status results from the filter model will be added as new category column(s) to the original data. The Influence filter type has four status levels: Good, Extreme, Suspect and Outlier. For all other filter types, the status levels are Good and Outlier. Additionally, users have the option to add the Good and Rejected row ranges to the existing matrix.
Add ranges for Good and Rejected: When checked (default), two row ranges ‘Good’ and ‘Rejected’ are added to the original (existing) data table. ‘Good’ and ‘Rejected’ status is defined by the output from both filters as well as the minimum number of replicates. Any sample that has status Good in either the primary or secondary filter, and that exceeds the minimum number of replicates, will be interpreted as Good. All others will be tagged as Rejected.


Add mean matrix: When checked, the average of all non-rejected observations is calculated and returned for each sample. Users also have the additional option to add the standard deviation for each sample. Average and standard deviation are calculated only if the number of non-rejected replicates exceeds the minimum number entered in the Input data tab.
Add median matrix: When checked, the median of all non-rejected observations is calculated and returned for each sample. Users also have the additional option to add the range for each sample. Median and range are calculated only if the number of non-rejected replicates exceeds the minimum number entered in the Input data tab.
Include column with number of replicates: When checked, the first column in output
matrices will be the number of replicates used for calculating the summary statistics.

4.15. Plot
4.15.1 Plot menu
The Plot menu has different modes: one comes with the matrix editor, and for each analysis it gives a list of plots related to that analysis.
The plot interpretations chapter provides more detailed information for generic plots.
Editor mode

Plot – Line
The Line plot displays one or more data vectors. When plotting from the Editor, mark the
row(s) or variable(s) (Columns) to be plotted; one sample/variable gives a one-dimensional
plot; specifying a range adds several line plots.
One can define ranges or create ranges for samples as well as variables from the edit menu
Edit - Define Range, see using define range.
For more information see the line plot documentation.

Plot – Bar
The Bar plot displays data vectors as bars.
For more information see the bar plot documentation.

Plot – Scatter
The Scatter plot shows two data vectors plotted against each other.
When plotting from the Editor, select the two rows or variables (columns) to be plotted
before using the Plot command.
For more information see the scatter plot documentation.

Plot – 3-D Scatter


The 3-D Scatter plot shows three data vectors plotted against each other.
When plotting from the Editor, mark the three samples or variables to be plotted before
using the Plot command.
For more information see the 3-D scatter plot documentation.


Plot – Matrix
In this plot, a two-dimensional matrix is visualized. The plot is useful to get an overview of
the data before starting any analyses, as obvious errors in the data and outliers may be seen
at once. One may also want to take a look at this plot before deciding whether to scale or
transform the data for analysis.
For more information see the matrix plot documentation.

Plot – Normal Probability


The Normal Probability plot shows the deviation from an assumed normal distribution of
the data vector. It is not possible to plot more than one row or column at a time in this plot.
Select the sample or variable to be plotted and use Plot – Normal Probability.
For more information see the normal probability plot documentation.

Plot – Histogram
This plot displays the distribution of the data points in a data vector, as well as the normal
distribution curve. A histogram gives useful information for exploring raw data. The height of
each bar in the histogram shows the number of elements within the value limits of the bar.
For more information see the histograms documentation.

Plot – Multiple scatter


The Multiple scatter plot shows a matrix of 2-D scatter plots for comparing several variables
in a flat view.
For more information see the multiple scatter plot documentation.
Viewer mode
After running an analysis, the Plot menu for the Viewer mode will change to a list of
available plots.
See the respective analysis method chapters for how to use and interpret these plots.

4.16. Tasks
4.16.1 Tasks menu
This menu is divided into three main groups of actions: Transform, Analyze and Predict.

Tasks – Transform
The Tasks – Transform options allow one to transform samples or variables to get data
properties which are more suitable for analysis and easier to interpret. Bilinear models, e.g.
PCA and PLS, basically assume linear data. The transformations should therefore result in a
more symmetric distribution of the data and a more linear behavior, if there are
nonlinearities.
The Unscrambler® offers many spectral pretreatments like derivatives, smoothing,
normalization, and standard transformations. All these can be found under Tasks –
Transform.


There is also a Compute_General function to transform data using basic elementary and
trigonometric mathematical expressions, and the matrix calculator, which has options for
linear algebra, matrix operations and reshaping of data.
For more information and a list of available transformations, see documentation for each
transformation

Tasks – Analyze
The Tasks – Analyze option provides multivariate analysis options consisting of:
Univariate statistics:

 Descriptive statistics, and


 Statistical tests

Qualitative multivariate analysis:

 Principal Component Analysis (PCA),


 Multivariate Curve Resolution (MCR),
 Cluster analysis, and

Quantitative regression techniques:

 Multiple Linear Regression (MLR),


 Principal Component Regression (PCR),
 Partial Least Squares Regression (PLSR), and
 Support Vector Machine Regression (SVR)

Special purpose methods:

 L-PLSR,
 Linear Discriminant Analysis (LDA),
 Support Vector Machine (SVM) classification, and
 Analyze design matrices

Tasks – Predict
The Tasks – Predict options provide means of applying a model on new samples for
prediction, projection or classification.
Projection
Project new samples to determine similarity with samples in a PCA, PCR or PLSR
model.
Regression
Predict unknown samples from regression models.
Prediction
SVM Prediction
Classification
Classification of unknowns by applying SIMCA, LDA, or SVM models.
SIMCA classification
LDA classification
SVM classification


4.17. Tools
4.17.1 Tools menu

Tools – Modify/Extend Design…

or Ctrl + Shift + M
Open an existing experimental design for modifications.
See the modify design dialog documentation.

Tools – Matrix Calculator…

or Ctrl + M
The Matrix calculator is used to perform simple linear algebra functions like matrix
multiplication, addition, division, inverse etc. and to reshape, append or combine two
matrices.
See the matrix calculator dialog documentation.

Tools – Report…

or Ctrl + R
A tool to create reports as PDF documents with plots and data.
See the report generator dialog documentation.

Tools – Audit Trail…

This command displays the audit trail for the active project. The audit trail is a log of actions
by a user, showing a date and time stamp for the actions.
See the audit trail dialog documentation.

Tools - Run Scripts


Please refer to the plug-in specific help documentation for these add-on options. Contact CAMO Software for more details.

Tools – Options…

This dialog can be used to change the appearance of the data editor or viewer, as well as
other options in The Unscrambler®. Default numeric formats and plot settings can be
defined here.
See the options dialog documentation for details.


4.17.2 Tools – Audit Trail…


The audit trail provides a record of the actions performed by different users. Audit trails are required for maintaining data integrity and are a requirement of Good Manufacturing Practice (GMP) and the US FDA’s 21 CFR Part 11 requirements for electronic signatures.
Caution: Audit trails are not a substitute for well-documented work.
For each operation, The Unscrambler® keeps track of:

 Date
 Time Zone
 Time
 User name
 Action.

The types of actions that are tracked in the audit trail include:

 Creation of the project
 Import of data
 Transformation: compute functions, smoothing, MSC, derivative, etc.
 Formatting: sorting, delete
 Analysis: statistics, PCA, regression, prediction, etc., with detailed model settings.
Audit trail dialog

In Non-Compliance mode, the audit trail can be emptied by selecting the Empty button in
the dialog.
The audit trail can be disabled from the Tools - Options under the General tab.
When in Compliance Mode, the Audit Trail cannot be emptied. It can only be saved in a non-editable PDF document for further printing, if desired.
The Audit Trail for Compliance Mode is shown below. Also, in Tools - Options the Audit Trail
cannot be disabled in Compliance Mode.
Audit Trail in Compliance Mode


4.17.3 Tools – Matrix Calculator…


The Matrix calculator is used for simple linear algebra like matrix multiplication, addition, division, inverse, etc., and for matrix shaping. The options available are:

 Unary operations: Linear algebra on a single matrix


 Binary operations: Arithmetic operations on two matrices
 Reshape a single matrix
 Combine two matrices

The calculator tool should be used only with matrices that are purely numeric. If there are missing values, those columns are kept out; likewise with text and category entries. For the remaining matrix contents, compatibility follows the feasibility of the matrix operations.
See also the Compute_General transform that can do calculations on samples and variables
using basic mathematical expressions.
Matrix calculator dialog


Matrix calculator’s shaping tab

Single matrix operations


Unary operations imply that the arithmetic operation is computed on a single matrix.

Inverse (X): Moore-Penrose matrix inverse


The Moore–Penrose inverse of an arbitrary matrix (including singular and rectangular) has
many applications in statistics, prediction theory, control system analysis, curve fitting and
numerical analysis.
In mathematics, and in particular linear algebra, the pseudoinverse A+ of an m × n matrix A is
a generalization of the inverse matrix.
A common use of the pseudoinverse is to compute a ‘best fit’ (least squares) solution to a system of linear equations that lacks a unique solution. The pseudoinverse is defined and unique for all matrices whose entries are real or complex numbers and can be calculated using the singular value decomposition.
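A minimal NumPy illustration of the least-squares use of the pseudoinverse (dummy values, not the software's own code):

    import numpy as np

    A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # rectangular matrix
    b = np.array([1.0, 2.0, 3.0])
    A_pinv = np.linalg.pinv(A)   # Moore-Penrose pseudoinverse (via SVD)
    x = A_pinv @ b               # least-squares solution to A x = b
    print(x)                     # same as np.linalg.lstsq(A, b, rcond=None)[0]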

Singular Value Decomposition (SVD)


In linear algebra, the singular value decomposition (SVD) is an important factorization of a
rectangular real or complex matrix, with many applications in signal processing and
statistics. Applications which employ the SVD include computing the pseudoinverse, least
squares fitting of data, matrix approximation, and determining the rank, range and null
space of a matrix.

QR decomposition
A QR decomposition (also called a QR factorization) of a matrix allows for the solution of
linear systems of equations.
It is a decomposition of the matrix into an orthogonal matrix (Q) and a right triangular matrix
(R). QR decomposition is the basis for a particular eigenvalue algorithm, the QR algorithm.
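
For illustration, a small NumPy sketch (independent of The Unscrambler®, with invented
values) of using the QR factorization to solve a linear system:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

Q, R = np.linalg.qr(A)            # A = QR, with Q orthogonal, R right (upper) triangular
x = np.linalg.solve(R, Q.T @ b)   # solve the triangular system R x = Q'b
print(np.allclose(A @ x, b))      # True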

Element-by-element operations
Array arithmetic operations that are carried out element by element on one matrix (see the
sketch after this list):
X’X
Cross-product of the matrix with itself (the transpose of X multiplied by X)
1./X
Reciprocal of the individual matrix elements
X.*X
Element-by-element square of X
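
The same three operations expressed in NumPy, for comparison (a sketch with invented
values; in The Unscrambler® they are performed through the Matrix Calculator dialog):

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(X.T @ X)    # X'X: cross-product of X with itself
print(1.0 / X)    # 1./X: reciprocal of each element
print(X * X)      # X.*X: element-by-element square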
Two matrix operations
Binary operations imply that the arithmetic operation is computed on the data and an
operand, following the rules of linear algebra:

 Addition: X+Y
 Subtraction: X-Y
 Multiplication: X*Y
 Matrix division: X*inv(Y)
 Element by element division: X/Y

The calculations that are possible depend on dimensionality of the matrices X and Y that
have been selected in the scope.
Add, Hadamard product and subtract require X and Y to have the same number of rows and
columns, or Y has to be a row or column vector with dimensions matching those of X.
The X and Y matrices in the calculations should not be confused with inputs and outputs of a
model.
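
A NumPy sketch of the binary operations and their dimensional requirements (invented
values, independent of The Unscrambler®):

import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.array([[5.0, 6.0], [7.0, 8.0]])
v = np.array([10.0, 20.0])            # row vector matching the columns of X

print(X + Y)                  # addition: same dimensions required
print(X - v)                  # subtraction with a matching row vector
print(X @ Y)                  # matrix multiplication
print(X @ np.linalg.inv(Y))   # matrix division: X * inv(Y)
print(X / Y)                  # element-by-element division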
Reshape matrix
Change dimensions of a two-dimensional matrix.
One can rearrange the elements of a matrix to change the number of rows and columns.
This is especially useful when importing data where a matrix has been stored as a one-
dimensional list of values.
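
For example, six values imported as a one-dimensional list can be rearranged into a 2 x 3
matrix (a NumPy sketch with invented values):

import numpy as np

flat = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # matrix stored as a 1-D list
matrix = flat.reshape(2, 3)                       # 2 rows, 3 columns, filled row by row
print(matrix)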


Combine two matrices


A user can combine matrices with either of the two options:

 Augment X|Y: column-wise combination of matrices; i.e. 4x2 + 4x2 gives 4x4
 Append Y to X: row-wise combination of matrices; i.e. 4x2 + 4x2 gives 8x2

Augment requires X and Y to have the same number of rows. Append requires X and Y to
have the same number of columns.
These are binary operations in the shaping tab available only when the Binary operand box is
checked. This requires that the values be numeric. If there are columns of non-numeric data,
they will be kept out of the calculation. If there are missing values in either matrix, the rows
(columns) containing them will be kept out of the calculation.
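
The two combinations expressed in NumPy terms, for comparison (a sketch with invented
shapes):

import numpy as np

X = np.ones((4, 2))
Y = np.zeros((4, 2))

print(np.hstack([X, Y]).shape)   # Augment X|Y: same number of rows -> (4, 4)
print(np.vstack([X, Y]).shape)   # Append Y to X: same number of columns -> (8, 2)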

4.17.4 Tools – Options…


This menu option allows the user to define preferences for the General, Viewer and Editor
settings, changing the appearance and performance of The Unscrambler®.
General

This section contains options for the following:


Select temporary folder
This is the location where The Unscrambler® stores temporary results during
calculations. These files will be removed when exiting the application.
Use audit trail


Use this option to enable/disable the audit trail. Note: This option is not active when
the program is installed in Compliance Mode.
Prompt user to view plots
When checked, the user will be prompted to view the model plots when opening a
project, after training a model and after predictions. This option will be unchecked if
the ‘Do not ask me again’ option is selected in the View Plots dialog.
Viewer
These options allow a user to set the default appearance properties of plots at the
application level. The settings can still be customized and changed at the plot level by editing
the properties for a given plot.

The following are properties that can be set from the Viewer:
 Antialiasing
Use this option to set antialiasing in all analysis-generated plots.
 Point label visible
Use this option to have the default view on plots have the point labels visible. Point
labels can be toggled on/off from a plot.
 Line plot point visible
Use this option to have the default view on line plots have the points visible. The
point can be toggled on/off from a plot.
 Point size
Use this option to set the default size of points. This can be changed for individual
plots under Properties.
 Line size
Use this option to set the default line size. This can be changed for individual plots
under Properties.


 Sample grouping point size


Use this option to set the default size of points when applying sample grouping. This
can be changed for individual plots in the Sample Grouping dialog.
 Crosshair axes color
Use this option to set the default color for plot axes. This can be changed for
individual plots under Properties.
Editor
These options allow a user to set the default properties of the worksheet view at the global
level. This option will be available only when a data matrix is present in the project.

The following properties can be set from the Editor tab:


 General
This tab provides the settings for defining the maximum number of categories
(default - 50, maximum - 100000), the maximum number of undo steps (default - 10,
maximum - 5000) and the file size above which preview is disabled (default - 10 MB).
 Format
This tab provides the settings for Numeric and Date time display format.
 Color
This tab provides the settings for color of Row header, Column header, Category and
Matrix name.
 Font
This tab provides the font settings for Row header, Column header and Matrix name.

4.17.5 Tools – Report…


The Report Generator is a tool to generate customized reports.


To access the Report Generator, select Tools – Report…. The Report generator dialog
appears and gives access to all matrices and plots in the current project. Add plots and
matrices in the field Included in report to create a customized report.
To add a matrix use the Data tables field and:

 Either select a data matrix that is in the Navigator as a node from the drop-down list
 Or select one from an analysis using the Select result matrix button

Then click on Add matrix.


To add a plot, select one in the Available plots list and move it to Included in report with the
right arrow.
Generate Report Dialog

At the bottom of the dialog are three tabs where the user can choose settings for the
security, report content, and page setup.
Security
Passwords can be enabled to limit access for editing and viewing the report. The
user can enable password-protected editing of reports.
Printing, editing, copying, or annotating can be disabled for added security.


Content
Under the Content tab the user can choose to append notes and/or use the Editor
format for numbers.
Report Generator Content

Page Setup
On the Page Setup tab, a user can define the paper size (A2, A3, A4, letter, legal),
and orientation (portrait or landscape).
Report Generator Page setup

One can also preview a report by clicking on the Preview button.


Save the report and close the dialog using the appropriate buttons.
All reports will be saved in PDF format with a file name, and in a location given by the user.

4.18. Help
4.18.1 Help menu
The help menu provides access to help topics and licensing-related information in The
Unscrambler®.

Help – Contents

or F1
Open help viewer for browsing.
See the How to use help documentation.

Help – Search
Ctrl+F1
Open help viewer for searching.


Help – Modify License


Change the current license of The Unscrambler® by typing in a new activation key. Use this
feature for instance to upgrade from a trial installation to a full version of The Unscrambler®.
See the modify license dialog documentation.

Help – User Setup…


Manage user profiles.
See the user setup dialog documentation.

Help – About
Shows:

 Software version number


 License holder and activation key
 Addresses of CAMO Software offices
 Additional information such as build number and date
 A list of all upgrades and plugins installed

The System Info button will open the “Windows System Information” utility.

4.18.2 Help – Modify License…


Use this dialog to activate or modify a license for The Unscrambler®.
Note that this requires certain privileges and may, in regulated environments, require the
intervention of a system administrator.
Press the Obtain button to request the activation key from the CAMO Software web site.
The activation key will be sent by email.

The above step requires an Internet connection.


Contact a sales representative by phone or fax if the computer is not connected to the
Internet. Note that the machine ID shown in this dialog will be required.


Company name and Email address fields become active when the activation key is for a
time-limited or perpetual license.
Contact details can be found at http://www.camo.com/contact

4.18.3 Help – User Setup…


From version 10.2 of The Unscrambler®, the User Setup is only available in the Non-
Compliant mode of operation. For details of Compliant and Non-Compliant modes of
operation, consult the installation guide or refer to the following sections:

 Login
 Compliance

Users are recommended to create a login and identification, which will not only secure their
work with The Unscrambler®, but also provide valuable information for keeping track of
actions taken on data through the audit trail, where the user name is logged with every action.
Use the menu option Help - User Setup… to access the dialog.
User setup dialog

The above image shows an example of a completed setup. Enter the pertinent information
in the provided fields and then click Save.
The following is a brief explanation of the fields:
User Name
This is the name that will be shown in the login dialog each time the program is
started.
First Name


The first name of the user.


Last Name
The surname of the user.
Initial
Usually the first letters of the first and last names entered.
Location
Here a user can enter the site/geography/company name associated with the
license.
Password Management
By checking the Password required at login option, the user will be required to enter a valid
user name and password to use the software.
The functions of this option are listed below:
Enter a Password
A user is required to enter a password of any size and detail into this field.
Re-enter Password
This option requires the user to confirm that the two password entries are consistent.
If they are not, the following warning will be provided:
Password mismatch warning

Security Question
Select from a list of pre-defined questions to provide an answer to.
Answer
Enter the answer to the question here.
If a password is forgotten, it can be retrieved provided the answer to the security
question is known. See the section on Login for more details.
Contact CAMO Software for information about how to register more than one user.
Contact details can be found at http://www.camo.com/contact

5. Import
5.1. Importing data
This section describes how to import data from supported instruments and software utilities
into The Unscrambler®.

5.1.1 Supported data formats


The Unscrambler® can import the following data formats:

 CAMO Unscrambler® X Models and Projects
 CAMO Unscrambler® Version 9.8 or earlier
 CAMO Unscrambler® DOS file format
 Generic ASCII and other text based files
 Microsoft Excel formats including .xlsx
 Matlab data table files
 rap ID vendor proprietary format
 Universal spectroscopic file import
 Universal chromatographic file format
 Thermo universal file import
 Bruker Optics OPUS proprietary format
 Brimrose proprietary format
 ASDI Indico proprietary format
 Thermo OMNIC proprietary format
 Varian proprietary format
 Guided Wave CLASS-PA proprietary format
 FOSS/NIRSystems NSAS proprietary format
 PerkinElmer proprietary format
 DeltaNu proprietary format
 Visiotec proprietary format

The following sections describe these import formats in more detail.

The Unscrambler® data and models

 The Unscrambler® X
 The Unscrambler® 9.8 and earlier versions [1]

Version    File name extensions [2]    Compatibility
X          .unsb, .unsx [3]            Read, Write
X-9.0      .AMO                        Write
9.8–9.2    .??[DLPTW]                  Read, Write [4]
9.8–9.7    .??M                        Read

Non-proprietary data exchange formats

 ASCII, CSV and tabular text


 NetCDF
 JCAMP-DX

Formats created by commonly used applications

 Microsoft Excel spreadsheets


 Matlab data files

Instruments

 Thermo Galactic GRAMS


 Brimrose
 OPUS (Bruker Optics)
 CLASS-PA & SpectrOn (Guided Wave)
 Indico (ASD)
 NSAS (FOSS NIRSystems)
 OMNIC™ (Thermo)
 Varian
 PerkinElmer
 RapID
 DeltaNu
 VisioTec

Interface protocols

 Databases

Other interfaces such as OPC and MyInstrument are supported. Contact CAMO Software for
details. http://www.camo.com/contact

5.1.2 How to import data


Choose which kind of file format to import from the File – Import Data submenu, select the
files to import and click OK.
Dialogs differ according to the type of file and the amount of user input required, allowing
the user to select which matrices to import. They also provide an option to preview data
before import.
File formats are recognized based on the file name extension. If the file(s) to be
imported does not have the expected extension, it may have to be changed
manually in a file manager.

Drag and drop files


Files can also be imported by dragging them from the file manager and dropping them on
The Unscrambler® application window.


Drag and drop selections


Instead of going via the File – Import Data menu, data can be imported by using drag and
drop or copy and paste. Simply select the file/data in another Windows application like Excel
and drag it into the project navigator or the workspace of The Unscrambler®.
One can select whether to insert the data as columns or rows. The columns or rows are
appended at the end of the existing data table.
One may also overwrite the existing data in the Editor. The area that is going to be
overwritten is marked by a frame.

[1] See also the chapter on migrating to X.

[2] The file names are given in glob notation: ”*” means any number of characters, ”?” any
single character, “[ABC]” any one of A, B or C.

[3] Support for XML is available via a separately installed export plug-in.

[4] Available via a separately installed export plug-in.
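
For illustration, the glob patterns above can be tested with Python’s fnmatch module (a
sketch, independent of The Unscrambler®; the file names are invented):

from fnmatch import fnmatch

# The pattern .??[DLPTW] from the compatibility table above:
print(fnmatch("results.A1D", "*.??[DLPTW]"))   # True: any two characters, then D
print(fnmatch("results.A1M", "*.??[DLPTW]"))   # False: M is not one of D, L, P, T, W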

5.2. ASCII
5.2.1 ASCII (CSV, text)
Type of data
Array
Software
ASCII (American Standard Code for Information Interchange) is a character encoding
scheme and the de-facto file standard supported by many applications.
File name extension
*.csv, *.txt, *.*

 File format information


 How to use it

5.2.2 About ASCII, CSV and tabular text files


ASCII, CSV (character separated values) and tabular text are common names for essentially
the same format: Data saved as a plain text file.
The Unscrambler® supports ASCII formats with

 Typical file name extensions: .csv, .txt


 Semicolon delimited files


 Files with the comma used for decimal point
 Tab delimited files
 Space delimited files
 Custom string used as delimiter, e.g. 1.4**4.5**6.7**8.9 (“**” is given as the custom
separator)

5.2.3 File – Import Data – ASCII…


ASCII files with different formats can be imported into The Unscrambler® through the File –
Import Data – ASCII menu. Single file or batch import is allowed.

 Single file import


 Batch import

Single file import


When a single text file (e.g. .txt, .csv, …) is selected for import, the following dialog is
used.
ASCII import dialog

Data delimiters
Numbers may be delimited by different characters in different ASCII files. Specify which
delimiter is used in the file to be imported, in the field Separator. The choices are

 Comma


 Semicolon
 Space
 Tab
 Custom

Note: Carriage Return, Line Feed and Tabulation are not among the available
delimiters in the dialog. They are default item delimiters, and will automatically be
recognized as such. Do not specify them in the Custom field!
There is an additional list of check box options below:

Process double quotes


Interpret double quotes such that separators within double quotes are not
recognized as such
Treat consecutive separators as one
Consider multiple identical separator characters as one.
Normally used for tabular text files that have been aligned into columns using
spaces.
Data Type
There are three options available for data import:
Auto - The Unscrambler® will import individual columns as text or numeric data
based on the values in the first row.
Numeric - The Unscrambler® will import all columns as numeric. Cells with non-
numeric content will be lost.
Text - The Unscrambler® will import the entire table as text data type.
Individual variables can be converted to other data formats after import using Edit – Change
Data Type.
Skip Rows
This option allows a user to skip a predefined number of header rows during the
import using the number spin box.
Preview
This option allows a user to turn on/off a preview of the tabular data before import.
Headers
One can add multiple rows or columns as headers.
Sample and/or variable names can be selected using the Headers options; multiple columns
and rows can be selected for variable ID and sample ID, up to a maximum of 5 headers.
The user can select rows and columns from the data preview table while importing. One can
import all of a table, or just portions of it.
Note: If names are not enclosed in quotes in the ASCII file, they should not contain
any spaces if “space” is selected as the separator. (See Separators above.)

Missing data
Any text string entries in a numeric column will be imported as empty or missing data.


Make sure that Treat consecutive separators as one is unchecked when importing ASCII files
that have empty entries for missing data, such as:

s4,0.618,,0.6022
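
How such a line is tokenized can be illustrated with Python’s csv module (a sketch,
independent of The Unscrambler®’s own parser):

import csv
import io

line = "s4,0.618,,0.6022\n"
row = next(csv.reader(io.StringIO(line)))
print(row)   # ['s4', '0.618', '', '0.6022']: the empty field becomes a missing value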
Batch import
Often spectrometers output spectra in individual files, such that each file contains a single
spectrum (with or without headers). A selection of such single spectrum text-files can be
imported in a single step in The Unscrambler®, simply by selecting multiple files to open. A
simplified dialog is used for batch import.
Batch import dialog

Each spectrum is imported and appended to the previous spectra row-wise. If spectra are
given as a single row in the files, this means that each spectrum will become a single row in
the imported data table. If spectra are given column-wise (i.e. separated by carriage
return/newline), they should be transposed using the Transpose the data before import
check-box.
The sample file-names are included in a row-header in the imported table.
See section on single file import above for general import options.

5.3. BRIMROSE
5.3.1 Brimrose
Type of data/instrument
NIR
Data dimensions
Multiple spectra
Instrument/hardware
Snap!32 v2.03 (BFF3)
Snap!32 v3.01 (BFF4)
Vendor
Brimrose
File name extension
*.dat


 File format information


 How to use it

5.3.2 About Brimrose data files


This option allows for the import of BFF3 and BFF4 data from Brimrose instrument files. The
BFF3 file is created from Snap!32 v2.03 while the BFF4 file is created from Snap!32 v3.01.

5.3.3 File – Import Data – Brimrose…


One or several Brimrose files (BFF3 or BFF4) can be imported into a project in The
Unscrambler®.

How to import data


Select the files to import from the file list in the Brimrose Import dialog or use the Browse
button to display a list of available files. The different files must have the same number of X-
variables to allow simultaneous import.
Brimrose Import

The source files may contain one or more samples per file; multiple selections allow several
samples to be imported at the same time.

Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.


Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.

Auto select matching spectra


The Auto select matching spectra preview option allows the automatic selection of all
data file(s) with the same wavelength ranges as the current selection. A screenshot of the
Brimrose Import dialog with the auto select option chosen is provided below.

Once Auto select matching spectra has been checked, it will select only those files that have
the same number of variables.

Sorting data
The file name, number of samples, number of X-variables, wavelengths for the first and last
X-variables, and step (increase in wavelength), are displayed for each file.


Step is the increment in wavelength (or wave number) between two successive variables.
The following relationship should be true:

First X-var + Step * (Number of X-vars - 1) = Last X-var
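
For example, a file whose first X-variable is 1100, with a step of 5 and 281 X-variables, must
have 1100 + 5 * (281 - 1) = 2500 as its last X-variable.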


The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.

Preview
Preview spectra displays a line plot of the files that have been selected for import.

5.4. Bruker
5.4.1 OPUS from Bruker
Type of data/instrument
FT-IR, FT-NIR, Raman
Data dimensions


Single spectra
Instrument/hardware

Software
OPUS
Vendor
Bruker
File name extension
*.0x, *.1

 File format information


 How to use it

5.4.2 About Bruker (OPUS) instrument files


One or several spectra from OPUS data files generated by Bruker instruments using OPUS
software can be imported. The import supports 2-D spectral files. When multiple spectra are
contained in a file, the preference is to import the normalized spectrum. However if a file
contains a single spectrum (sample or reference alone), then these will be imported. Data
files containing 3-D spectra are not supported.

5.4.3 File – Import Data – OPUS…


This option supports the import of data from OPUS files generated by Bruker instruments
using the OPUS software.
Data files containing 3-D spectra are not supported.
In the OPUS Import dialog box, one can choose a folder where OPUS files are stored. A list of
OPUS files from which data can be imported is then displayed.
Note: Multiple files that vary in their spectral range and resolution cannot be
imported together.

How to import data


Select the files to import from the file list in the dialog OPUS Import or use the Browse
button to get a list of available files. The different files must have the same number of X-
variables to allow simultaneous import.
OPUS Import


Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.

Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears allowing a user to set
the Tolerance for allowing data with different start or end points to be imported.
Interpolate Tolerance Dialog

For more information see the section on Import Interpolate.

Auto select matching spectra


The Auto select matching spectra preview option provides automatic selection of all data
file(s) with the same wavelength ranges as the current selection. This dialog is used for
import of spectral data from instruments with the OPUS file format. A screenshot of the
OPUS Import dialog with the auto select option chosen is given below.

Once Auto select matching spectra has been checked, the files in the list having the same
number of variables will be selected.
Use the Interpolate option to import data with different start or end points.


Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables are
displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.

Preview
Preview spectra displays a line plot of the files that have been selected for import.

5.5. DataBase
5.5.1 Databases
Type of data
Array
Software


ODBC/ADO compliant databases

 File format information


 How to use it

5.5.2 About supported database interfaces


This feature allows a user to import data from a wide selection of databases that are
ODBC/ADO compliant.

5.5.3 File – Import Data – Database…


Data can be imported from a database into a project in The Unscrambler®.
Since there are many possible database platforms and the data structure may be complex,
the user must go through several tabs in order to specify the import:

 Provider: Database service protocol to use


 Connection: Server address and user authentication
 Advanced: Network settings
 All: Initialization properties

Note: The Data Link Properties dialog is a standard Windows dialog. Depending on
the local language setup, this dialog may be displayed in a language other than
English. The name of the dialog will be different, the fields will have a different
text, but the layout and meaning of all fields will be the same as described
hereafter. For additional information, click Help; this will start the Microsoft help
system related to the current sheet in the Data Link Properties dialog.
The following sections describe the standard stages to go through in order to establish a
connection from The Unscrambler® to a database.

Data link properties dialog: Provider


In the Provider tab of the Data Link Properties dialog, select the database provider to
import from.
Data Link Properties, Provider sheet


Hit Next to shift to the next dialog sheet, Connection.

Data link properties dialog: Connection


In the Connection sheet of the Data Link Properties dialog, locate the desired database from
the proper server and specify the security settings for logging on to the database.
Data Link Properties, Connection sheet


Specify the following three fields:


 Specify the source of data prompts for a choice between:
Use data source name
select from the list, or type the ODBC database source name (DSN) to access. More
sources can be added through the ODBC Data Source Administrator. Refresh the list
by clicking Refresh, and
Use connection string
allows the user to type or build an ODBC connection string instead of using an
existing DSN.
 Enter information to log on to the server: type the User name and Password to use
for authentication when logging on to the data source. Ticking box Blank password
enables the specified provider to return a blank password in the connection string.
Tick Allow saving password to allow the password to be saved with the connection
string.
 Enter the initial catalog to use: type in the name of the catalog (or database), or
select from the drop-down list.
Once everything is specified, press Test Connection to check whether contact with the
desired database has been successfully established. If the connection fails, ensure that the
settings are correct. For example, spelling errors and case sensitivity can cause failed
connections.


Data link properties dialog: Advanced


Go to the Advanced Tab to choose network settings, set connection timeout, and access
permissions.
Data Link Properties Advanced Tab

Data link properties dialog: All


The All tab is provider-specific and displays only the initialization properties required by the
selected OLE DB provider.
Data Link Properties All Tab


To edit a value, select it, and click the Edit Value… button, which opens the dialog where a
property can be changed.

Import from database dialog


From the List of tables, select the data table to access. The List of fields to the right is then
updated accordingly.
Select database tables


Press the Next button to preview the data and proceed to complete the import.
Preview data before import


The data types will be detected for individual columns and imported as numeric values or
text.
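
For reference, the same ODBC connection can be exercised outside The Unscrambler® with
Python’s pyodbc package (an illustrative sketch; the DSN, credentials and table name are
hypothetical):

import pyodbc

# Connect through an ODBC data source name (DSN); all values are placeholders.
conn = pyodbc.connect("DSN=LabResults;UID=analyst;PWD=secret")
cursor = conn.cursor()

# Fetch rows from a table, as the import wizard does after one is chosen
# from the 'List of tables'.
cursor.execute("SELECT * FROM spectra")
for row in cursor.fetchmany(5):   # preview a few rows before a full import
    print(row)
conn.close()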

5.6. DeltaNu
5.6.1 DeltaNu
Type of data/instrument
Raman spectrometer
Data dimensions
single vector spectrum or multiple spectra in an array
Instrument/hardware
NuSpec software
Pharma-ID Raman spectrometer
Vendor
DeltaNu
File name extension
*.dnu, *.lib

 File format information


 How to use it

5.6.2 About DeltaNu data files


This option allows for the import of data files generated by the DeltaNu Raman
spectrometers using the NuSpec software. The files may contain a single spectrum or
multiple spectra. Typically the file extensions are .dnu or .lib, but files are not limited to
these extensions.

5.6.3 File – Import Data – DeltaNu…


This option allows a user to import data from the DeltaNu Pharma-ID Raman spectrometer
operating with NuSpec software. Files with the following file name extensions are
supported: .dnu.

How to import data


From the File – Import Data menu, select DeltaNu. The DeltaNu dialog box displays a list of
files from which one can import data generated using the NuSpec software from DeltaNu. If
necessary, click the Browse button to access files from a different folder.
DeltaNu import


Multiple selections are possible, by checking the box next to more than one file. The
selected samples must be of the same size (variables must match).

Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.


Auto select matching spectra


The Auto select matching spectra preview option allows the automatic selection of all data
file(s) with the same wavelength ranges as the current selection. This dialog is used by
spectral data imports from instrument formats such as DeltaNu, GRAMS, OPUS, etc.

Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables, and step
(increase in wavelength), are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.

Preview
Preview spectra displays a line plot of the files that have been selected for import.


5.7. Excel
5.7.1 Microsoft Excel spreadsheets
Type of data
Array (spreadsheet)
Software
Excel (part of Microsoft Office)
Vendor
Microsoft
File name extension
*.xls, *.xlt, *.xlsx, *.xlsm

 File format information


 How to use it

5.7.2 About Microsoft Excel spreadsheets


Data in Excel Workbooks from Microsoft Excel 97 and newer can be imported.
The Unscrambler® supports the OOXML (Office Open XML) file format, introduced with
Office 2007, with more than 255 columns. Users should remove any formatting from
spreadsheets before importing into The Unscrambler®.
Binary Excel 2007 workbooks with file name extension .xlsb are not supported.

5.7.3 File – Import Data – Excel…


The Excel Workbook files must have the file name extensions .xls or .xlsx to be
recognized by The Unscrambler®.
Note: The Unscrambler® supports the OOXML format (.xlsx file name extension)
with more than 255 columns.
Note: Users should remove any formatting (particularly borders) from spreadsheets
before importing into The Unscrambler®. To avoid data type recognition problems
on import, make sure there are no empty cells in first row of values.

To import data into The Unscrambler®


From the menu choose File – Import Data – Excel… to select an Excel file to open. Once a file
has been selected the Excel Preview dialog opens. An Excel workbook may contain several
worksheets. Select the worksheet that contains the matrix to be imported from the drop-
down list Select sheet or named range.
Once the sheet or named range is selected, the data preview window will open. The
screenshot below shows the Excel preview window, which enables the user to select the
desired data sheet, headers and data selection of rows and columns.
Excel Preview


All ranges that have been defined with names in the selected Excel sheet are listed under
Range names. Multiple row and column headers can be specified in headers, with up to a
maximum of 5 headers.
The sheet range is updated automatically if a range name is selected. The range can also be
entered manually, specifying the Rows and Columns, e.g. 2:1. All cells lying within this
rectangle are then imported.
Select the appropriate ranges as described above for the data values from the selection
option, as well as for the rows/sample and columns/variable names, if relevant.
Columns and rows can be removed from the import by selecting them within the preview
grid and pressing Del on the keyboard.

Data type
If the worksheet contains non-numeric values or a mixture of numeric and non-numeric
values, they can be imported. The radio button Auto can be selected to detect the data
format in the Excel spreadsheet and maintain that on import. If all the data are non-numeric,
they can be imported as text by selecting the radio button Text. If the spreadsheet has a mix
of text and numeric values, and one data type is selected, only data of that type will be
imported.

Skip lines
If there are rows of data at the top of the spreadsheet that should not be imported, use the
Skip lines option to enter the number of lines from the top to skip.
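
The same selections can be mimicked outside The Unscrambler® with pandas, for
comparison (an illustrative sketch; the file name, sheet name and ranges are hypothetical):

import pandas as pd

# Skip two lines at the top, use the next row as variable names,
# and read only columns B to D of the sheet 'Spectra'.
frame = pd.read_excel("measurements.xlsx", sheet_name="Spectra",
                      skiprows=2, header=0, usecols="B:D")
print(frame.head())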

5.8. GRAMS
5.8.1 GRAMS from Thermo Scientific
Type of data
Array
Data dimensions
Multiple spectra, constituents
Software
GRAMS
Vendor
Thermo Scientific (formerly Galactic)
File name extension
*.spc, *.cfl

 File format information


 How to use it

5.8.2 About the GRAMS data format


This format is from GRAMS, a software package developed by Galactic (now part of Thermo
Scientific), and available for data from many different instruments.
The data are stored in two different file types. Spectra are stored in binary files with the
.spc file name extension, and constituents are stored in ASCII files with the .cfl file name
extension. The two file types are connected so that if a .cfl file is imported into The
Unscrambler® both spectra and constituents are read. If a .spc file is imported, the spectra
are read, and accompanying Y values can also be imported with them.
“X-values” (usually wavelengths) in .spc files are imported as X-variable names.
Constituents in .cfl files are imported as Y-variables. “Y-values” are imported as separate
column sets with the name of the Y values for the columns.
Some .spc files contain a log block. This may include file names and sample numbers. To
import these, one can select Sample naming… and designate whether to use one, both or
none of these fields.
The binary part of the log block (which usually contains the imaginary part of complex
spectral data) is not imported, nor is the ASCII part of the log.

5.8.3 File – Import Data – GRAMS…


One or several GRAMS .spc files can be imported into a project in The Unscrambler®.

How to import data


Select the files to import from the file list in the GRAMS Import dialog box or use the Browse
button to obtain a list of available files. The different files must have the same number of X-
variables and the same contents in the Y-matrix to allow simultaneous import.
GRAMS Import

The source files may contain one or more samples per file (i.e. single spectra or multifiles [1]);
multiple selections allow one to import several samples with the same number of variables
at the same time. The dialog will include details about the files that are eligible for import. It
will show the number of samples per file, the number of X variables, number of Y variables,
and the starting and ending X variables.


Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import. If the data files also include Y values, these will also be imported.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.

Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears allowing a user to set
the Tolerance for allowing data with different start or end points to be imported.
Interpolate Tolerance Dialog

For more information see the section on Import Interpolate.

Auto select matching spectra


The Auto select matching spectra preview option allows the automatic selection of all
data file(s) with the same wavelength ranges as the current selection. A screenshot of the
GRAMS Import dialog with the auto select option chosen is provided below.


Once the Auto select matching spectra option has been checked, it will select only those files
that have the same number of variables as the first selected file.
Use the Interpolate option to import data with different start or end points.

Sorting data
The file name, number of samples, number of X-variables, wavelengths for the first and last
X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list. Click on a column header to set sort order, and a second time to reverse the sort
order.

Preview
Preview spectra displays a line plot of the files that have been selected for import.


[1] Multifiles are a specific kind of GRAMS file that has multiple spectra in a single file,
as opposed to a single spectrum per file.

5.9. GuidedWave
5.9.1 CLASS-PA & SpectrOn from Guided Wave
Type of data/instrument
spectrometer (UV, UV-vis, NIR)
Data dimensions
Single spectra, constituents
Instrument/hardware
CLASS-PA, SpectrOn
Vendor


Guided Wave
File name extension
*.asc, *.scn, *.autoscan, *.gva

 File format information


 How to use it

5.9.2 About Guided Wave CLASS-PA & SpectrOn data files


This option allows one to import data from Guided Wave instruments. The data files
typically have the extension .asc, .scn, .autoscan, or .gva, but may have another
extension, as the file type is not strictly defined by the extension.

5.9.3 File – Import Data – CLASS-PA & SpectrOn…


This option allows a user to import data from Guided Wave instrument files with the
following file name extensions: .asc, .scn, .autoscan.

How to import data


From the File – Import Data menu, select CLASS-PA & SpectrOn. The Guided Wave dialog
box displays a list of files from which one can import CLASS-PA & SpectrOn data. If
necessary, click the Browse button to access files from a different folder.
CLASS-PA & SpectrOn import

Multiple selections are possible, by checking the box next to more than one file. The
selected samples must be of the same size (variables must match).


Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names, sample numbers or timestamps in the resulting data table.
Sample names will only be imported if they are present in the source file.

Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears allowing a user to set
the Tolerance for allowing data with different start or end points to be imported.
Interpolate Tolerance Dialog

For more information see the section on Import Interpolate.

Y-variables
Constituents may also be imported by checking the following options:

 Import Y-variables
 Import Predicted Y-variables


Auto select matching spectra


The Auto select matching spectra preview option allows the automatic selection of all data
file(s) with the same wavelength ranges as the current selection. This dialog is used by
spectral data imports from instrument formats such as CLASS-PA & SpectrOn, GRAMS, OPUS,
etc. A screenshot of the Guided Wave Import dialog box with the auto select option chosen
is given below.

Use the Interpolate option to import data with different start or end points.

Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables, and step
(increase in wavelength), are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.

Preview
Preview spectra displays a line plot of the files that have been selected for import.


5.10. Import Interpolate


5.10.1 Interpolate functionality
It is commonly the case, particularly with Fourier transform (FT) spectrometers, that when
data are collected on different instruments of the same make, the starting and ending
wavenumbers may be slightly different even though the data were collected at the same
resolution.
When data is imported into The Unscrambler®, the import dialog relies on three important
pieces of information:

 Number of wavelengths/wavenumbers (points) in the spectrum


 The starting value of the spectra
 The ending value of the spectra


If there is a mismatch in any of these values, there are two possible scenarios:

 If the numbers of points in the spectra do not match, a matrix cannot be formed, as
the columns do not have the same dimension
 If the start points do not match, again a matrix cannot be formed; however, if the
differences between the values are small, interpolation can be used to reconcile
them.

The Interpolation function used in the Import menus is different from that found in Tasks -
Transform (which may be useful for trying to match data from two sets collected at different
resolutions).
Find out more in the section on the Interpolate transform.
Data Imports Supporting Interpolation
The following file imports support the interpolate functionality in The Unscrambler® import
dialog boxes.

 JCAMP-DX
 Thermo Galactic GRAMS
 OPUS (Bruker Optics)
 CLASS-PA & SpectrOn
 Indico (ASD)
 OMNIC™ (Thermo)
 Varian
 PerkinElmer

Functionality
When a file import supporting interpolate is selected, the Interpolate checkbox will be
present, as shown below.

The % button opens the Tolerance dialog box that has a slider bar for setting how far
beyond the reference spectrum limit to set the interpolation.
Tolerance Dialog

Any points that lie within +/- the set percentage tolerance of the starting point will be
included in the import.
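
The underlying idea can be illustrated with NumPy’s linear interpolation (a sketch with
invented values, not The Unscrambler®’s exact algorithm):

import numpy as np

# Reference wavenumber axis and a spectrum recorded on a slightly shifted axis.
ref_axis = np.linspace(4000.0, 12000.0, 1154)
shifted_axis = ref_axis + 2.5                        # start and end points differ slightly
spectrum = np.exp(-((shifted_axis - 8000.0) / 500.0) ** 2)

# Map the spectrum onto the reference axis so that all rows of the final
# data table share the same column headers; points outside the measured
# range are clamped to the end values.
aligned = np.interp(ref_axis, shifted_axis, spectrum)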
Example
Nine spectra were collected on three different Bruker spectrometers at a resolution of 8
wavenumbers. Three replicate spectra were collected on each instrument. Each spectrum
consists of 1154 points; however, the starting point of each spectrum is different. By
selecting the first spectrum and then checking the Auto select matching spectra box, only
the first three spectra are selected, as shown below,

To import all data into one table, check the Interpolate box and set the Tolerance to include
all spectra in the set, as shown below

When the Auto select matching spectra box is reselected, all spectra are now included in the
import, as shown below,


The data are now displayed as a node in the project navigator using the column headers of
the reference spectrum selected.

5.11. Indico
5.11.1 Indico
Type of data/instrument

Data dimensions
Single spectra
Software
Indico Pro 5.6 (version 6 files)
RS3 5.6 (version 7 files)
Indico Pro 6.0 (version 8 files)
Vendor
ASD Inc.
File name extension
*.asd, *.001, *.002, *.3456, etc. (any number)

 File format information


 How to use it

5.11.2 About ASD Inc. Indico data files


This option allows for the import of data files created with the ASD Inc. software. Currently
supported ASD files are version 6, generated from Indico Pro 5.6; version 7, generated from
RS3 5.6; and version 8, generated from Indico Pro 6.0.


5.11.3 File – Import Data – Indico…


This option allows a user to import data files created with the ASD Inc. software Indico Pro
and RS3. Source files with the following file name extensions are supported: .asd, .001,
.002, .3456, etc. (any number).

How to import data


Select the files to import from the file list in the Indico Import dialog box or use the Browse
button to obtain a list of available files. The Indico Import dialog box displays a list of files
from which one may import Indico data. This includes the file names, the number of X-
variables, names of the First and Last X-variables and step size.
INDICO Import

The source files contain one sample per file; multiple selection allows for the import of
several files (samples) at the same time.

Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.


Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears allowing a user to set
the Tolerance for allowing data with different start or end points to be imported.
Interpolate Tolerance Dialog

For more information see the section on Import Interpolate.

Auto select matching spectra


The Auto select matching spectra preview option allows the automatic selection of all data
file(s) with the same wavelength ranges as the current selection. This dialog is used by
spectral data imports from instrument formats such as Indico, GRAMS, OPUS, etc. A
screenshot of the Indico Import dialog with the auto select option chosen is given below.


Use the Interpolate option to import data with different start or end points.

Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables, and step
(increase in wavelength), are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.

Preview
Preview spectra displays a line plot of the files that have been selected for import.


5.12. JcampDX
5.12.1 JCAMP-DX
Type of data/instrument
Vector and arrays. Standard
Data dimensions
Multiple spectra, constituents
Vendor
JCAMP/IUPAC
File name extensions
*.jdx, *.dx, *.jcm

 File format information


 How to use it


5.12.2 About the JCAMP-DX file format


This is a standard, portable data format defined by JCAMP to support exchange of chemical
and spectroscopic information.
It was originally a standard data format for IR, which has since been extended to
accommodate NMR, mass spec and other data, motivated by the desire to share data
irrespective of the spectrometer on which it was acquired and the need for long-term data
archival, well past the expected lifetime of current hardware and software.
Further development of JCAMP standards is now under the auspices of IUPAC.

5.12.3 File – Import Data – JCAMP-DX…


One can import one or several JCAMP-DX files with .jdx, .dx, .jcm file name extensions
into a project in The Unscrambler®.

How to import data


Select the files to import from the file list in the JCAMP-DX Import dialog box or use the
Browse button to get a list of available files.
The different files must have the same number of X-variables and the same contents in the
Y-matrix to allow simultaneous import.
JCAMP-DX Import

Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all


Clear the current selection by unselecting all samples.


Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.

Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears allowing a user to set
the Tolerance for allowing data with different start or end points to be imported.
Interpolate Tolerance Dialog

For more information see the section on Import Interpolate.

Auto select matching spectra


The Auto select matching spectra preview option allows the automatic selection of all data
file(s) with the same wavelength ranges as the current selection.


Use the Interpolate option to import data with different start or end points.

Sorting data
The file name, number of samples, number of X variables, number of Y variables, and
wavelengths for the first and last X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.

Preview
Preview spectra displays line plots of selected files for import.


5.12.4 JCAMP-DX file format reference


This format is used by many spectroscopy instrument vendors, e.g. Bran+Luebbe
(IDAS/Infralyzer), NIRSystems (NSAS), Perkin Elmer, Thermo Fisher (Grams, Omnic), Bruker
(OPUS), etc.

General
JCAMP-DX are ASCII-files with file headers containing information about the data and their
origin, etc., and they may contain both X-data (spectra) and Y-data (concentrations).
Only the most essential information of the JCAMP-DX file will be imported. The first title in
the JCAMP-DX file will be used, and one has the additional option of also importing file
names and sample numbers. There is not a limit on the length of a file name. If several
JCAMP-DX files are imported and saved in the same Unscrambler® file, the matrix name will
be that of the first imported JCAMP-DX file.
JCAMP “X-values” (usually wavelengths) become X-variable names, while JCAMP “Y-values”
become X-variable values. “Concentrations” are interpreted as Y-variables. Variable names
are imported, with no limit on the number of characters. The “Sample description” is used
as sample names. Unfortunately there are different dialects of JCAMP-DX, so in some cases
one may lose e.g. sample names if they were used erroneously in the original file.
The XYPOINTS variant demands more disk space than XYDATA.
Examples of the XYDATA and XYPOINTS formats follow.

JCAMP-DX XYPOINTS
The example below shows only one sample.

##TITLE= DMCAL.DAT to DMCAL19.DAT using FILTER1.DAT wavelengths


##JCAMP-DX= 4.24 $IDAS 1.40
##DATA TYPE= NEAR INFRARED SPECTRUM
##ORIGIN= Bran+Luebbe Analyzing Technologies
##OWNER= Applications Laboratory
##DATE= 92/ 6/10 $$ WED
##TIME= 1: 0: 3
##BLOCKS= 14
##SAMPLE DESCRIPTION= WHE202CH $$ 1.00
##SAMPLING PROCEDURE= DIFFUSE REFLECTION
##DATA PROCESSING= LOG(1/R)
##XUNITS= NANOMETERS
##YUNITS= ABSORBANCE
##XFACTOR= 1.0
##YFACTOR= 0.000001
##FIRSTX= 1445
##LASTX= 2348
##FIRSTY= 0.652170
##MINY= 0.552445
##MAXY= 1.258505
##NPOINTS= 19
##CONCENTRATIONS= (NCU)
(<CARBOHYDRATE>, 89.400, %)
(<PROTEIN>, 9.410, %)
##XYPOINTS= (XY..XY)
1445, 652170; 1680, 555209; 1722, 606660; 1734, 612745;
1759, 604142; 1778, 575455; 1818, 552445; 1940, 631510;
1982, 657704; 2100, 1188830; 2139, 1082772; 2180, 1008640;
2190, 999405; 2208, 951049; 2230, 978299; 2270, 1198344;
2310, 1258505; 2336, 1209149; 2348, 1153169;
##END=

JCAMP-DX XYDATA
The example below shows only one sample.

##TITLE= Infralyzer 500 (5 NM Intervals)


##JCAMP-DX= 4.24 $IDAS 1.40
##DATA TYPE= NEAR INFRARED SPECTRUM
##ORIGIN= Bran+Luebbe Analyzing Technologies
##OWNER= Applications Laboratory
##DATE= 92/ 7/ 9 $$ THU
##TIME= 20:53:17
##BLOCKS= 14
##SAMPLE DESCRIPTION= COF12BUS $$ 1.00
##SAMPLING PROCEDURE= DIFFUSE REFLECTION
##DATA PROCESSING= LOG(1/R)
##XUNITS= NANOMETERS
##YUNITS= ABSORBANCE
##XFACTOR= 1.0
##YFACTOR= 0.000001
##FIRSTX= 1100
##LASTX= 2500
##FIRSTY= 0.139460
##MINY= 0.131600
##MAXY= 1.380070
##NPOINTS= 281
##CONCENTRATIONS= (NCU)
(<CARBOHYDRATE>, 89.400, %)
(<PROTEIN>, 9.410, %)
##DELTAX= 5
##XYDATA= (X++(Y..Y))
1100 139459 137435 135089 133060 131669 131599 133794 138899
1140 145740 151897 158459 167527 180800 195522 206585 216499
...
...
2460 1378929 1379632 1378464 1374972 1378929 1376837 1372945 1377632
2500 1380069
##END=
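
In the XYDATA example above, the stored Y values are raw integers that must be scaled by
##YFACTOR: the first raw value 139459 multiplied by YFACTOR = 0.000001 gives 0.139459,
which matches ##FIRSTY = 0.139460 up to rounding. Likewise, the X axis runs from
##FIRSTX = 1100 to ##LASTX = 2500 in ##DELTAX = 5 steps, giving
##NPOINTS = (2500 - 1100)/5 + 1 = 281.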

Instrument parameters for JCAMP files


The appropriate parameters in this field will be written to the exported JCAMP file.
More parameters can be included in the file if necessary. The user can type any
information into the field, but only text in the format ##KEYWORD = ..., as listed below, will
be used during export.
JCAMP keywords
Keyword      Legal values

AVERAGE=     INTEGER*4 > 0
GAIN=        REAL*4 >= 0.0
BASELINEC=   YES or NO
APCOM=       String60
JCAMP-DX=    String
ORIGIN=      String

5.13. Konica Minolta
5.13.1 Konica Minolta
Type of data/instrument
KONICA MINOLTA NIR spectrometer
Data dimensions
single vector spectrum or multiple spectra in an array
Instrument/hardware

Vendor
Konica Minolta
File name extension

 File format information


 How to use it

5.13.2 About Konica Minolta data files


This option allows for the import of data files created with a KONICA MINOLTA NIR
spectrometer.

5.13.3 File – Import Data – Konica Minolta…


This option allows a user to import data files from a KONICA MINOLTA NIR spectrometer. It
connects directly to the spectrometer to acquire data, and it also supports ASCII file import.

How to import data


Select the ASCII files to import using the Import button in the Konica Minolta Import dialog box.
Konica Minolta Import

Upon selection of ASCII files, the spectrum is displayed in the dialog box as a line plot. After
selecting multiple files, the user can click OK to import the data.
Konica Minolta Import


To acquire data directly from the instrument, click the Scan button.


The contents of all the spectra in the dialog will be merged to create one data matrix after
import.
Delete
Deletes the selected spectra.
Rename
Option to rename a spectrum.
Select/DeSelect
Use the left mouse button to select/unselect spectra for viewing in the plot.

5.14. Matlab
5.14.1 Matlab
Type of data
Array
Software
Matlab
Vendor
MathWorks, Inc.
File name extension
*.mat

 File format information


 How to use it


5.14.2 About Matlab data files


MATLAB is a numerical computing environment and fourth generation programming
language.
The Unscrambler® allows for the import of data from Matlab data files created with Matlab
versions 5.x to 7.0.

What cannot be converted


The following cannot be imported from Matlab to The Unscrambler®:

 Matrices containing imaginary numbers,


 Cell arrays,
 Structures,
 Sparse matrices.

To save data for importing


Use the save command in Matlab:

 either save destinationfilename var1 var2 ... ,


 or save destinationfilename to save all variables in the workspace.

This will create a Matlab formatted .mat file. For more help on using the save command,
type help save in Matlab.

5.14.3 File – Import Data – Matlab…


This option allows for the import of data from Matlab formatted files created in Matlab
versions 5.x to 7.0.

How to import data into The Unscrambler®


To import the file into The Unscrambler®, select File – Import Data – Matlab. Select the
saved file (the destinationfilename.mat created above) to open the Import Matlab dialog box.
Select which selections represent the Data, Sample names and Variable names. The sample
name and variable name variables must match the corresponding dimension of the data
variable (for example, 5 rows and 4 columns in the figure below) or they will not be
displayed in the drop-down lists with available sample and variable names.
Import Matlab dialog


Matlab variables representing sample and variable names must be character arrays.
What Cannot be Converted
The following cannot be imported from Matlab to The Unscrambler®:

 Matrices containing imaginary numbers,


 Cell arrays,
 Structures,
 Sparse matrices.

To Save Data for Importing


Use the save command in Matlab:

 either save destinationfilename var1 var2 ... ,


 or save destinationfilename to save all variables in the workspace.

This will create a Matlab formatted .mat file. For more help on using the save command,
type help save in Matlab.
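A minimal Matlab sketch of preparing such a file, using hypothetical variable names and matching the 5 rows and 4 columns example mentioned above:

% A 5 x 4 data matrix with name arrays whose number of rows
% matches the corresponding dimension of the data matrix.
Data = rand(5,4);                             % 5 samples x 4 variables
SampleNames = char('S1','S2','S3','S4','S5'); % character array, 5 rows
VarNames = char('V1','V2','V3','V4');         % character array, 4 rows
save exampledata Data SampleNames VarNames    % creates exampledata.mat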

5.15. MyInstrument
5.15.1 MyInstrument
Type of data/instrument
Instrument interface standard defined by Thermo Electron (formerly Galactic) and
supported by many instrument vendors.
A MyInstrument driver provided by the specific instrument vendor and the
corresponding MyInstrument add-on for The Unscrambler® are required. These
modules are available separately from CAMO Software and may not be part of the
standard package.

 Additional information
 How to use it

5.15.2 About the MyInstrument standard


The MyInstrument add-on for The Unscrambler® provides users with the ability to directly
acquire spectra from their spectrometers into The Unscrambler®. The acquisition process
makes use of the MyInstrument standard to allow for instrument configuration and
definition of experiments in order to run scans. The functionality provided is dependent on
the instrument. After acquisition the spectral data is directly inserted as rows per scan into
The Unscrambler® editor, ready for further processing or modeling. The MyInstrument
add-on removes the need for acquiring data using other instrument specific software, saving
to a file and then importing into The Unscrambler®.

5.15.3 File – Import Data – MyInstrument…


Working with the MyInstrument add-on
Start a session in The Unscrambler® and use the menu item which typically has the vendor
company name followed by MyInstrument…, e.g. for a Zeiss instrument: File – Import Data –
Zeiss MyInstrument…

The next window will show the vendor specific MyInstrument control screen, e.g. for a Zeiss
instrument:


The appearance and usage of the control dialog will depend on the particular instrument
vendor. Details of using the instrument interface will be available from the manuals provided
by the instrument vendor. Using the instrument may require specific configuration and
setup procedures provided by the vendor before being able to run scans.


Sample scan result. This may appear entirely different for the instrument being used and is
provided here only as an example.
Click OK to end the scan acquisition session. The scans should now be available within The
Unscrambler® editor for subsequent processing and modeling.


5.16. NetCDF
5.16.1 NetCDF
Type of data
Open standard for array-oriented data
Developed by
University Corporation for Atmospheric Research (UCAR)
File name extension
*.cdf, *.nc

 File format information


 How to use it

5.16.2 About the NetCDF file format


NetCDF (network Common Data Form) is a set of software libraries and machine-
independent data formats that support the creation, access, and sharing of array-oriented
scientific data.
What Is NetCDF?
NetCDF (network Common Data Form) is a set of interfaces for array-oriented data access
and a freely-distributed collection of data access libraries for C, Fortran, C++, Java, and other
languages. The NetCDF libraries support a machine-independent format for representing
scientific data. Together, the interfaces, libraries, and format support the creation, access,
and sharing of scientific data.
NetCDF data is:

 Self-Describing. A NetCDF file includes information about the data it contains.


 Portable. A NetCDF file can be accessed by computers with different ways of storing
integers, characters, and floating-point numbers.
 Scalable. A small subset of a large data set may be accessed efficiently.
 Appendable. Data may be appended to a properly structured NetCDF file without
copying the data set or redefining its structure.
 Sharable. One writer and multiple readers may simultaneously access the same
NetCDF file.
 Archivable. Access to all earlier forms of NetCDF data will be supported by current
and future versions of the software.

The NetCDF software was developed by Glenn Davis, Russ Rew, Ed Hartnett, John Caron,
Steve Emmerson, and Harvey Davies at the Unidata Program Center in Boulder, Colorado,
with contributions from many other NetCDF users.

5.16.3 File – Import Data – NetCDF…


NetCDF (network Common Data Form) is a set of software libraries and machine-
independent data formats that support the creation, access, and sharing of array-oriented
scientific data.


How to import data


Select the files to import from the file list in the dialog NetCDF Import or use the Browse
button to get a list of available files.
Select a .cdf file to import and then click Open.
NetCDF Import dialog

One can select Sample Names and Variable names as shown above.

5.17. NSAS
5.17.1 NSAS
Type of data/instrument
NIR
Data dimensions
Multiple spectra, constituents
Instrument/hardware
Foss 5000, 6500, XDS
Vendor
FOSS
File name extension
*.da, *.cn, *.cal

 File format information


 How to use it

5.17.2 About the NSAS file format


The NSAS file format originates from FOSS NIRSystems NIR instruments, and is a format from
their DOS-based NSAS software. Files can be saved from the FOSS WINISI software and FOSS
Vision software into the NSAS format.
See the technical reference for an overview of instrument parameters that The
Unscrambler® can import from NSAS data files.


5.17.3 File – Import Data – NSAS…


NSAS data import allows the import of NIR spectral data files generated by FOSS instruments
and accompanying constituents from the NSAS file format, which have the .da and .cn file
name extensions respectively.

How to import data


Select the files to import from the file list in the dialog NSAS Import or use the Browse
button to get a list of available files. The different files must have the same number of X-
variables and the same contents in the Y-matrix to allow simultaneous import.
NSAS Import

The source files may contain one or more samples per file; multiple selections allow several
samples to be imported at the same time.

Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.


Auto select matching spectra


The Auto select matching spectra preview option provides automatic selection of all data
file(s) with the same wavelength ranges as the current selection. This dialog is used to import
spectral data from instruments with the NSAS file format, as well as others. A screenshot of the
NSAS Import dialog with the auto select option chosen is given below.

Once Auto select matching spectra has been checked it will select the files having the same
number of variables from the list.

Sorting data
The file name, number of samples, number of X-variables, wavelengths for the first and last
X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.


Preview
Preview spectra displays a line plot of the files selected for import.

5.17.4 NSAS file format reference


This document describes the instrument parameters that can be imported from NSAS data
files. Files can be saved from the FOSS WINISI software and FOSS Vision software into the
NSAS format.
Instrument parameters from NSAS files
NSAS Data Import will read information from the NSAS data file that has no natural place in
The Unscrambler® file format into the Instrument Info block, under specific keywords.
Similarly, NSAS/Vision Model Export will look for a relevant subset of these keywords and, if
found, it will place the values in the corresponding places in the NSAS/Vision Model file.
The NSAS/Vision keywords are listed below.
NSAS/Vision keywords
Keyword Legal values


NSAS_InstrumentModel String representing integer > 0

NSAS_AmpType String: 1

NSAS_CellType String: 2

NSAS_Volume String: 3

NSAS_NumScans String representing integer > 0

NSAS_HasSampleTransport String: Yes/No

NSAS_ReferenceAcquiredInRefPos String: Yes/No

NSAS_SampleAcquiredInSamPos String: Yes/No

NSAS_OnlineInstrument String: Yes/No

NSAS_Math1_Type String representing integer > 0: 4

NSAS_Math2_Type =

NSAS_Math3_Type =

NSAS_Math1_SegmentSize String representing integer > 0

NSAS_Math2_SegmentSize =

NSAS_Math3_SegmentSize =

NSAS_Math1_GapSize String representing integer > 0

NSAS_Math2_GapSize =

NSAS_Math3_GapSize =

NSAS_Math1_DivisorPoint String representing integer > 0

NSAS_Math2_DivisorPoint =

NSAS_Math3_DivisorPoint =

NSAS_Math1_SubtractionPoint String representing integer > 0

NSAS_Math2_SubtractionPoint =

NSAS_Math3_SubtractionPoint =

NSAS_NumberOfConstituents String representing integer > 0

NSAS_NumberOfDataPoints String representing integer > 0

NSAS_StartingWaveLength String representing integer > 0

NSAS_EndWaveLength String representing integer > 0


NSAS_CreationDay String representing integer > 0

NSAS_CreationMonth String representing integer > 0

NSAS_CreationYear String representing integer > 0

NSAS_CreationHour String representing integer > 0

NSAS_CreationMinute String representing integer > 0

NSAS_CreationSecond String representing integer > 0

 NSAS_AmpType | String:
“Reflectance”, “Transmittance”, “(Reflect/Reflect)”, “(Reflect/Transmit)”,
“(Transmit/Reflect)”, “(Transmit/Transmit)”, “Not used”

 NSAS_CellType | String:
“Standard sample cup”, “Manual”, “Web analyzer”, “Coarse sample”, “Remote
reflectance”, “Powder module”, “High fat/moisture”, “Rotating drawer”, “Flow-
through liquid”, “Cuvette”, “Paste cell”, “Cuvette cell”, “3 mm liquid cell”, “30 mm
liquid cell”, “Coarse sample with sample dump”

 NSAS_Volume | String:
“1/4 full”, “1/2 full”, “3/4 full”, “Completely full”

 NSAS_Math[1-3]_Type | String representing integer > 0:


1 = “N-point smooth”, 2 = “Reflective energy”, 3 = “Kubelka-Munk”, 4 = “1st
derivative”, 5 = “2nd derivative”, 6 = “3rd derivative”, 7 = “4th derivative”, 8 =
“Savitsky & Golay”, 9 = “Divide by wavelength”, 10 = “Fourier transform”, 11 =
“Correct for reference changes”, 13 = “Full MSC”, 21 = “N-point smooth”, 22 = “1st
derivative”, 23 = “2nd derivative”, 31 = “Savitzky-Golay first derivative”

5.18. Omnic
5.18.1 OMNIC
Type of data/instrument
FTIR, FT-NIR, Raman
Data dimensions
Single spectra


Instrument/hardware
Nicolet IR, Antaris, NXR
Vendor
Thermo Scientific (Nicolet)
File name extension
*.spa, *.spg

 File format information


 How to use it

5.18.2 About Thermo OMNIC data files


Data generated by Thermo molecular spectroscopy instruments and related OMNIC
software.

5.18.3 File – Import Data – OMNIC…


This option allows for the import of data from OMNIC files generated by ThermoFisher
instruments and related software.
Source files with .spa or .spg file name extension are supported.

How to import data


The OMNIC Import dialog box displays a list of files from which one can import OMNIC
data.
If necessary, click the Browse button close to the Look in: field in order to access files from a
different folder.
OMNIC Import

The source files contain one sample per file. Multiple selection allows several files (samples)
to be imported at the same time.


Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.

Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears allowing a user to set
the Tolerance for allowing data with different start or end points to be imported.
Interpolate Tolerance Dialog

For more information see the section on Import_Interpolate

Auto select matching spectra


Auto select matching spectra preview option allows the automatic selection of all data file(s)
with the same wavelength ranges as the current selection. This dialog is used by input
spectral data from instruments with OMNIC file format. A screenshot of the OMNIC Import
dialog with the auto select chosen is given below.


Once the Auto select matching spectra option has been checked, it will select the files in
the list having the same number of variables.
Use the Interpolate option to import data with different start or end points.

Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables, and step
(increase in wavelength), are displayed for each file.
Step is the increment in wavelength (or wave number) between two successive variables.
The following relationship should be true:

First X-var + Step × (Number of X-vars − 1) = Last X-var
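For example, the JCAMP-DX file listed earlier has First X-var = 1100 nm, Step = 5 and 281 X-variables, and indeed 1100 + 5 × (281 − 1) = 2500 nm, which equals the Last X-var.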


The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.

Preview
Preview spectra displays a line plot of the files selected for import.


5.19. OPC
5.19.1 OPC protocol
Type of data/instrument
Standard data transfer protocol
Vendor
OPC Foundation
 File format information
 How to use it

5.19.2 About the OPC protocol


OPC (originally OLE for process control) is a non-proprietary technical specification created
with the collaboration of a number of leading worldwide automation hardware and software
suppliers, working in cooperation with Microsoft under the auspices of the OPC Foundation.
The original standard provided specifications for process data acquisition, making possible
interoperability between automation/control applications, field systems/devices and
business/office applications. The standard defines methods for exchanging real-time
automation data between PC-based clients using Microsoft operating systems. In 2009 a
new standard, OPC Unified Architecture, was developed, providing specifications for cross-
platform capability.
An OPC Server is often referred to as an OPC Driver. The two terms are synonymous.
An OPC Server is a software application that acts as an API (Application Programming
Interface) or protocol converter. An OPC Server will connect to a device such as a PLC, DCS,
RTU, or a data source such as a database or User interface, and translate the data into a
standard-based OPC format. OPC compliant applications such as a HMI (Human Machine
Interface), historian, spreadsheet, trending application, etc. can connect to the OPC Server
and use it to read and write device data. An OPC Server is analogous to the role a printer
driver plays to enable a computer to communicate with an ink jet printer. An OPC Server is
based on a Server/Client architecture.

5.19.3 File – Import Data – OPC…


Data can be imported into The Unscrambler® via OPC. This requires a connection with an
OPC server. Begin by selecting File – Import Data – OPC… to open the OPC Dialog menu.
OPC Dialog

All configured servers on the PC will be recognized, and displayed in the list of OPC servers.
The user must make selections for the Computer name/IP, the OPC Server, and the OPC
Group from the respective drop-down lists. The user may also type in the computer
name/IP, the OPC server, and the OPC group directly. Once these have been selected, the
available items will be listed in the OPC Items list. Select an item and click GO; the data will
be retrieved via OPC and will populate the fields in the OPC Import dialog. Click Stop to stop
the collection process from OPC; the data collected so far is shown in the preview.
OPC Tag - Use this option to specify the OPC tag directly. This is useful when many OPC
groups and OPC items are available on the servers, as it avoids the delay of listing and
selecting individual OPC groups and items.


Update Rate - This is the rate (in milliseconds) at which data is retrieved from the OPC
Server.
Show preview - Check this option to see the last 10 rows retrieved from the OPC
Server.
Set number of columns - The user should use this option to increase the number of
columns.
Filled OPC Dialog

Click OK to complete the import of the data into The Unscrambler®.

5.20. OSISoftPI
5.20.1 PI
Type of data
PI Server - real time data collection, archiving and distribution engines

 File format information


 How to use it

5.20.2 About supported interfaces


PI Import is an add-in that retrieves tags from compiled PI archives and servers, and writes
the data into The Unscrambler® workbook, which can then be used for regular plotting,
transformation and multivariate analysis. Tags are unique storage points for the data in the
PI system. Each tag is simply a single point of measurement.

5.20.3 File – Import Data – PI…


Data can be imported into The Unscrambler® via OSISoft PI.


The PI Import dialog allows the user to specify and connect to an active server. Click Add to
search a PI Server for tags using the Tag Search dialog. This dialog allows the user to search
all connected PI Servers for tags meeting a given set of criteria, such as one or more tag
attribute values. Tags can be selected using the Search option. Three different search
options are available in the Tag Search dialog: Basic, Advanced and Alias.
Tag Search dialog

After the tags are selected (use Ctrl key for multiple tag selection) from the search list panel
and OK is clicked, they can be seen in the Tags window of the PI Import dialog. For more
details on the options available in the Tag Search dialog box, click Help.
The three sections below describe the data modes used to preview and retrieve data for
the selected tags from the PI server.

Data Mode: Archive


This mode searches the archived data within the specified time range. For each tag, the values
recorded in the PI data source will be retrieved and
previewed in the preview list. The timestamp (for the specified tag in Tag No) can either be
imported as row header or first column from the tag.
Data Mode, Archive


Data Mode: Polling


The polling mode retrieves fresh data based on a timer-driven method for any of the three
events selected. The time interval can be selected in seconds and the Start Timer option will
watch for new data. For each tag, the new values recorded in the PI data source will be
retrieved, and can be previewed in the preview list. The timestamp (for the specified tag in
Tag No) can either be imported as row header or first column from the tag.
Data Mode, Polling


Data Mode: Event


The event-driven method retrieves fresh data based on any of the three events selected. The
Start Monitoring option will watch for new data. For each tag, the new values recorded in
the PI data source will be retrieved, and can be previewed in the preview list. The timestamp
(for the specified tag in Tag No) can either be imported as row header or first column from
the tag.
Data Mode, Event


The help option available in the PISDKUtility provides more details about the usage of the
PI-SDK configuration utility.

5.21. PerkinElmer
5.21.1 PerkinElmer
Type of data/instrument
UV-Vis, NIR, FTIR, Raman
Data dimensions
Multiple spectra
Instrument/hardware

Software
Spectrum 6, Spectrum 10
Vendor
PerkinElmer
File name extension
*.sp, *.spp

 File format information


 How to use it


5.21.2 About PerkinElmer instrument files


One or several spectra from files generated by PerkinElmer molecular spectroscopy
instruments (FTIR, Raman and UV-vis) using Spectrum 6 and Spectrum 10 software can be
imported.
When multiple spectra are contained in a file, the preference is to import the normalized
spectrum. However, if a file contains a single spectrum (sample or reference alone), then
that spectrum will be imported.

5.21.3 File – Import Data – PerkinElmer…


This option supports the import of data from files generated by some PerkinElmer
instruments.
In the PerkinElmer Import dialog box, one can choose a folder where files are stored. A list
of files from which data can be imported is then displayed.
Note: Multiple files that vary in their spectral range and resolution cannot be
imported together.

How to import data


Select the files to import from the file list in the dialog or use the Browse button to get a list
of available files. The different files must have the same number of X-variables to allow
simultaneous import.
PerkinElmer Import

Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.


The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.

Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears allowing a user to set
the Tolerance for allowing data with different start or end points to be imported.
Interpolate Tolerance Dialog

For more information see the section on Import_Interpolate


Use the Interpolate option to import data with different start or end points.

Auto select matching spectra


The Auto select matching spectra preview option provides automatic selection of all data
file(s) with the same wavelength ranges as the current selection. This dialog is used for
import of spectral data from PerkinElmer instruments. A screenshot of the dialog with the
auto select option chosen is given below.


Once Auto select matching spectra has been checked, the files in the list having the same
number of variables will be selected.
Use the Interpolate option to import data with different start or end points.

Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables are
displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.

Preview
Preview spectra displays a line plot of the files selected for import.


5.22. PertenDX
5.22.1 Perten-DX
Type of data/instrument
Vector and arrays. Standard
Data dimensions
Multiple spectra, constituents
Vendor
Perten Instruments following JCAMP/IUPAC
File name extensions
*.jdx, *.dx, *.jcm

 File format information


 How to use it


5.22.2 About the Perten Instruments JCAMP-DX file format


This is a standard, portable data format defined by JCAMP and modified by Perten to
support a few Perten-specific data types.
It was originally a standard data format for IR, which has since been extended to
accommodate NMR, mass spec and other data, motivated by the desire to share data
irrespective of the spectrometer on which it was acquired and the need for long-term data
archival, well past the expected lifetime of current hardware and software.
Further development of JCAMP standards is now under the auspices of IUPAC.

5.22.3 File – Import Data – Perten-DX…


One can import one or several Perten-DX files with .jdx, .dx, .jcm file name extensions
into a project in The Unscrambler®.

How to import data


Select the files to import from the file list in the Perten-DX Import dialog box or use the
Browse button to get a list of available files.
The different files must have the same number of X-variables and the same contents in the
Y-matrix to allow simultaneous import.
Perten-DX Import

Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all


Clear the current selection by unselecting all samples.


Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.

Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears allowing a user to set
the Tolerance for allowing data with different start or end points to be imported.
Interpolate Tolerance Dialog

For more information see the section on Import_Interpolate

Auto select matching spectra


The Auto select matching spectra preview option allows the automatic selection of all data
file(s) with the same wavelength ranges as the current selection.


Use the Interpolate option to import data with different start or end points.

Sorting data
The file name, number of samples, number of X variables, number of Y variables, and
wavelengths for the first and last X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.

Preview
Preview spectra displays line plots of selected files for import.


5.22.4 Perten-DX file format reference


This format is based on JCAMP-DX file format. For more information on JCAMP-DX see the
section on Import JCAMP File Format

General
Perten-DX supports additional tags specific to Perten Instruments. These are:
Tag name Imported in Unscrambler as

##OWNER Information box

##INSTRUMENT S/N Category variable

##SPECTROMETER S/N Category variable

##LONG DATE Sample header

##PERTEN-TYPES Category variable


##PERTEN-SAMPLEINFO Category variable

##PERTEN-REPACK Sample header

##PERTEN-REPEAT Sample header

Perten-DX file
The example below shows a sample Perten-DX file.

##TITLE=2
##INSTRUMENT S/N=1201530
##INSTRUMENT TYPE=DA7250
##SPECTROMETER S/N=SNIR2148
##JCAMP-DX=4.24
##DATATYPE= NEAR INFRARED SPECTRUM
##LONG DATE=2013-10-18T01:59:18+02:00
##SAMPLE DESCRIPTION=2
##SMOOTHED=YES
##XUNITS= Nanometers (nm)
##YUNITS= Absorbance
##CONCENTRATIONS= (NCU)
(Protein Dry basis,-9.973E+23,<unknown>)
##PERTEN-TYPES= (KV)
(Product Type, Wheat),
(Shape Type, Unknown),
(Tray Type, Large Tray. rotating)
##PERTEN-REPACK=1
##PERTEN-REPEAT=1
##PERTEN-SAMPLEINFO= (KV)
##XFACTOR= 1.0
##YFACTOR= 0.000000001
##FIRSTX= 950.00
##LASTX= 1650.00
##NPOINTS= 141
##DELTAX= 5.0
##XYDATA= (X++(Y..Y))
950.0 186225975 188992413 193629553 199835249 207323496 215294014
222310809 227316331 230163481
995.0 231218537 230973747 229930179 228344771 226101418 223436221
220348573 216993825 213526732
1040.0 210076812 206678859 203519066 200372073 197183083 193896477
190813849 187961026 185361544
1085.0 183060794 181031311 179367942 178144637 177316150 176997467
177158004 178485737 182057610
1130.0 189131917 200696556 216125124 233953784 253292157 272636547
291094037 307752989 322292848
1175.0 335720686 348497384 360603909 370580710 377233357 380561567
380739361 377437577 370749286
1220.0 361610474 351741516 342353572 334328973 327783482 322877222
319254364 316585214 314597761
1265.0 313006114 311340643 309259709 306673122 303654410 300820687
298877629 297995673 298450579


1310.0 300507674 304469670 310617035 318953135 329739582 342663051
357349953 373092331 389380072
1355.0 405360164 420025538 432690507 443690839 453913399 465033895
478927915 497519241 520603469
1400.0 547701532 578341832 610554253 641977198 670671475 694941644
714033309 728135504 737936222
1445.0 744584470 748870234 751802130 753593537 754701424 754774651
753793482 752142124 750221679
1490.0 747923597 745168624 742032801 738770350 735344011 731975306
728708573 725796673 723188418
1535.0 721043949 719373104 717859979 716709549 715573447 714720046
713740590 712450919 710535970
1580.0 708248969 705216090 701261550 696380943 690796672 684905943
678981726 673139165 666952182
1625.0 661182311 655418737 649996320 644795947 640163793 636351883 0 0 0
##END= $$ 2

5.23. RapID
5.23.1 RapID
Type of data
Array
Data dimensions
single vector spectrum
Instrument/hardware
Particle size analysers
Raman Spectrometers
Laser Induced Breakdown Spectrometers (LIBS)
Vendor
rap-ID Particle Systems
File name extension
*.txt, *.jcm

 File format information


 How to use it

5.23.2 About RapID data files


This option allows for the import of .txt and .jcm data files from rap-ID particle size
analyzer instruments.

5.23.3 File – Import Data – rap-ID…


One or several rap-ID files (.txt or .jcm) can be imported into a project in The Unscrambler®.

How to import data


Select the files to import from the file list in the RAP-ID Import dialog or use the Browse
button to display a list of available files. The different files must have the same number of X-
variables to allow simultaneous import.
RAP-ID Import


The source files contain a single sample per file.

Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.


Auto select matching spectra


The Auto select matching spectra preview option allows the automatic selection of all
data file(s) with the same wavelength ranges as the current selection. A screenshot of the
RAP-ID Import dialog with the auto select chosen is provided below.

Once Auto select matching spectra has been checked it will select only those files that have
the same number of variables.

Sorting data
The file name, number of samples, and number of X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.

Preview
Preview spectra displays a line plot of the files selected for import.


5.24. U5Data
5.24.1 U5 Data
File name extension
*.UNS

 File format information


 How to use it

5.24.2 About The Unscrambler® 5.0 data files


Imports data files from earlier versions of The Unscrambler® (versions 3.0 - 5.5). If the file
to be imported contains several matrices, a dialog pops up to let the user specify which
matrices to import.


Note: The Unscrambler® recognizes the extensions: .UNS, .UNM, .UNP, and .CLA.
Rename the files if they have other extensions.

5.24.3 File – Import Data – U5 Data…


Imports data files from earlier versions of The Unscrambler® (versions 3.0 - 5.5). If the file to
be imported contains several matrices, all of the matrices will be available to import. The
user can define which matrices to import. When multiple matrices are selected, they will be
combined into a single matrix.

How to import U5 data


Select the files to import from the file list in the U5 Import dialog box or use the Browse
button to obtain a list of available files. The U5 Import dialog box displays a list of matrices
from which one may import U5 data. This includes the matrix names, the number of rows,
and the number of columns. When selecting multiple matrices, use the radio buttons at the
top to specify whether they should be combined in terms of rows or columns.
U5 Data import


5.25. UnscFileReader
5.25.1 The Unscrambler® 9.8
Type of data
Array
Software
The Unscrambler® 9.8
Vendor
CAMO Software
File name extensions
*.??M, *.??D

 File format information


 How to use it


5.25.2 About The Unscrambler® 9.8 file formats


The Unscrambler® X features a new file format, but files created by versions 9.2 to 9.8 can
be imported.
More details.

5.25.3 File – Import Data – Unscrambler…


Import data and model matrices from files made by versions 9.2 to 9.8 of The Unscrambler®
into the Editor.
Select a file and the imported data and plots will appear in the project navigator.
Not all plots are available for models that were created in versions of The Unscrambler®
before 9.8. In such instances, the user is recommended to import the data, and rebuild the
models.

5.25.4 The Unscrambler® 9.x file format reference


The Unscrambler® 9.x used the file name extensions listed below to distinguish between
different data types:
The Unscrambler® 9.x files File name extension

Non-designed raw data .00D

Fractional factorial design .01D

Full factorial design .02D

Combined design .03D

Central Composite design .04D

Plackett-Burman design .05D

Box-Behnken design .06D

D-optimal design .07D

Statistics .10D

PCA .11M

Analysis of Effects .20D

Response Surface .21D

Prediction .30D

Classification .31D

MLR .40M

PLS1 .41M

PLS2 .42M


PCR .43M

Three-way PLS .44M

MSC .50D

Lattice design (mixtures) .60D

Centroid design (mixtures) .61D

Axial design (mixtures) .62D

D-optimal mixture design .63D

3-D data table .70D


Each of the .??D files above may have the following corresponding additional files:

 .??L Log file


 .??P Preference file (settings for the file when it closes)
 .??T Notes file
 .??W Warnings file

The Unscrambler® 9.8 introduced a merged file format combining .??[DLPTW] into one file,
.??M.
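For example (with a hypothetical file name), a full factorial design table stored as MYDATA.02D may be accompanied by MYDATA.02L, MYDATA.02P, MYDATA.02T and MYDATA.02W; from version 9.8 the same content can be stored in a single merged MYDATA.02M file.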
A few details to remember about the file sets that comprise each data table or saved result:

 When transferring data to another place using the Windows Explorer, make sure
that all the associated physical files are copied!
 Do not change the file name extensions The Unscrambler® uses. Doing so may
create problems accessing the files from within The Unscrambler®.
 The log and notes files are plain ASCII files which can be opened and viewed using a
text editor.

5.26. UnscramblerX
5.26.1 The Unscrambler® X
Type of data
Array
Software
The Unscrambler® X
Vendor
CAMO Software
File name extensions
*.unsb

 File format information


 How to use it


5.26.2 About The Unscrambler® X file format


The native file format used by The Unscrambler® X has the .unsb file name extension. It is a
proprietary binary format made specifically for The Unscrambler® to provide fast and
efficient storage of large data sets and multivariate models.

5.26.3 File – Import Data – Unscrambler X…


This option allows one to import data tables and models from another The Unscrambler® X
project file.
How to import data
Use File – Import Data – Unscrambler X…

After selecting the import target, click OK to enter the Import dialog.


Select a data set or model to import.

5.27. Varian
5.27.1 Varian
Type of data/instrument

Data dimensions
Multiple spectra, constituents
Instrument/hardware
Cary UV-Vis
Software

Vendor
Varian, Inc.
File name extension
*.bsw

 File format information


 How to use it

5.27.2 About Varian data files


This option allows one to import data from files generated by Varian UV-Vis instruments and
related software.
Source files with .bsw file name extension are supported.


5.27.3 File – Import Data – Varian…


This option allows one to import data from files generated by Varian instruments and
related software (Cary UV-Vis instruments).
Source files with .bsw file name extension are supported.

How to import data


The Varian Import dialog box displays a list of files from which one can import Varian data.
If necessary, click the Browse button close to the Look in: field in order to access files from a
different folder.
VARIAN Import

The source files may contain one or more samples per file. Multiple selections allow several
samples to be imported at the same time.

Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.


Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears allowing a user to set
the Tolerance for allowing data with different start or end points to be imported.
Interpolate Tolerance Dialog

For more information see the section on Import_Interpolate


Use the Interpolate option to import data with different start or end points.

Auto select matching spectra


The Auto select matching spectra preview option provides automatic selection of all the data
file(s) with the same wavelength ranges as the current selection. This dialog is used to import
spectral data from instruments with the Varian file format.


Once the Auto select matching spectra option has been checked, the files in the list having
the same number of variables will be selected.
Use the Interpolate option to import data with different start or end points.

Sorting data
The file name, number of samples, number of X variables, number of Y variables, and
wavelengths for the first and last X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.

Preview
Preview spectra displays a line plot of the files selected for import. A
screenshot of the Varian Import dialog with the preview spectra chosen is given below.


5.28. VisioTec
5.28.1 VisioTec
Type of data/instrument

Data dimensions
single vector spectrum or multiple spectra in an array
Instrument/hardware

Vendor
VisioTec
File name extension
*.ldf, *.dat

 File format information


 How to use it


5.28.2 About VisioTec data files


This option allows for the import of data files created with the Uhlmann VisioTec NIR
Inspection systems.

5.28.3 File – Import Data – VisioTec…


This option allows a user to import data files created with the Uhlmann VisioTec NIR
inspection systems. Source files with the following file name extensions are supported:
.ldf or .dat.

How to import data


Select the files to import from the file list in the VisioTec Import dialog box or use the
Browse button to obtain a list of available files. The VisioTec Import dialog box displays a list
of files from which one may import VisioTec data. This includes the file names, the number
of X-variables, names of the First and Last X-variables and step size.
VisioTec Import

The source files may contain one or many samples per file; multiple selection allows for the
import of several files (blocks of data) at the same time.

Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.


Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.

6. Export
6.1. Exporting data
This section describes how to export data from The Unscrambler®.

6.1.1 Supported data formats


The Unscrambler® can export data in the following data formats:

 ASCII
 JCAMP-DX
 NetCDF
 Matlab
 AMO: The Unscrambler® ASCII Model
 DeltaNu

6.1.2 How to export data


Select a format from the File – Export menu, which will open an Export dialog specific to the
given file format.
After selecting the model, or the data matrix and range to export, entering meta data and
other storage options, press OK to specify the directory and file name to save the exported
data to.

6.2. AMO
6.2.1 Export models to ASCII
The Unscrambler® ASCII-MOD file is an ASCII-based file format used to transfer models from
The Unscrambler® to compatible instruments and prediction software.

 File format information


 How to use it

6.2.2 About the ASCII-MOD file format


The Unscrambler® ASCII-MOD file is an easy-to-read ASCII-based file format capable of
representing models created by The Unscrambler® and contains all information necessary
for prediction and classification.
The file format is used to transfer models to compatible instruments and prediction
software.
The files are saved with a .amo file name extension.

6.2.3 File – Export – ASCII-MOD…


ASCII-MOD export dialog


Select model
A drop-down list contains all models found in the currently open project. Select the
one to export.
Type
Choose between Full and Short prediction storage, where the second is used to
achieve smaller file size when only the regression coefficients are used for
prediction.
PCs
The number of Principal Components or factors to include in the exported model.
Y-Variable
Select the Y-variables to be included with the model.
Press OK and use the file dialog to select the destination directory and give a file name to
save the model.

6.2.4 ASCII-MOD file format reference


File structure
An ASCII-MOD file contains all information necessary for prediction and classification.
The ASCII-MOD file is an easy-to-read ASCII file. The table below lists the matrices which are
found in the ASCII-MOD file, depending on the type of ASCII-MOD file and type of model.
When generating an ASCII-MOD file, one can choose between “Short” (referred to as “Mini”
in previous versions of the software) and “Full” storage. Matrices stored under these options
are indicated with ‘x’ in the table.
ASCII-MOD file matrices
Matrix name Short Full PCA Full Regr. Rows Columns

B x x PC (1-a) X-var (1-x)

B0 x x PC (1-a) 1 row

xWeight x x 1 row X-var (1-x)

yWeight x 1 row Y-var (1-y)

xCent x x 1 row X-var (1-x)

yCent x 1 row Y-var (1-y)

ResXValTot x x PC (0-a)


ResXCalVar x x PC (0-a) X-var (1-x)

ResXValVar x x PC (0-a) X-var (1-x)

ResYValVar x PC (0-a) Y-var (1-y)

ResXCalSamp x x PC (0-a) Samp (1-i)

Pax x x PC (1-a) X-var (1-x)

Wax x x PC (1-a) X-var (1-x)

Qay x PC (1-a) Y-var (1-y)

SquSum x x [1] PC (1-a)

HiCalMean x PC (1-a) 1 row

ExtraVal x 1 row [2]

RMSECal x PC (1-a) Y-var (1-y)

TaiCalSDev x x PC (1-a) 1 row

xCalMean x x 1 row X-var (1-x)

xCalSDev x x 1 row X-var (1-x)

xCal x x 1 row X-var (1-x)

yCalMean x 1 row Y-var (1-y)

yCalSDev x 1 row Y-var (1-y)

yCal x 1 row Y-var (1-y)


Table of result matrices:

 SquSumT, SquSumW, SquSumP, SquSumQ, MinTai, MaxTai


 RMSEP, SEP, Bias, Slope, Offset, Corr, SEPcorr, ICM-Slope, ICM-Offset

Note: The contents of the columns “Rows” and “Columns” shows the contents of
the ASCII-MOD file, not the contents of the matrices in the main model file.

Example of an ASCII-MOD File

TYPE=FULL // (MINI,FULL)
VERSION=1
MODELNAME=F:\U\EX\DATA\TUTBPCA.11D
MODELDATE=10/27/95 11:41:13
CREATOR=Joe Doe
METHOD=PCA // (PCA, PCR, PLS1, PLS2)
CALDATA=F:\U\EX\DATA\TUTB.00D
SAMPLES=28
XVARS=16
YVARS=0
VALIDATION=LEVCORR // (NONE,LEVCORR,TESTSET,CROSS)
COMPONENTS=2
SUGGESTED=2
CENTERING=YES // (YES,NO)
CALSAMPLES=28
TESTSAMPLES=28
NUMCVS=0
NUMTRANS=2
TRD:DNO // ,,,,,,,complete transformation string
TRD:DSG // ,,,,,,,complete transformation string
NUMINSTRPAR=1
##GAIN=5.2
MATRICES=13
"xWeight" // (Name of 13 matrices)
"xCent"
"ResXValTot"
"ResXCalVar"
"ResXValVar"
"ResXCalSamp"
"Pax"
"Wax"
"SquSum"
"TaiCalSDev"
"xCalMean"
"xCalSDev"
"xCal"
%XvarNames
"Xvar1" "Xvar2" "Xvar3" "Xvar4"
"Xvar5" "Xvar6" "Xvar7" "Xvar8"
"Xvar9" "Xvar10" "Xvar11" "Xvar12"
"Xvar13" "Xvar14" "Xvar15" "Xvar16"
%xWeight 1 16
.1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01
.1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01
.1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01
.1000000E+01
%xCent 1 16
.1677847E+01 .2258536E+01 .2231011E+01 .2404268E+01 .2179311E+01
.2470489E+01 .2079168E+01 .1734536E+01 .1475164E+01 .1480657E+01
.1644097E+01 .1805900E+01 .1980229E+01 .1795443E+01 .1622796E+01
.1497418E+01
,,,
,,,etc.
Description of fields
The below table lists the data field codes used in ASCII-MOD files.
Description of fields
Field Description

TYPE (MINI,FULL) MINI gives “Prediction Light” only

VERSION Increases by one for each change of the file format after release

MODELNAME Name of model


MODELDATE Date for creation of the model (not the ASCII-MOD file)

CREATOR Name of the user who made the model (not the ASCII-MOD file)

METHOD Calibration method (PCA, PCR, PLS1, PLS2) 1

CALDATA Name of data set used to make the model

SAMPLES Number of samples used when making the model

XVARS Number of X variables used when making the model

YVARS Number of Y variables used when making the model

VALIDATION (TEST,LEV,CROSS)

COMPONENTS Number of components present in the ASCII-MOD file

SUGGESTED Suggested number of components to use (may not be on the ASCII-MOD file)

CENTERING (YES,NO)

CALSAMPLES Number of calibration samples

TESTSAMPLES Number of Test samples

NUMCVS Number of Cross Validation Segments

NUMTRANS Number of transformation strings

INSTRUMENT PARAM. See below

TRANSFORMATIONS Number of transformations

MATRICES Number of matrices on this file. One name for each matrix follows below

Transformation strings
There is one line for each transformation. The format of the line depends on the type of
transformation. If a transformation needs more data, as is the case for MSC, this extra
data will be stored as matrices at the end of the file. These matrices can be referenced by
name.
Examples
A transformation named TRANS using one parameter could look like this:

TRANS:TEMP=38.8;
A MSC transformation may look something like this:

MSC:VARS=19,SAMPS=23,MEAN="ResultMatrix18",TOT=" ResultMatrix19";
Transformation strings may also contain an error status; this occurs, for example, when the
MSC base has been deleted from the file before the ASCII-MOD file was made.


Transformation strings
Main Description Secondary Description

ANA Analysis… AOE Analysis of Effects

CLA Classification

MLR Multiple Linear Regression

PCA Principal Component Analysis

PCR Principal Component Regression

PL1 Partial Least Squares 1

PL2 Partial Least Squares 2

PRE Prediction

RES Response Surface Analysis

STA Statistics

APP Append… SAM Sample

VAR Variable

COM Compute… MAT Matrix

VEC Vector

DEL Delete… SAM Sample

VAR Variable

IMP Import —

INS Insert… SAM Sample

VAR Variable

REP Replace —

SHI Shift Variables —

SOR Sort Samples —

TRA Transform… ATR Absorbance to Reflectance

BAS Baseline

DNO Norris Derivative

DSG S. Golay Derivative

MNO Maximum Normalization


MSC Multiplicative Scatter Correction

NOI Added Noise

NOR Mean Normalization

RED Reduce

RNO Range Normalization

RTA Reflectance to Absorbance

RTK Reflectance to Kubelka-Munck

SMA Moving Average Smoothing

SSG S. Golay Smoothing

TSP Transpose

USR User-Defined

Storage of matrices
Each matrix starts with a header as in this example:

%Pax 10 155
This says that the matrix name is Pax and that the matrix has the dimensions 10 rows by
155 columns. From the next line, the data elements follow in this sequence:

Pax(1,1) Pax(1,2) Pax(1,3) ... Pax(1,7)
Pax(1,8) Pax(1,9) ... Pax(1,xvars-1) Pax(1,xvars)
Pax(2,1) Pax(2,2) Pax(2,3) ...
...
Pax(comp,1) Pax(comp,2) ... Pax(comp,xvars)
A missing value will simply be written as the character m.

 If the calibration model was made using 1 Y variable, the AMO file uses PLS1; if it was
created using more than one Y variable, it uses PLS2.

6.3. ASCII
6.3.1 ASCII export
The ASCII export option is very useful if one wants to work with the data table in another
program.

 File format information


 How to use it


6.3.2 File – Export – ASCII…


Many other programs can read ASCII files. This export option is therefore very useful if one
wants to work with the data table in another program.
ASCII export dialog

Select the matrix and data ranges that make up the data to be exported, or use Define to
create a new range.

Options
Include headers
Specify whether sample names and variable names are to be exported by selecting them in
the Include headers field. They will be placed in the first column and in the first row,
respectively.
Name qualifier
String data, such as headers, may be quoted, using either double quotes ", or single
quotes '.
It is recommended to mark text with quotes and not mark numbers, because it
makes it easier for importing programs to assign correct data types to text and
numbers.
Default is ".
Numeric qualifier
Numeric data, may be quoted similar to headers.


Default is None.
Item delimiter
Table cell entries may be delimited by different characters.
Default is ,.
String representation of missing data
Specify how missing data are to be coded in the ASCII file.
Default is m.
For compatibility with software that doesn’t have support for importing missing data
as strings, use a large negative number, such as -9.9730e+023 instead.
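As an illustration, with the default settings (headers included, text quoted with ", items delimited by commas, missing data written as m), the first rows of an exported file could look like the following; the sample and variable names are hypothetical:

"","Var1","Var2","Var3"
"Sample1",1.23,0.56,m
"Sample2",1.30,0.58,0.91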

6.4. DeltaNu
6.4.1 DeltaNu
The DeltaNu file is a model file format developed for use with the DeltaNu Pharma-ID Raman
spectrometers. It contains all the necessary information for projection and classification. PCA
models created in The Unscrambler® X can be exported to this file format. Such models are
compatible with DeltaNu Raman instrumentation for real-time projections.
The files are saved with a .dnub file name extension.

 File format information


 How to use it

6.4.2 File – Export – DeltaNu…


To export a PCA model to the DeltaNu format, go to File – Export – DeltaNu… and the following
dialog will appear.
DeltaNu export dialog

Select model
A drop-down list contains all models found in the currently open project. Select the
one to export. Only PCA models are supported in the DeltaNu format.
PCs
The number of Principal Components to include in the exported model. The default
value given is the optimal number of PCs for the model. It is recommended to export
a model with the optimal number of PCs. To export the model with a different
number of PCs use the drop-down list to choose a different number of PCs.


Press OK and use the file dialog to select the destination directory and give a file name to
save the model.

6.5. JCampDX
6.5.1 JCAMP-DX export

 File format information


 How to use it

6.5.2 File – Export – JCAMP-DX…


The JCAMP-DX format is read by many instrument software packages. This file format requires that
the X-part of the data have numerical names, e.g. wavelengths, wavenumbers, retention
times, etc.
JCAMP-DX export dialog: Select data

Select the matrix and data ranges that make up the data to be exported, or use Define to
create a new range.

Metadata
Then, in the File Info tab, enter information related to the JCAMP-DX file as a whole. Here
one must choose between two JCAMP-DX formats: XYPoints and XYData. XYData requires
that the distance between each variable is the same throughout the whole X-Variable Set.
XYData produces smaller file sizes than XYPoints.
JCAMP-DX export dialog: File info


Title
Name of the data set
Origin
Can be the name of the lab, client name, batch number, or location where data
came from.
Owner
Name of the person conducting the experiment or the analysis.
Enter information related to the samples in the Samples Info tab. This information is saved
with each sample.
JCAMP-DX export dialog: Sample info


Sample names
Select either Use sample name from data table or Use text to specify manually
Sampling procedure
Details on how the data was collected.
Data processing
List the transformations applied to prepare the data.
Data type
Select appropriate value from the drop-down list.
X units
Select appropriate value from the drop-down list.
Y units
Select appropriate value from the drop-down list.
Click OK to save the file.

6.6. Matlab
6.6.1 Matlab export
The Unscrambler® provides the capability to export data tables to Matlab including sample
names (row headings in The Unscrambler®) and variable names (column names in The
Unscrambler®).

 File format information


 How to use it

6.6.2 File – Export – Matlab…


The Unscrambler® provides the capability to export data tables to Matlab including sample
names (row headings in The Unscrambler®) and variable names (column names in The
Unscrambler®).
Matlab export dialog

Select the matrix and data ranges that make up the data to be exported, or use Define to
create a new range.


Options
Select whether sample and variable names should be exported. If this option is selected,
the names are stored in separate arrays within the export file, as is customary in Matlab.
Select Use Compression to apply gzip compression to the arrays stored in the Matlab file.
This will reduce the file size.
The exported data is saved as filename.mat, where “filename” represents the name
entered for the file on saving.

Reading the file in Matlab


To load the converted file, type load filename in the Matlab command window. If the data
are exported without sample and variable names, the filename.mat file contains one
variable called “Matrix” that contains The Unscrambler® worksheet data.
Sample and variable names
If the data are exported with sample and variable names, the file contains 2
additional arrays: “ObjLabels” and “VarLabels”.
“ObjLabels” contains row (sample) names.
“VarLabels” contains column (variable) names.
Both are character arrays.
Missing Value Conversion
Missing values in a worksheet in The Unscrambler® are converted to the number
-9.9730e+023.
Converting category variables
Category variables are converted into integers.
Note: The array names (“Matrix”, “VarLabels”, and “ObjLabels”) are the same in
each exported file from The Unscrambler®. Thus, if several converted files are
loaded into Matlab, rename the variables in Matlab after each load command, or
they will be overwritten by the next file loaded.
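For users who prefer to read the exported file outside of Matlab, the following is a minimal
Python sketch using scipy (the file name is a placeholder; the array names and the
missing-value constant are as documented above):

    import numpy as np
    from scipy.io import loadmat

    contents = loadmat("filename.mat")     # placeholder file name
    data = contents["Matrix"].astype(float)

    # Replace The Unscrambler missing-value code with NaN.
    data[np.isclose(data, -9.9730e23)] = np.nan

    # Sample and variable names are present only if that export option
    # was selected; both are character arrays.
    if "ObjLabels" in contents:
        sample_names = [str(row).strip() for row in contents["ObjLabels"]]
        variable_names = [str(col).strip() for col in contents["VarLabels"]]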

6.7. NetCDF
6.7.1 NetCDF export

 File format information

 How to use it

6.7.2 File – Export – NetCDF…


NetCDF (Network Common Data Format) is a set of software libraries and machine-
independent data formats that support the creation, access, and sharing of array-oriented
scientific data.
Upon choosing File – Export – NetCDF… an export dialog will open:


Select the matrix and data ranges that make up the data to be exported, or use Define to
create a new range.

Metadata
In the field Global Attributes, enter all other relevant details:
Data set origin
Can be the name of the lab, client name, batch number, or location where data
came from.
Equipment ID
Can be the product name, product number, serial number, or IP address of the
instrument used.
Equipment manufacturer
Name of the instrument vendor.
Equipment type
Type of instrument used, e.g. NIR.
Operator name
Name of the person conducting the experiment or the analysis.
Experiment date time
Date and time of the data collection. It is suggested to enter the date according to
the ISO 8601 standard, e.g. 2010-01-27T09:55:41+0100.
All attributes are optional. It is generally recommended to add metadata to files for better
file search results.
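Files in this format can be read with any NetCDF-capable software. As a minimal sketch (the
file name is a placeholder, and the attribute and variable names in a given exported file may
differ), the file can be inspected in Python with the netCDF4 package:

    from netCDF4 import Dataset

    with Dataset("export.nc") as ds:       # placeholder file name
        # Global attributes hold the metadata entered in the dialog.
        for name in ds.ncattrs():
            print(name, "=", getattr(ds, name))
        # The array-oriented data itself is stored in variables.
        print(list(ds.variables))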


6.8. UnscFileWriter
6.8.1 Export models to The Unscrambler® v9.8
The Unscrambler® 9.8 file format is the previous file format, and models in this format contain
all the information necessary for prediction and classification. Models (PCA, MLR, PCR and PLS)
created in The Unscrambler® X can be exported to this previous file format using the File
Writer plug-in. Such models are compatible with OLUP and OLUC 9.8 software for real-time
classification and prediction.

 File format information

 How to use it

6.8.2 About The Unscrambler® file format


Model files (MLR, PCR, PLSR and PCA) can be exported to The Unscrambler® 9.8 format using
the File Writer plug-in.
Some methods and features that were not available in Unscrambler® 9.8 cannot be
exported. These include:
 Models registered with the following pretreatments
 Orthogonal Signal Correction (OSC)
 Correlation Optimized Warping (COW)
 Weights
 Deresolve
 Quantile Normalization
 Basic ATR correction (Spectroscopic transformation)
 Models with cross validation based on category variable
 The following classification models
 Linear Discriminant Analysis (LDA, PCA-LDA)
 Support Vector Machine Classification (SVM-C)
 SIMCA classification
 Support Vector Machine Regression (SVM-R)
 Prediction, classification or projection results from The Unscrambler® X
The Unscrambler® 9.x used the file name extensions listed below to distinguish between
different data and model types:
The Unscrambler® 9.x files    File name extension
Non-designed raw data         .00D
PCA                           .11M
MLR                           .40M
PLS1                          .41M
PLS2                          .42M
PCR                           .43M


6.8.3 File – Export – Unscrambler…


Unscrambler export dialog

Available models
A drop-down list contains all models found in the currently open project that can be
exported. Select the one to export.
Model Information
This contains details about the selected model.
Notes
The time the chosen model was created is given here, along with any other
information that has been added to the Notes section of the chosen model. Users
may also add additional information in the Notes section, which will be available in
the exported model.
Save model with components
Use the components box to select the correct number of components for saving the
model in 9.8 format. The set number of components for the model will be displayed
and used by default.
Save as micro model
The check box allows the user to save the model in 9.8 micro format.
Press OK and use the file dialog to select the destination directory and give a file name to
save the model.

7. Plots
7.1. Line plot
A line plot displays a single series of numerical values with a label for each element. The plot
has two axes:

 The horizontal axis shows the labels, in the same physical order as they are stored in
the source file;
 The vertical axis shows the scale for the plotted numerical values.

The points in this plot can be represented in several ways:


As a Curve
A curve linking the successive points is most relevant for studying a profile, or when the
labels displayed on the horizontal axis are ordered in some way (e.g. PC1, PC2, PC3).
Line Plot: Curve display for following a batch evolution

With Symbols
Symbols produce the same visual impression as a 2-D scatter plot (see Scatter Plot),
and are therefore not recommended.
Line plot: symbol display


Several series of values which share the same labels can be displayed on the same line plot.
The series are then distinguished by means of colors.
Line plot: 2 series with curve display

7.2. Bar plot


A bar plot displays a single series of numerical values with a label for each element. The plot
has two axes:

 The horizontal axis shows the labels, in the same physical order as they are stored in
the source file;
 The vertical axis shows the scale for the plotted numerical values.

The vertical bars emphasize the relative size of the numbers.


Bar plot of a series


Several series of values which share the same labels can be displayed on the same bar plot.
The series are then distinguished by means of colors, and an additional layout is possible:
accumulated or stacked bars. Accumulated bars are relevant if the sum of the values for
series1, series2, etc. has a concrete meaning (e.g. total production or composition).
Two layouts of a bar plot for two series of values: Bars and Accumulated Bars


7.3. Scatter plot


A 2-D scatter plot displays two series of values which are related to common elements. The
values are shown indirectly, as the coordinates of points in a 2-dimensional space: one point
per element.
As opposed to the line plot, where the individual elements are identified by means of a label
along one of the axes, both axes of the 2-D scatter plot are used for displaying a numerical
scale (one for each series of values), and the labels may appear beside each point.


Various elements may be added to the plot, to provide more information:

 A regression line visualizing the relationship between the two series of values

Scatter plot with the regression line

 A target line, valid whenever the theoretical relationship should be “Y=X”

Scatter plot with the target and the regression lines


 Plot statistics, including among others the slope and offset of the regression line
(even if the line itself is not displayed) and the correlation coefficient.

Scatter plot with statistics and the regression line

7.4. 3-D scatter plot


A 3-D scatter plot displays three series of values which are related to common elements. The
values are shown indirectly, as the coordinates of points in a 3-dimensional space: one point
per element.
A 3-D scatter plot


All the plots can be customized. This is done from the properties dialog which is accessed by
a right click on the plot and the selection of the Properties menu,

or by selecting the properties shortcut from the toolbar


When selecting the Properties menu, the Plot properties dialog appears.
Each of the following items can be modified:
Axis X, its gridlines and axis labels
The visibility, the title with its font and position, the scale - both its appearance
(logarithmic or reversed) and its labels - and origin can be modified on the X axis.
The axis label rotation can also be set in this menu.
Properties Axis X


Axis Y and Z and its gridlines


Provides the same options as Axis X and its gridlines.
Appearance
Four different items can be customized from this menu and its sub-menu:

 Background
 Header: title, color, font, visibility, color of the background
 Legend: title, color, font, visibility, color of the background
 Plot Area: Chart area, color, font, visibility, borders, surface

Properties Appearance


For the Header and Legend the text can be edited. One can customize the name,
such as only having part of the name displayed, the font and the color.
Properties Header

Graphic Objects
It is possible to include some graphical objects in the plot such as line, arrow,
rectangle, ellipse and text. Each of those objects can be configured in terms of color,
thickness and font if necessary.
3-D scatter plots can be enhanced by:
Addition of vertical lines
They “anchor” the points and can facilitate the interpretation of the plot.
A 3-D Scatter plot displayed with anchors


To add vertical lines, click on More (see section below on Additional Options).
Rotation
The plot can be rotated so as to show the relative positions of the points from a
more relevant angle; this can help detect clusters. Click on the plot and move it with
the cursor in the appropriate direction.
A 3-D Scatter plot after rotation


The axes can be interchanged in plot, using the arrows on the toolbar. If more than three
columns are selected, the axes can be changed from the drop-down lists next to the axis
arrows on the toolbar.
Additional options
Click on More to access more options for 3D scatter plots.
Scroll through the

 Gallery
 Data
 3D-View

options to customise the appearance of 3D Scatter Plots. These features are described in the
following sections.
3D Scatter plot gallery

Select from the gallery of plots to obtain the desired appearance of the plot.
3-D Scatter plot data


Define plot specifics with these options.


3-D Scatter plot 3-D view properties dialog

The rotation, perspective, and axis scales can be changed under the 3-D view tab.


7.5. Matrix plot


The matrix or surface plot can be seen as the 3-dimensional equivalent of a line plot to
display a whole table of numerical values with a label for each element along the 2
dimensions of the table. The plot has up to three axes:

 The first two show the labels, in the same physical order as they are stored in the
source file;
 The vertical axis shows the scale for the plotted numerical values.

Depending on the layout, the third axis may be replaced by a color code indicating a range of
values (contour plot), thus making the surface plot essentially a contour plot or a map plot
when looking at it straight from above. The layout can be changed by right clicking on the
plot and selecting Plot type for a shortcut to predefined layouts, or selecting Properties to
customize 3-D plots and make changes to the axes, legends, etc.
The Plot type submenu

The points can either be represented individually, or summarized according to one of the
following layouts:
Surface
It shows the table as a 3-D landscape.
Matrix plot with a landscape display

Contour
The contour plot has only two axes. A few discrete levels are selected, and points
(actual or interpolated) with exactly those values are shown as a contour line. It
looks like a geographical map with altitude lines.
Matrix plot with a contour display


This option is accessible from Plot type – Contour, or the Properties of the plot:
Surface plot menu

Map
On a map, each point of the table is represented by a small colored square, the color
depending on the range of the individual value. The result is a completely colored
rectangle, where zones sharing close values are easy to detect. The plot looks a bit
like an infrared picture.


Matrix plot with a map display

This option is accessible from Plot type – Map, or the Properties of the plot, the
option is Scatter chart, zoned, 2D projection.
Scatter plot menu

Bars
This option gives roughly the same visual impression as the landscape plot if there
are many points, otherwise the “surface” appears more rugged.
Matrix plot with a 3-D bar display


This option is accessible from the Properties of the plot.


Bar plot menu

3-D-Scatter is also accessible via this Properties menu, see 3-D scatter plot for help on that
plot.


7.6. Histogram plot


A histogram summarizes a series of numbers without actually showing any of the original
elements. The values are divided into ranges (or “bins”), and the elements within each bin
are counted.
The plot displays the ranges of values along the horizontal axis, and the number of elements
as a vertical bar for each bin.
Histograms are used to plot the data distribution, and often for density estimation:
estimating the probability density function of the underlying variable. The total area of a
histogram used for probability density is always normalized to 1. If the lengths of the intervals
on the x-axis are all 1, then a histogram is identical to a relative frequency plot.
A statistics table can be added to the plot by clicking the statistics button. This will print the
number of data elements as well as the distribution statistics Skewness (i.e. asymmetry),
Kurtosis (i.e. flatness), Mean, Variance and the Standard Deviation (SDev).
It is possible to redefine the number of bins, to improve or reduce the smoothness of the
histogram, using the drop-down list Bars.
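The normalization mentioned above can be verified numerically. A minimal Python sketch
(the data values are arbitrary):

    import numpy as np

    values = np.random.normal(size=200)

    # density=True normalizes the total histogram area to 1, giving an
    # estimate of the probability density function.
    density, edges = np.histogram(values, bins=15, density=True)
    widths = np.diff(edges)
    print(np.sum(density * widths))        # prints 1.0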

A histogram with different configurations: Few or Numerous bins


The histogram is one of the seven basic tools of quality control, which also include the
Pareto chart, check sheet, control chart, cause-and-effect diagram, flowchart, and scatter
diagram.

7.7. Normal probability plot


The normal probability plot is a graphical technique for normality testing: assessing whether
or not a data set is approximately normally distributed.
The data are plotted against a theoretical normal distribution in such a way that the points
should form an approximate straight line. Departures from this straight line indicate
departures from normality. Each element of the series is represented by a point. A label can
be displayed beside each point to identify the elements.
This type of plot enables a visual check of the probability distribution of the values.
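The same check can be reproduced outside the software. A minimal Python sketch using
scipy (the data values are arbitrary):

    import numpy as np
    from scipy import stats

    values = np.random.normal(loc=5.0, scale=2.0, size=100)

    # probplot returns the theoretical quantiles, the ordered data, and
    # the fitted straight line; r close to 1 indicates near-normality.
    (quantiles, ordered), (slope, intercept, r) = stats.probplot(values, dist="norm")
    print(slope, intercept, r)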
Normal distribution


If the points are close to a straight line, the distribution is approximately normal
(Gaussian).
Normal probability plot showing a series following a Normal distribution

Normal distribution with outliers


If most points are close to a straight line but a few extreme values (low or high) are
far away from the line, these points are outliers. In the example below sample 50
looks like an outlier.
Normal probability plot showing a series following Normal distribution with an
outlier

Not a Normal distribution


If the points are not close to a straight line, but determine another type of curve, or
clusters, the distribution is not normal.


Normal probability plot showing a series not following a Normal distribution

7.8. Multiple scatter plot


This plot displays several scatter plots. A maximum of five variables at a time are used and
scatter plots for each pair of variables are shown above the diagonal. The variables are
indicated on the diagonal and can be changed from the list.
Multiple scatter plot structure

              Variable 1               Variable 2               Variable 3
Variable 1    Name of variable 1       Scatter plot between     Scatter plot between
                                       Variable 1 and 2         Variable 1 and 3
Variable 2    R-square for variable    Name of variable 2       Scatter plot between
              1 and 2                                           Variable 2 and 3
Variable 3    R-square for variable    R-square for variable    Name of variable 3
              1 and 3                  2 and 3
The colors of the panels on the lower diagonal are an indicator of the correlation. Positive
correlation is indicated in shades of blue while negative values are shown in shades of red.
This plot helps in quickly identifying relationships between variables and allows one to
choose variables to examine in greater detail.
It is especially useful for detecting which variables are responsible for a discrimination
between sample groups, for example.
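The quantities behind the panels are ordinary pairwise correlations. A minimal Python
sketch (the data values are hypothetical):

    import numpy as np

    X = np.random.rand(50, 5)              # samples in rows, variables in columns

    # Pairwise correlation coefficients: the sign corresponds to the
    # blue/red shading of the lower-diagonal panels, and the square of
    # each coefficient is the R-square value shown there.
    corr = np.corrcoef(X, rowvar=False)
    r_square = corr ** 2
    print(np.round(r_square, 2))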
Access the Multiple Scatter plot from the menu Plot - Multiple Scatter
Plot - Multiple Scatter menu


Then it is necessary to specify the scope of the plot.


Multiple Scatter plot Scope

Once the variables are selected, click OK and the plot will appear in the viewer.
Multiple scatter plot

If more than four variables have been selected for the multiple scatter plot, others can be
displayed by choosing them from the drop-down list on the diagonal of the plots.
Variable drop-down list menu


7.9. Tabular summary plots


A table plot is nothing more than results arranged in a tabular format, displayed in a
graphical interface which optionally allows for resizing and sorting the columns of the table.
Although it is not a “plot” as such, it allows tabulated results to be displayed in the same
viewer system as other plots.
Example of table plot: Table of Correlation

The table plot format is used under two different circumstances:


 A few analysis results require this format, because it is the only way to get an
interpretable summary of complex results. A typical example is Analysis of Variance
(ANOVA); some of its individual results can be plotted separately as line plots, but
the only way to get a full overview is to study 4 or 5 columns of the table
simultaneously.
 Standard graphical plots like line plots, 2-D scatter plots, matrix plots, etc. can be
displayed numerically to facilitate the export of the underlying numbers to
another graphical package, or a worksheet.
To do so, use the option View Numerical accessible in two ways: from a right click
on the plot and from the View menu.
View Numerical option from a Right click on the plot and from the View menu


7.10. Special plots


This is an ad-hoc category which groups all plots that do not fit into any of the other
descriptions.
Some of these plots are an adaptation of existing plot types, with an additional
enhancement, while other plots have been developed to answer specific needs.
Mean and standard deviation plot
For instance, “Means” can be displayed as a line plot. However, to include standard
deviations (SDev) in the same plot, which is quite useful, the most relevant approach is to:

 configure the plot layout as bars;


 and display SDev as an error bar on top of the Mean vertical bar.

This is what has been done in the special plot “Mean and SDev”.
Special plot: Mean and SDev
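The same layout is easy to reproduce in other tools, which also clarifies how the plot is
constructed. A minimal Python sketch with matplotlib (the data values are hypothetical):

    import numpy as np
    import matplotlib.pyplot as plt

    X = np.random.rand(30, 4)              # samples in rows, variables in columns
    means = X.mean(axis=0)
    sdevs = X.std(axis=0, ddof=1)

    # Bars show the means; the error bars on top show one standard
    # deviation, as in the "Mean and SDev" special plot.
    plt.bar(range(len(means)), means, yerr=sdevs, capsize=4)
    plt.show()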


Visualize the outcome of a multiple comparisons test


This plot presents the levels of a design variable that have significantly different effects on a
response variable, in a graphical way that gives an immediate overview.
Special plot: Multiple Comparisons

Qualify the quality of a prediction


The Predicted with deviation plot shows the predicted value as well as the possible
deviation. It gives a direct indication of the level of trust one can have in the results. The
deviations are estimated as a function of the global model error, the sample leverage, and
the sample residual X-variance. A large deviation indicates that the sample used for
prediction is not similar to the samples used to make the calibration model. Such a sample is
a prediction outlier: check its values for the X-variables.
Special plot: Predicted with deviation


7.11. Plotting results from several matrices

7.11.1 Why is it useful?


In order to compare different results it can be useful to plot them in the same plot instead of
two separate plots.
Two separate plots


Two results in one plot


7.11.2 How to do it?


Access to Add Data…
To be able to add data to a plot it is necessary to access the Add Data… menu. This is
available when creating a custom layout. Begin by going to Insert - Custom Layout. When a
plot is displayed after formatting the custom layout, the Add Data option is accessible from
a right click on a plot displayed in the workspace.
Access Add Data… menu

Add Data… dialog box


The following dialog box opens.
Add Data… dialog box


It is necessary to locate the second set of data.


Matrix
Use the drop-down list if the data are in a data matrix and use the select result
matrix button if the data are in an analysis result.
Rows and Cols
Use the drop-down list if the subset is already defined and use the Define button if it
has to be defined.

7.12. Annotating plots


It is possible to customize a plot by adding text, lines and drawings to it.
To do this use the Draw toolbar:

Or right click in a plot frame:

Example of an edited plot


In order to remove drawing objects from plots, you can use either the Edit - Undo option (or
toolbar button), or you can select the drawing object using the mouse pointer and click the
keyboard Delete button.

7.13. Create Range Menu


In an interactive analysis it can be very useful to mark some samples in e.g. a scores plot to
create a new range. To do so, right click on the plot with the marked samples and select the
option Create Range.
Create Range Dialog

A dialog with the following frames will open:


 Sample Selection : Select whether the marked or unmarked samples (or both)
should be extracted from the model, and give the ranges informative names. By
default the marked and unmarked sample ranges will be named Outliers and Good
Samples, respectively.
 Create Range : The new range will be created based on one or more data tables
available in the project navigator. All data tables with the correct number of rows
will be listed in this frame. Use the radio buttons to define whether a new data table
should be created or if the ranges should be added to existing tables. As an
additional quality control it is possible to list only data tables with matching sample
names. A yellow warning sign next to a table indicates that the sample names are
missing or non-matching.

7.14. Plotting: The smart way to display numbers


Mean and standard deviation, PCA scores, regression coefficients: all these results from
various types of analyses are originally expressed as numbers. Their numerical values are
useful, e.g. to compute predicted response values. However, numbers are seldom easy to
interpret as such.
Furthermore, the purpose of most of the methods implemented in The Unscrambler® is to
convert numerical data into information. It would be a pity if numbers were the only way to
express this information!
Thus visualization tools are provided for representation of the main results of the methods
available in The Unscrambler®. The best way, the most concrete, the one which helps
one get a real feeling for the results, is the following:
A plot!
Most often, a well-chosen picture conveys a message faster and more efficiently than a long
sentence, or a series of numbers. This also applies to raw data – displaying them in a smart
graphical way is already a big step towards understanding the information contained in
numerical data.
However, there are many different ways to plot the same numbers! The trick is to use the
most relevant one in each situation, so that the information which matters most is
emphasized by the graphical representation of the results.

7.14.1 Various plots


Numbers arranged in a series or a table can have various types of relationships with each
other, or be related to external elements which are not explicitly represented by the
numbers themselves. Plotting is a way of seeing the structure. The chosen plot has to reflect
this internal organization, so as to give an insight into the structure and meaning of the
numerical results.
According to the possible cases of internal relationships between the series of numbers, The
Unscrambler® provides seven main types of plots for graphical representation of data:

 Line plot
 Bar plot
 Scatter plot
 3-D scatter plot
 Matrix plot
 Histograms


 Normal probability plot


 Multiple scatter plot

In addition, to cover a few special cases, two more kinds of representations are provided:

 Table plot
 Special plot

7.14.2 Customizing plots

 Zooming and re-scaling


 Formatting plot appearance
 Adding text and drawings
 Grouping samples
 Plotting results from several matrices
 Saving and copying a plot

7.14.3 Actions on a plot


A plot displays some information as points, bars or lines. Those items are displayed
according to their coordinates and values.
It is possible to access this information by pointing at the item. It is also possible to mark the
item for further use.

7.14.4 Plots in analysis


Specific plots for each analysis
When performing an analysis there are some plots that will summarize the information
better than others.
In The Unscrambler® there is a list of predefined plots for each analysis. This list can be
accessed through one of the following:
Navigator
A shortcut to the most important plots can be given in the Plots sub-node of a model
in the project navigator. The plots are displayed if the right-click model menu option
‘Show Plots’ is toggled on, and can be hidden by using the ‘Hide Plots’ option.
Plot node under a PCA analysis in the navigator


From the Plot menu


The plot menu changes for each analysis, providing an extensive list of the available
plots.
Plot menu specific to the PCA analysis

From a right click on a plot


There, the plot menu is named after the method (for example PCA) and provides the
full list of available plots.


Interpreting plots
To get specific information on all the available plots for each analysis, see the specific Plot
sections under the respective methods.

 Design of Experiments
 Descriptive statistics
 Statistical tests
 Principal Component Analysis (PCA)
 Multiple Linear Regression (MLR)
 Principal Components Regression (PCR)
 Partial Least Squares Regression (PLS)
 L-shaped PLS Regression (L-PLS)
 Multivariate Curve Resolution (MCR)
 Cluster analysis
 Projection
 SIMCA
 Prediction

7.15. Kennard-Stone (KS) Sample Selection


The objective of this function is to select subsets of samples that evenly cover the
multivariate space, as originally described by Kennard and Stone (1969). The starting point for
this option is a score plot. This section describes the functionality of the Kennard-Stone
Sample Selection dialog as implemented in The Unscrambler® X.
User Dialog
The user dialog is found by right clicking in a score plot from PCA, PCR or PLS
regression, and then under the option Mark select Kennard-Stone Sample Selection.


It is also possible to enter the dialog from the icon in the Mark Toolbar

This will open the Kennard-Stone sample selection dialog.


Kennard-Stone Sample Selection

A detailed description of the inputs to the dialog is given below:


Number of samples
Number of calibration samples to select with the K-S algorithm. The default is 15.
Number of components
The number of components to use for the selection. The default is the optimal
number as found in the model.
Pre-Select samples - Include already marked samples
When selected, any marked samples in the score plot will be included in the
calibration sample set in addition to those identified by the K-S sample selection.
Pre-Select samples - Manually pre-select samples
Opens the Select samples dialog window for selecting samples from the data matrix
to be included in the calibration sample set.
Select validation samples
When enabled, a row set of the same size as the number of calibration samples will
be created as a validation set using the Double Kennard-Stone sample selection
algorithm.
Augment set with boxcar samples
Works only for PCR and PLSR models. When checked, the initial calibration set from
K-S will be augmented with samples to produce a more uniform distribution of
response values. Additional options are available for setting the number of bins for
boxcar samples and the number of samples to select. This option will be disabled if
Select validation samples is checked.
Create row set as new matrix
When selected, the samples will be extracted into a new matrix, with KS-Calibration
and optionally KS-Validation row sets added.
Create row set in selected matrix(es)
When selected, Calibration and optionally Validation row sets will be added to
selected, matching matrices.
Allow mis-matching sample names
While not checked, only matrices with identical sample names in the same order
will be listed. An exclamation mark is shown for matrices where the sample names
do not match.
The figure below shows the score plot after specifying 15 samples for calibration and
validation. The calibration samples are marked with green rectangles and the validation
samples with orange triangles.
The score plot with marked calibration and validation samples

When the option to create the sample set in selected Matrices is chosen, the matrices will
be added in the project navigator as shown below:


If the option to Create row set as new matrix has been chosen, a matrix with the name of
the X matrix from the scores plot will be created with KS appended to the matrix name.
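For readers interested in what the algorithm does, the following is a minimal Python sketch
of classical Kennard-Stone selection on a scores matrix (an illustration only, not the
implementation used by The Unscrambler®). Starting from the two most distant samples, it
repeatedly adds the sample whose smallest distance to the already selected set is largest:

    import numpy as np

    def kennard_stone(scores, n_select):
        """Indices of n_select samples spread evenly over the space."""
        d = np.linalg.norm(scores[:, None, :] - scores[None, :, :], axis=2)
        # Start with the two samples that are farthest apart.
        selected = list(np.unravel_index(np.argmax(d), d.shape))
        remaining = [i for i in range(len(scores)) if i not in selected]
        while len(selected) < n_select:
            # Max-min criterion: pick the candidate whose distance to the
            # closest already-selected sample is largest.
            min_d = d[np.ix_(remaining, selected)].min(axis=1)
            selected.append(remaining.pop(int(np.argmax(min_d))))
        return selected

    scores = np.random.rand(40, 2)         # e.g. scores on the first two PCs
    calibration = kennard_stone(scores, 15)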

7.16. Marking
It is often useful to mark some samples or variables in a plot to:

 Create a new range of samples or variables


 Recalculate with modification on those samples or variables (Downweight, exclude,
include only)

7.16.1 How to mark samples/variables


There are several toolbar buttons available to mark a sample or a variable in a plot. The
Mark functions can also be accessed from the Edit - Mark menu, or by right-clicking in a plot
and selecting Mark.
The Edit - Mark menu

One by one
This option enables one to use the cursor to select an item to mark by clicking on it.
Rectangular
This option allows several grouped samples to be selected at the same time. The
cursor is transformed into a pointer that will allow the user to define the top left
corner and the bottom right corner of the rectangle.
Samples marked with rectangle option


The different types of Markings can be accessed from Edit-Mark.. or from toolbar shortcuts.
Lasso
This option allows the cursor to be used to define a free-form area. All samples
inside the area will be marked. To define the area, click and hold the mouse button
while drawing the contour of the area; when the button is released the selection is
made.
Samples marked with lasso

Evenly distributed samples only…


Automatically mark samples uniformly throughout the data.
For more information see the Select evenly distributed samples documentation.
Kennard-Stone Sample Selection…
Automatically mark representative samples using the Kennard-Stone sample
selection algorithm, or use the double Kennard-Stone to extract both calibration and
validation samples.
For more information see the Kennard-Stone sample selection documentation.


Mark significant X-variables


This option is available only if:

 variables are selected from PCA, PCR or PLS results, and

 the Uncertainty test was enabled.

The selection is automatic.


Mark outliers
Add outliers to the current selection. These outliers are based on the warning limits
associated with a given analysis on the Warning Limits tab.
Unmark all
This option is used to remove a previous selection.
Reverse marking
When some items are selected in a plot and one would like to select the unselected
items, i.e. invert the current selection, the button Reverse marking can be used.

7.16.2 How to create a new range of samples or variables from the marked items
Once some samples / variables are selected in a plot it is possible to create a new range
including them. To do so right click on the plot with the selected items and select the option
Create Range.
Menu create range

For all raw data plots and for model plots of variables (e.g. PCA loadings), the new range
appears under the corresponding data table node with the default name “RowRange” or
“ColumnRange”.
New range created


When a sample range is created from within a model scores plot, a dialog is opened to allow
sample extraction into a new or existing data table. See the extract samples documentation
for details.

7.16.3 Recalculate with modifications on marked samples and/or variables


Once some samples / variables are selected in a plot it is possible to perform a new analysis
based on the same parameters as previously used, including a modification affecting the
selected samples and/or variables.
Select the analysis in the project navigator and right click. Select the Recalculate option.
Menu recalculate

Five options are available:


With Marked…
This option performs the recalculation using only the marked/selected samples or
variables; the rest are left out.

Without Marked…
The marked samples and/or variables are not included in the analysis; the
unselected samples and/or variables are.


With Marked Downweighted…


The marked variables are downweighted. See more information about downweight.
The other variables keep their original weight.
With UnMarked Downweighted…
The unmarked variables are downweighted. See more information about
downweight. The other variables keep their original weight.
With New Data
Additional data can be added to an analysis using this option. This will open a new
dialog from which the new data are selected. These new data can be appended to
the original data or original data in the matrix can be overwritten for the new
analysis.
Add data set

7.17. Point details


In addition to the general information available about the whole plot, one may also display
specific details regarding one particular point. This is done as follows:
 Rest the cursor close to a data point: the point number is displayed.
 Click on the point: a small box containing point number, point name and point
coordinates is displayed as shown in the figure below.
Point details


7.18. Formatting of plots


All the plots can be customized. This is done from the properties dialog which is accessed by
a right click on the plot and the selection of the Properties menu,

or by selecting the properties shortcut from the toolbar


When selecting the Properties menu, the Plot properties dialog appears.
Each of the following items can be modified:
Axis X and its gridlines
The visibility, the title with its font and position, the scale - both its appearance
(logarithmic or reversed) and its labels - and origin can be modified on the X axis.
The axis label rotation can also be set in this menu.
Properties Axis X


Axis Y and its gridlines


Provides the same options as Axis X and its gridlines.
Appearance
Five different items can be customized from this menu:

 Background
 Header: title, color, font, visibility, color of the background
 Legend: title, color, font, visibility, color of the background
 Point Label: color, font, visibility
 Axis Label: title, color, font, visibility, borders

Properties Appearance

For the Point Label and Axis Label the text can be edited. One can customize the
name, such as only having part of the name displayed. For this option use the drop-
down list in Label layout - Show.
Properties: Point Label


Graphic Objects
It is possible to include some graphical objects in the plot such as line, arrow,
rectangle, ellipse and text. Each of those objects can be configured in terms of color,
thickness and font if necessary.
Properties Graphic Objects

Chart properties
It is possible to further customize the chart properties by selecting More, which will
open up the Chart properties dialog. Here one can define simple or complex chart
types from the options in the chart gallery. Further selection of chart properties can
be made, and the chart previewed.
Chart Properties


7.19. Formatting of 3D plots


All the plots can be customized. This is done from the properties dialog which is accessed by
a right click on the plot and the selection of the Properties menu,

or by selecting the properties shortcut from the toolbar


When selecting the Properties menu, the Plot properties dialog appears.
Each of the following items can be modified:
Axis X, its gridlines and axis labels
The visibility, the title with its font and position, the scale - both its appearance
(logarithmic or reversed) and its labels - and origin can be modified on the X axis.
The axis label rotation can also be set in this menu.
Properties Axis X


Axis Y and Z and its gridlines


Provides the same options as Axis X and its gridlines.
Appearance
Four different items can be customized from this menu:

 Background
 Header: title, color, font, visibility, color of the background
 Legend: title, color, font, visibility, color of the background
 Plot Area: Chart area, color, font, visibility, borders, surface

Properties Appearance

For the Header and Legend the text can be edited. One can customize the name,
such as only having part of the name displayed, the font and the color.


Graphic Objects
It is possible to include some graphical objects in the plot such as line, arrow,
rectangle, ellipse and text. Each of those objects can be configured in terms of color,
thickness and font if necessary.
Properties Graphic Objects

Chart properties
It is possible to further customize the chart properties by selecting More, which will
open up the 3D Chart properties dialog. Here one can define the chart types from
the options in the chart gallery.
Chart Properties


Additional options of a 3-D plot can be changed from the tab in the properties dialog. In the
Data tab, the layout of the data can be changed.
3-D Scatter plot data properties dialog

The rotation, perspective, and axis scales can be changed under the 3-D view tab.
3-D Scatter plot 3-D view properties dialog


7.20. Plot – Response Surface…


This dialog opens when clicking on the predefined plot “Response Surface” or when clicking
in the Plot - Response Surface menu when regression results are opened.
It contains four fields:
Y Variable
This is the response variable to be plotted. Use the drop-down list to select one.
Factor
This is only for PLS and PCR but not for MLR. Select the optimal number of factors to
be used. This affects the Beta coefficients and thus the response surface.
X Variable - 1
The predictor variable to be used in the first direction.
X Variable - 2
The predictor variable to be used in the second direction.

Click OK to generate the response surface or Cancel to go back to the viewer.


7.21. Saving and copying a plot

7.21.1 Saving a plot


Access Save Plot… menu
A plot can be saved from the Save Plot… menu. It is accessible from a right click on a plot
displayed in the workspace.
Save Plot… menu

Save As… dialog box


The following dialog box opens.
Save As… dialog box

Select where the plot should be stored in the field Save in.


Enter a name for the plot in the field File name and select a format.
Types of format
There are six possible graphics file formats available for compatibility with many needs:
EMF
Use the EMF format which is vector graphics whenever possible. Vector graphics can
be scaled and will give the best quality.
Compatibility: EMF support is often limited to Microsoft applications. When sending
the plot graphics file for instance by email, a recipient may encounter problems
viewing and reusing it.
PNG
The second choice is PNG, which is raster graphics, and does not look as good when
enlarged.
This format is most suitable for web publishing and email.
This will generally result in smaller files than the following formats.
Compatibility: 5-10 year old applications may not support this image format.
Select one of the above formats. The following formats are also raster graphics, each having
its limitations; they are included only for compatibility.
GIF
Limited to 256 colors.
JPEG
Lossy compression that will give artifacts. (JPEG is best suited for photographic
images.)
TIFF
Will produce larger files.
BMP
Will produce larger files.
Available image formats

7.21.2 Copying plots


It is possible to copy either one plot or all plots displayed in the workspace.
Copy one plot

Access Copy menu


The Copy menu is available from two places:
From right click on a plot
Right click on a plot and select Copy.
Copy from right click


From Edit menu


Go to the Edit menu and select Copy.
Copy from Edit menu

Copy with keyboard shortcut


The shortcut Ctrl+C is a fast way to copy a plot.
Copy all plots

Access Copy All menu


The Copy All menu is also accessible from a right click on a plot displayed in the workspace.

Result of Copy All


After pasting, the plots that were displayed on the workspace will be shown without
borders.


Example of Copy All

Pasting plots
Depending on the target application there may be different options for pasting, such as the
shortcut Ctrl+V or a Paste command in an Edit menu.

7.22. Scope: Select plot range


When creating a plot, it is necessary to define the scope of the plot in terms of:

 Data set (matrix),


 Samples (row range),
 Variables (column range).

A common dialog appears when selecting any of the plotting options from Plot:

 Line
 Bar
 3D Scatter
 Matrix
 Histogram
 Normal Probability
 Multiple Scatter

Define the row and column ranges from predefined ranges using the drop-down list.
To use new ranges, click on the icon that looks like a matrix to access a matrix from the
project navigator, and on Define to access the Define Range dialog.
Plot scope dialog


To use data that are part of a results matrix, use the select result matrix button to
choose the desired results matrix.

7.23. Edit – Select Evenly Distributed Samples


This tool allows users to automatically select a representative subset of the samples in any
plot of samples. The selection can be used to create a range.
Evenly distributed samples dialog

Min/Max
Selects the samples most separated in the data set.
A number of extreme samples will be picked out for each PC, according to the
specification in the right column in the table below the method choice. It will be
labeled Number of min/max, and for each min/max selected, two extreme samples
are marked (max and min value). Thus, setting the number to 2 will mark a total of
four samples.
Classes
The samples will be divided into a number of classes for each PC. One pair of
extreme samples (max and min value) will be picked out for each PC, according to a
user’s specification in the right column in the list below the Methods field. It will be
labeled Number of classes, and for each class, two extreme samples are marked.
Thus, setting the number to 2 will mark a total of four samples.
Then, in the list below the method choice, specify the number of PCs (listed in the left
column) for which to mark samples, and how many (listed in the right column). No samples
are marked for PCs with 0 in the right column, i.e., in the above figure, only PC 1 is marked.

7.24. Zooming and Rescaling

7.24.1 General options


When a plot is displayed in the view pane, it is possible to modify this view by several scaling
options:
Full-screen
To view a plot in full-screen mode, select it by clicking on it and use the Full-screen
button.
The plot will be expanded in full-screen mode. To come back to the usual view in the
view pane, right click on the expanded plot.
Zoom-in
To zoom in on a displayed plot (the zoom is centered on the plot area), there are
three options:

 Use the Zoom-in button

 Use the keyboard: Ctrl+Up-arrow

 Use the scroll wheel: Scroll up or left

Zoom-out
To zoom out of a displayed plot (the zoom is centered on the plot area), there are
three options:

 Use the Zoom-out button

 Use the keyboard: Ctrl+Down-arrow

 Use the scroll wheel: Scroll down or right

Frame-scale
To zoom in on a specific area it is more convenient to define the area to zoom into
with a rectangle. To access this functionality use the Frame-scale button.
A cross will appear, with which the area to zoom into can be defined. A dotted
rectangle will appear around the defined frame and, on release, the zoom will be
performed.
Defining the frame to zoom-in


Move
It is possible to move inside the plot itself. To do so use the keyboard: Ctrl+Shift.
Auto-scale
To come back to the original view of the plot defined by The Unscrambler®, use the
Auto-scale button.

7.24.2 Special options


For Matrix and 3D-Scatter there are two ways to zoom-in:

 Using the mouse wheel, will zoom the points and bars within the cube
 Using Ctrl+Left mouse drag up and down, will zoom the cube itself

7.24.3 Resize plots

From the viewer one can resize the four-pane view by dragging the center + sign
between the panes.

8. Design of Experiments
8.1. Experimental design
Experimental design is a strategy to gather empirical knowledge, i.e. knowledge based on
the analysis of experimental data and not on theoretical models. It can be applied when
investigating a phenomenon in order to gain understanding or improve performance.
Building a design means carefully choosing a small number of experiments that are to be
performed under controlled conditions.
Learn about the concepts and methods of experimental design in the Introduction to Design
of Experiments section.
Learn how to use the Design of Experiments tools offered by The Unscrambler®:

 Create a design using Insert – Create design…


 Modify or extend an existing design using Tools – Modify/Extend Design…
 Analyze the experimental results using Tasks – Analyze – Analyze Design Matrix…
 Interpret the analytical results

8.2. Introduction to Design of Experiments (DoE)


The aim of multivariate data analysis is to extract the maximum amount of information from
a data table. The data can be collected from various sources or designed with a specific
purpose in mind.

 DoE basics
 Why use experimental design?
 What is experimental design?
 Investigation stages and design objectives
 Screening
 Factor Influence Study
 Optimization
 Available designs in The Unscrambler®
 Types of variables in experimental design
 Design vs. non-design variables
 Continuous vs. category variables
 Mixture variables
 Process variables
 Designs for unconstrained screening situations
 Full-factorial designs
 Fractional-factorial designs
 Plackett-Burman designs
 Designs for unconstrained optimization situations
 Central composite designs
 Box-Behnken designs
 Designs for constrained situations
 Mixture designs
 Axial designs: Screening of mixture components
 Simplex-centroid designs: Optimization of mixtures


 Simplex-lattice designs: Cover the mixture region evenly


 D-optimal designs
 Designs with simple linear constraints
 Non-simplex mixture designs
 Process/mixture designs
 Types of samples in experimental design
 Sample order in a design
 Blocking
 Extending a design
 Building an efficient experimental strategy
 Analyze results from designed experiments
 Simple data checks and graphical analysis
 Analysis Of Variance (ANOVA)
 Checking the adequacy of the model
 Analysis of effects using classical methods
 Response surface analysis using classical methods
 Limitations of ANOVA
 Analysis with PLS Regression
 When data are missing or experimental conditions have not been reached
 Advanced topics for unconstrained situations
 Advanced topics for constrained situations

8.2.1 DoE basics


Why use experimental design?
When collecting new data for multivariate modeling, one should pay attention to the
following criteria:

 Efficiency: get more information from fewer experiments;


 Focusing: collect only the information that is really needed.

There are four basic ways to collect data for an analysis:

 Obtain historical data (from a database, from plant records, etc.). However such
data may be biased by changes occurring during the period between acquisition and
analysis. It is anyhow a good start to get some general trends and ideas.
 Collect new data: record measurements directly from the production line, for
example, make observations in fish farms, process development lab, formulation
lab, etc. This will ensure that the data apply to the system being studied today (not
another system, three years ago). However most processes tend to be kept under
tight control and variation is minimal. This may lead to problems finding enough
variability to develop a model.
 Run specific experiments by disturbing (exciting) the system being studied. Thus the
data will encompass more variation than is to be naturally expected in a stable
system running as usual.
 Design experiments in a structured, mathematical way. By choosing symmetrical
ranges of variation and applying this variation in a balanced way among the
variables being studied, one will end up with data where effects can be studied in a
simple and powerful way. With designed experiments there is a better possibility of
testing the significance of the effects and the relevance of the whole model.

Experimental design (commonly referred to as DoE) is a useful complement to multivariate
data analysis because it generates “structured” data tables, i.e. data tables that contain an
important amount of structured variation. This underlying structure will then be used as a
basis for multivariate modeling, which will guarantee stable and robust models.
More generally, careful sample selection increases the chances of extracting useful
information from the data. When one has the possibility to actively perturb the system
(experiment with the variables), these chances become even greater. The critical part is to
decide which variables to change, the intervals for this variation, and the pattern of the
experimental points.
What is experimental design?
Experimental design is a strategy to gather empirical knowledge, i.e. knowledge based on
the analysis of experimental data and not on theoretical models. It can be applied when
investigating a phenomenon in order to gain understanding or improve performance.
Building a design means carefully choosing a small number of experiments that are to be
performed under controlled conditions. There are four interrelated steps in building a
design:

 Define the objective of the investigation: e.g. “better understand” or “sort out
important variables” or “find the optimum conditions”.
 Define the variables that will be controlled during the experiment (design variables),
and their levels or ranges of variation.
 Define the variables that will be measured to describe the outcome of the
experimental runs (response variables), and examine their precision.
 Choose among the available standard designs the one that is compatible with the
objective, number of design variables and precision of measurements, and has a
reasonable cost.

Most of the standard experimental designs can be generated in The Unscrambler® once the
experimental objective, the number (and nature) of the design variables, the nature of the
responses and the economical number of experimental runs have been defined. Generating
such a design will provide the user with the list of all experiments to be performed in order
to gather the required information to meet the objectives.

8.2.2 Investigation stages and design objectives


Depending on the stage of the investigation, the amount of information to be collected and
the resources that are available to achieve the goal, it is important to choose an adequate
design among those available in The Unscrambler®. The following describes the most
common standard designs for dealing with the various data types and situations described
above.
Screening
When starting a new investigation or a new product development, there is usually a large
number of potentially important variables. At this stage, the main objective of the
experimental work is to find out which are the most important variables. This is achieved by
including many variables in the design, and roughly estimating the effect of each design
variable on the responses with the help of a screening design. The variables which have
“large” effects can be considered as important. The isolated effects of single variables are
known as main effects and the purpose of screening designs is to isolate these only. There
are several ways to judge the importance of a main effect, for instance significance testing or
use of a normal probability plot of effects.
Some screening designs are capable of estimating interaction effects. These occur when the
effect of changing one variable depends on the level of other variables in the study. Some
variables may be important even though they do not seem to have an impact on the
response by themselves. The reason is that the presence of interaction effects may mask
otherwise significant main effects.
Models for screening designs
The user must choose the adequate form of the model that relates response variations to
variations in the design variables. This will depend on how precisely one wants to screen the
potentially influential variables and describe how they affect the responses. The
Unscrambler® contains two standard choices:

 The simplest form is a linear model. Choosing a linear model will allow one to
investigate main effects only with possible check for curvature effect;
 To study the possible interactions between several design variables, one will have to
include interaction effects in the model in addition to the linear effects.

When building a mixture or D-optimal design, one must choose a model form explicitly,
because the adequate type of design depends on this choice. For other types of designs, the
model choice is implicit in the design that has been selected.
Factor Influence Study
After an initial screening design has been performed and a number of important variables
have been isolated, a Factor Influence study can be performed using full factorial, or high
resolution fractional factorial designs. These are used to further study the main effects of
the variables, but also, they are used to investigate interactions of various orders: two factor
interactions involve two design variables, three factor interactions involve three variables
etc. The importance of an interaction can be assessed with the same tools as for main
effects.
Design variables that have an important main effect are important variables. Variables that
participate in an important interaction, even if their main effects are negligible, are also
important variables. The models generated in a factor influence study usually perform well
as predictive models and form the basis for optimization designs.
Optimization
At a later stage of investigation, when the variables that are important are already known,
one may wish to study the effects of these variables in more detail. Such a purpose will be
referred to as optimization. At the analysis stage this is also referred to as response surface
modeling.
Objectives of optimization
Optimization designs actually cover quite a wide range of objectives. They are particularly
useful in the following cases:
 Maximizing a single response, i.e. to find out which combination of design variable
levels leads to the maximum value of a specific response, and what this maximum
response is.
 Minimizing a single response, i.e. to find out which combination of design variable
levels leads to the minimum value of a specific response, and what this minimum is.
 Finding a stable region, i.e. to find out which combination of design variable levels
corresponds to a specific target response, with the added criterion that small
deviations from those settings would cause negligible change in the response value.
 Finding a compromise between several responses, i.e. to find out which combination
of design variable levels leads to the best compromise between several responses.
 Describing response variations, i.e. to model response variations inside the
experimental region as precisely as possible in order to predict what will happen if
the settings of some design variables were changed in the future.
Models for optimization designs
The underlying idea of optimization designs is that the model should be able to describe a
response surface which has a minimum or a maximum inside the experimental range. To
achieve that purpose, linear and interaction effects are not sufficient. An optimization model
should also include quadratic effects, i.e. square effects, which describe the curvature of a
surface.
A model that includes linear, interaction and quadratic effects is called a quadratic model.

8.2.3 Available designs in The Unscrambler®


The designs with their fields of application and the allowed number of design variables are
listed below.
Available types of experimental design
Full Factorial Design (Screening, Factor Influence; 2 - 9 design variables)
Study the effects of a low number of design variables independently from each other,
including interaction terms. The only design that allows for categorical variables with 3 or
more levels.

Fractional Factorial Design (Screening, Factor Influence; 3 - 13 design variables)
Depending on the number of variables, choose to study lower order effects independently
from each other, or create a screening design aimed at finding the most important main
effects among many.

Plackett-Burman Design (Screening; 8 - 35 design variables)
Economical alternative to fractional factorial designs; studies main effects only, as the
interaction effects are confounded in a complex way.

Central Composite Design (Optimization; 2 - 6 design variables)
Finds the optimal levels of the design variables by adding a few more experiments to a full
factorial design. All design variables must be continuous.

Box-Behnken Design (Optimization; 3 - 6 design variables)
An alternative to central composite designs, when the optimum response is not located at
the extremes of the experimental region and when previous results from a factorial design
are not available. All design variables must be continuous.

D-Optimal Design (Screening, Factor Influence, Optimization; 2 - 9 design variables)
Used when some design variables have multilinear constraints and the design is not
orthogonal. Analysis usually by Partial Least Squares Regression.

Axial (Mixture) Design (Screening; 3 - 20 design variables)
Contains mixture variables only; the design region is a simplex. Only linear (first order)
effects can be found.

Simplex-Lattice (Mixture) Design (Screening, Factor Influence, Optimization; 3 - 6 design
variables, or up to 9 if linear only)
Contains mixture variables only; the design region is a simplex. Tuneable lattice degree
(order).

Simplex-Centroid (Mixture) Design (Optimization; 3 - 6 design variables)
Contains mixture variables only; the design region is a simplex.
A D-Optimal design will be used with mixture variables if the experimental region is not a
simplex, or if there is a combination of mixture and process variables in the design. The
design region is often non-simplex when upper limit constraints are added to some of the
mixture components.

8.2.4 Types of variables in experimental design


This section introduces the nomenclature of variable types used in The Unscrambler®. Most
of these names are commonly used in the standard literature on experimental design;
however, the usage of these names may differ somewhat between software packages and
fields. Therefore it is recommended that the user read this section before proceeding to
more details about the various types of designs.
Design vs. non-design variables
In The Unscrambler®, all variables appearing in the context of designed experiments can be
categorized as either design or non-design variables.
Design variables
Performing designed experiments is based on controlling the variations of the variables that
are being investigated to study their effects. Such variables with controlled variations are
called design variables, or factors.
In The Unscrambler®, a design variable is completely defined by:

 Its name;
 Its type: continuous or category;
 Its constraints: mixture, linear;
 Its levels.

Response variables
These are a type of non-design variable: the measured output variables that describe the
outcome (usually a quality attribute) of the experiments. These variables are often the
subject of optimization.
Non-controllable variables
This second type of non-design variable refers to variables that can be monitored and may
have an influence on the response variables, but that cannot be controlled or reliably fixed
to a value; for example, the air humidity or the temperature in a plant.
Continuous vs. category variables
All variables have a pre-defined format or data type, and this format defines how the
variables are treated numerically and how they should be interpreted.
Continuous variables
All variables that have numerical values and that can be measured quantitatively are called
continuous variables. Note that this definition also covers discrete quantitative variables,
such as counts. It reflects the implicit use which is made of these variables, namely the
modeling of their variations using continuous functions.
Examples of continuous variables are: temperature, concentrations of ingredients (e.g. in %),
pH, length (e.g. in mm), age (e.g. in years), number of failures in one year, etc.
The variations of continuous design variables are usually set within a predefined range,
which goes from a lower level to an upper level. Those two levels have to be specified when
defining a continuous design variable. More levels between the extremes may be specified if
the values are to be studied more specifically.
If only two levels are specified, the other necessary levels will be computed automatically.
This applies to center samples (which use a mid-level, halfway between lower and upper),
and axial (star) samples in optimization designs (which use extreme levels outside the
predefined range).
Category variables
In The Unscrambler®, all non-continuous variables are called category variables. Their levels
can be named, but not measured quantitatively. Examples of category variables are: color
(Blue, Red, Green), type of catalyst (A, B, C, D), place of origin (Africa, The Caribbean Islands,
…), etc.
Binary variables are a special type of category variables that have only two levels
(sometimes referred to as dichotomous). Examples of binary variables are: use of a catalyst
(Yes/No), recipe (New/Old), type of electric power (AC/DC), type of sweetener (Artificial/
Natural), etc.
For each category variable, the user must specify all levels. The number of levels can vary
between 2 and 20.
Note: Since there is a kind of quantum jump from one level to another (there is no
intermediate level in between), center samples cannot be defined for category
variables. If there is a mix of category and continuous variables in the design, center
samples are defined for all continuous variables at each level of the category
variables.

Mixture variables
When performing experiments where some ingredients are mixed according to a recipe, one
may be in a situation where the amounts of the various ingredients cannot be varied
independently from each other. In such a case, one will need to use a special kind of design
called a Mixture design, and the design variables are called mixture variables (or mixture
components).
An example of a mixture situation is blending concrete from the following three ingredients:
cement, sand and water. If the percentage of water in the blend is increased by 10%, the
proportions of one of the other ingredients (or both) will have to be reduced so that the
blend still amounts to 100%.
However, there are many situations where ingredients are blended, which do not require a
mixture design. For instance in a water solution of four ingredients whose proportions do
not exceed a few percent, one may vary the four ingredients independently from each other
and just add water at the end as a “filler”. Therefore it is important to carefully consider the
experimental situation before deciding whether the recipe being followed requires a mixture
design or not!
Process variables
In a mixture situation, one may also want to investigate the effects of variations in some
other design variables which are not themselves a component of the mixture. Such variables
are called process variables in The Unscrambler®, and these are analyzed using a D-optimal
design.
Typical process variables are: temperature, stirring rate, type of solvent, amount of catalyst,
etc.

8.2.5 Designs for unconstrained screening situations


The Unscrambler® provides three classical types of screening designs for unconstrained
situations:
 Full-factorial designs for a number of design variables usually between 2 and 5
(maximum 9); the design variables may be two-level continuous or category with 2
to 20 levels.
 Fractional-factorial designs for any number of two-level design variables (continuous
or category) between 3 and 13.
 Plackett-Burman designs for any number of two-level design variables (continuous
or category) between 8 and 35.
Full-factorial designs
Full-factorial designs combine all defined levels of all design variables. For instance, a full-
factorial design investigating one two-level continuous variable, one three-level category
variable and one four-level category variable will include 2x3x4=24 experiments (excluding
center points).
Among other properties, full-factorial designs are perfectly balanced, i.e. each level of every
design variable is studied an equal number of times in combination with every level of the
other design variables.
Full-factorial designs include enough experiments to allow use of a model with all
interactions included. This can be very beneficial if the number of design variables is low;
however, it comes at the price of having to perform a high number of experiments if more
than a few variables are included. In this case, a fractional factorial design should be
considered.
Note: In theory a full factorial design can accommodate any number of levels also
for continuous variables, and such a design could be used for optimization. Because
central composite and Box-Behnken designs are much more economical than a 3-
level (or higher) full-factorial design, only two levels are allowed for continuous
variable factorial designs in The Unscrambler.
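To make the multiplicative growth in runs concrete, here is a minimal Python sketch (the factor names and levels are hypothetical, echoing the 2x3x4 example above) that enumerates such a plan:

from itertools import product

# Hypothetical factors: one two-level continuous variable and two category
# variables with three and four levels, as in the example above.
levels = {
    "Temperature": [20, 80],                      # continuous: low / high
    "Catalyst": ["A", "B", "C"],                  # category: 3 levels
    "Color": ["Blue", "Red", "Green", "Yellow"],  # category: 4 levels
}

# A full factorial combines every level of every variable: 2 x 3 x 4 = 24 runs.
runs = list(product(*levels.values()))
print(len(runs))   # 24 experiments, excluding center samples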

Fractional-factorial designs
In the specific case where there are only two-level variables (continuous with lower and
upper levels, and/or binary variables), one can define fractions of full factorial designs that
enable the investigation of as many design variables as the chosen full-factorial designs with
fewer experiments. These “economic” designs are called fractional factorial designs.
Given that a full-factorial design suitable for the investigation has already been defined, a
fractional design might be set up by selecting half the experimental runs of the original
design. For instance, one might try to study the effects of three design variables with only 4
(2^(3-1)) instead of 8 (2³) experiments. Larger factorial designs admit fractional designs with a
higher degree of fractionality, i.e. even more economical designs, such as investigating nine
design variables with only 16 (2^(9-5)) experiments instead of 512 (2⁹). Such a design can be
referred to as a fractional design; its degree of fractionality is 5. This means that one
investigates nine variables at the usual cost of four (thus saving the cost of five).
Example of a fractional-factorial design


In order to better understand the principles of fractionality, the following illustrates how a
fractional factorial is built in a concrete case: computing the half-fraction of a full
factorial with four variables (2^(4-1)).
In the following tables, the design variables are named A, B, C, D, and their lower and upper
levels are coded – and +, respectively.
First, the full factorial design is built with only variables A, B, C (2³), as shown below:
Full-factorial design 2³
Experiment A B C

1 – – –

2 + – –

3 – + –

4 + + –

5 – – +

6 + – +

7 – + +

8 + + +
In the table below additional columns are generated, which are computed from the products
of the original three columns A, B, C. These additional columns represent the interactions
between the design variables.
Full-factorial design 2³ with interaction columns
Experiment A B C AB AC BC ABC

1 – – – + + + –

2 + – – – – + +

3 – + – – + – +

4 + + – + – – –

5 – – + + – – +

6 + – + – + – –

7 – + + – – + –

8 + + + + + + +
The above design table is an example of an orthogonal table, i.e. the effect of each column
(main effect and interaction) can be estimated independently of each other.
In the table below, the column representing the highest degree of interaction (the ABC
interaction) is assigned to the variable D, as it is assumed that the ABC interaction is
negligible:
Fractional factorial design 2^(4-1)
Experiment A B C D

1 – – – –

2 + – – +

3 – + – +

4 + + – –

5 – – + +

6 + – + –

7 – + + –

8 + + + +
This new design allows the main effects of the four design variables to be studied
independently of each other; but what about their interactions? The table below shows all
of the two-factor interactions calculated after setting D = ABC.
Fractional-factorial design 2^(4-1) with interaction columns
Experiment A B C D AB = CD AC = BD BC = AD

1 – – – – + + +

2 + – – + – – +

3 – + – + – + –

4 + + – – + – –

5 – – + + + – –

6 + – + – – + –

7 – + + – – – +

8 + + + + + + +
This table shows that each of the last three columns is shared by two different interactions
(for instance, AB and CD share the same column).
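This construction is easy to reproduce. The sketch below (in Python, using coded -1/+1 levels for the – and + symbols above) generates the half-fraction and verifies the confounding pattern:

from itertools import product

# Full factorial in A, B, C with coded levels -1 (low) and +1 (high).
base = list(product([-1, 1], repeat=3))

# Generator D = ABC: the fourth variable reuses the three-factor interaction
# column, so D is confounded with ABC by construction.
design = [(a, b, c, a * b * c) for (a, b, c) in base]

for a, b, c, d in design:
    # The two-factor interactions come in confounded pairs: AB=CD, AC=BD, BC=AD.
    assert a * b == c * d and a * c == b * d and b * c == a * d
    print(f"A={a:+d}  B={b:+d}  C={c:+d}  D={d:+d}")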
Confounding
Unfortunately, as the above example shows, there is a price to be paid for saving on the
experimental costs! “He who invests less, will also harvest less”.
In the case of fractional factorial designs, this means that if one does not use the full-
factorial set of experiments, it is not possible to study the interactions as well as the main
effects of all design variables. This happens because of the way those fractions are built,
using some of the resources that would otherwise have been devoted to the study of
interactions, to study main effects of more variables instead.
This side effect of using fractional designs is called confounding. Confounding means that
some effects cannot be studied independently of each other.
For instance, in the above example, the two-factor interactions are all confounded with each
other. The practical consequences are the following:
 All main effects can be studied independently of each other, and independently of
the interactions;
 If the objective is to study the interactions themselves, using this specific design will
only enable one to detect whether any of the confounded interactions are
important. The experiments will not allow one to decide which are the important
ones. For instance, if AB (confounded with CD, “AB=CD”) turns out to be significant,
one will not know whether AB or CD (or a combination of both) is responsible for
the observed effect.

The list of confounded effects is called the confounding pattern of the design.
Resolution of a fractional factorial design
How well a fractional-factorial design avoids confounding is expressed through its resolution.
The three most common cases are as follows:

 Resolution III designs: Main effects are confounded with two-factor interactions.
 Resolution IV designs: Main effects are free of confounding with two-factor
interactions, but two-factor interactions are confounded with each other.
 Resolution V designs: Main effects and two-factor interactions are free of
confounding with each other, however some two-factor interactions are
confounded with three-factor interactions.

Definition: In a resolution R design, effects of order k are free of confounding with all effects
of order less than R-k.
In practice, before deciding on a particular factorial design, it is important to check its
resolution and its confounding pattern to make sure that it fits the experimental objectives!
Examples of factorial designs
A screening situation with three design variables is illustrated in the two examples below:
Options for screening design with three design variables

Full factorial (left) and fractional factorial (right) designs illustrated. The design points are
marked red. The points in the fractional factorial design are selected so as to cover the
maximum volume of the design space.
Plackett-Burman designs
If the experimental objective is to study the main effects only, and there are many design
variables to investigate (e.g. > 10), Plackett-Burman (PB) designs may be the solution. They
are very economical, since they require only one to four more experiments than the number
of design variables.
Plackett–Burman designs (Plackett and Burman, 1946) are experimental designs developed
while the authors were working in the British Ministry of Supply. Their goal was to find
experimental designs for investigating the dependence of some measured responses on a
number of independent variables (factors), each taking L levels. The designs were developed
in such a way as to minimize the variance of the estimates of these dependencies using a
limited number of experiments. Interactions between the factors were considered
negligible. The solution to this problem is to find an experimental design in which each
combination of levels for any pair of factors appears the same number of times. A complete
factorial design would satisfy this criterion, but the idea was to find smaller designs. An
example of a PB design is provided below.
Plackett–Burman design for 12 runs and up to 11 two-level factors
Run A B C D E F G H J K L

1 + − + − − − + + + − +

2 + + − + − − − + + + −

3 − + + − + − − − + + +

4 + − + + − + − − − + +

5 + + − + + − + − − − +

6 + + + − + + − + − − −

7 − + + + − + + − + − −

8 − − + + + − + + − + −

9 − − − + + + − + + − +

10 + − − − + + + − + + −

11 − + − − − + + + − + +

12 − − − − − − − − − − −
For the case of two levels (L=2), Plackett and Burman used the construction of Paley (Paley,
1933) for generating orthogonal matrices whose elements are all either 1 or -1 (Hadamard
matrices). Paley’s method could be used to find such matrices of N rows for most N equal to
a multiple of 4. In particular, it worked for all such N up to 100 except N = 92. If N is a power
of 2, however, the resulting design is identical to a fractional factorial design. In The
Unscrambler® the maximum limit of N is 36, which can accommodate n = N-1 = 35 design
variables (main effects). If there are fewer than N-1 effects to estimate, a subset of the
columns of the matrix is used.
The price to pay for estimating all these effects in a minimum number of runs is the very
complex confounding pattern of Plackett-Burman designs. Main effects are often partially
confounded with several interactions, and these designs should therefore be used very
carefully.
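For the two-level case these designs can be generated from a single published generator row that is shifted cyclically. The sketch below is the common textbook construction for the 12-run design (the row ordering is not necessarily the one used by The Unscrambler®):

import numpy as np

# Published generator (first row) of the N = 12 Plackett-Burman design;
# +1 / -1 code the high / low factor levels.
gen = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]

rows = [gen[-i:] + gen[:-i] if i else list(gen) for i in range(11)]
rows.append([-1] * 11)   # final run: all factors at their low level

X = np.array(rows)
# Hadamard property: any two columns are orthogonal, so X'X = 12 * I.
assert np.array_equal(X.T @ X, 12 * np.eye(11, dtype=int))

for r in rows:
    print(" ".join("+" if v > 0 else "-" for v in r))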

8.2.6 Designs for unconstrained optimization situations


The Unscrambler® provides two classical types of optimization designs:

 Central Composite designs for 2 to 6 continuous design variables;
 Box-Behnken designs for 3 to 6 continuous design variables.
Central composite designs


Central composite designs (CCD) are extensions of two-level full factorial designs. A CCD
enables a quadratic model to be fitted by including new levels in addition to the regular
lower and upper levels.
A CCD consists of three types of experiments:

 Factorial (cube) samples are experiments which combine the regular lower and
upper levels of the design variables; they are the “factorial” part of the design;
 Center samples are replicates of the experiment for which all design variables are at
their mid-level;
 Axial (star) samples are located such that they extend beyond the factorial levels of
the design for one factor at the time, all other design variables being at their mid-
level. These samples are specific to CCD designs.

Properties of a CCD
The properties of the simplest CCD, with two design variables, are shown below.
Central composite design with two design variables

From the figure it can be seen that each design variable has five levels: 1) low axial, 2) low
factorial, 3) center, 4) high factorial, and 5) high axial. Low factorial and high factorial are the
lower and upper levels that are specified when defining the design variable.

 The four factorial samples are located at the corners of a square (or a cube if there
are three variables, or a hypercube if there are more);
 The center samples are located at the center of the square;
 The four axial samples are located outside the square; by default, their distance to
the center is set to ensure rotatability (see below).

Because we do not know the position of the response surface optimum, we try to ensure
that the prediction error is the same for any point at the same distance from the center of
the design. This property is called rotatability, as the design axes can be rotated around the
origin without influencing the variance of the predicted response. This implies that the
information carried by any design point will have equal weight on the analysis, i.e. the design
points will have equal leverage. This property is important if one wants to achieve uniform
quality of prediction in all directions from the center. The distance that ensures rotatability
is given by 2^(k/4), k being the number of factors.
A spherical design is one in which all factorial and axial points have the same distance from
the origin. The 2- and 4-factor rotatable designs are also spherical designs (distance given by
k^(1/2)).
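The following minimal sketch (a hypothetical two-variable case in coded units) generates the three types of CCD samples, using the rotatable axial distance 2^(k/4):

from itertools import product

k = 2                    # number of continuous design variables
alpha = 2 ** (k / 4)     # axial distance ensuring rotatability (1.414 for k = 2)

cube = [tuple(float(v) for v in p) for p in product([-1, 1], repeat=k)]  # factorial samples
axial = []
for i in range(k):                                                       # star samples
    for a in (-alpha, alpha):
        pt = [0.0] * k
        pt[i] = a
        axial.append(tuple(pt))
center = [(0.0,) * k] * 3                                                # replicated center samples

for pt in cube + axial + center:
    print(pt)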
Types of CCD
Circumscribed central composite design (CCC)
This general type is the one described in the previous section, with factorial points
defined at the lower and upper levels and with axial points outside of these ranges.
Faced central composite design (CCF)
If for some reason one cannot use levels outside the factorial range, one can tune
the axial point distances down such that these points lie at the center of the cube
faces. This is called a faced central composite design (CCF). CCF designs are not
rotatable.
Inscribed central composite design (CCI)
Another way to keep all experiments within the pre-defined range is to use an axial
sample distance that ensures rotatability, but to shrink the entire design such that
the axial points fall on the pre-defined levels. This will result in a smaller investigated
range, but will guarantee a rotatable design. This is called an inscribed central
composite design (CCI).
Efficiency of the CCD
Depending on the constraints of the experiments and the accuracy to be achieved, select the
appropriate CC design using the following table:
Central composite design: constraints and accuracy
Design         Number of levels   Uses points outside high and low levels   Accuracy of estimates
Circumscribed  5                  Yes                                       Good over entire design space
Inscribed      5                  No                                        Good over central subset of the design space
Faced          3                  No                                        Fair over entire design space, poor for pure quadratic coefficients

Box-Behnken designs
Box-Behnken designs are not built on a factorial basis, but they are nevertheless good
optimization designs for second order models.
In a Box-Behnken design, all design variables have three levels: low cube, center, and high
cube. Each experiment combines the extreme levels of two or three design variables with
the mid-levels of the others. In addition, the design includes a number of center samples.
The properties of Box-Behnken designs are the following:

 The actual range of each design variable is low cube to high cube, which makes it
easy to handle;
 All non-center samples are located on a sphere, achieving rotatability for the 4-factor
design, and near-rotatability for the designs with 3, 5, or 6 factors.

Box-Behnken design: constraints and accuracy

Design       Number of levels   Uses points outside high and low levels   Accuracy of estimates
Box-Behnken  3                  No                                        Good over entire design space, more uncertainty on the edge of the design area
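As an illustration, here is a minimal sketch (in coded units, assuming the standard three-factor construction described above) that generates the Box-Behnken runs:

from itertools import combinations, product

k = 3   # design variables, coded -1 / 0 / +1 (low cube / center / high cube)

runs = []
for i, j in combinations(range(k), 2):
    for a, b in product([-1, 1], repeat=2):
        pt = [0] * k            # mid-levels for all variables ...
        pt[i], pt[j] = a, b     # ... except one pair at its extreme levels
        runs.append(tuple(pt))
runs += [(0,) * k] * 3          # center samples

for r in runs:
    print(r)                    # 12 edge midpoints + 3 center samples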
Examples of optimization designs


A central composite design for three design variables is shown here:
Central composite design with three design variables

The figure below shows the Box-Behnken design drawn in two different ways. In the left
drawing one can see how it is built, while the drawing to the right shows how the design is
rotatable.
Box-Behnken design

8.2.7 Designs for constrained situations


This chapter introduces “tricky” situations in which classical designs based upon the factorial
principle do not apply. Here two related cases will be discussed:

 General constraints in which the allowed levels of a design variable depend on the
levels of one or more of the other design variables: linear constraints;
 The special case of mixture situations, in which the levels of all design variables sum
to a fixed, total amount.

Each of these situations will then be described extensively in the following sections.
Note: Understanding the sections that follow requires basic knowledge about the
purposes and principles of experimental design. If the principles of experimental
design are unfamiliar, the user is strongly urged to read about it in the previous
sections (see What Is Experimental Design?) before proceeding with this section.

Mixture designs
A simple mixture design example
We will start describing the mixture situation by using an example.
A product development specialist has a specific problem to solve related to the optimization
of a pancake mix. The mix consists of the following ingredients: wheat flour, sugar and egg
powder. It will be sold in retail units of 100 g, to be mixed with milk for reconstitution of
pancake batter.
The product developer has learned about experimental design, and tries to set up an
adequate design to study the properties of the pancake batter as a function of the amounts
of flour, sugar and egg in the mix. She starts by plotting the region that encompasses all
possible combinations of those three ingredients, and soon discovers that it has a distinct
shape.
The pancake mix experimental region

The reason, as you may have guessed, is that the mixture always has to add up to a total of
100 g. This is a special case of multilinear constraint, which can be written with a single
equation:

Flour + Sugar + Egg = 100


This is called the mixture constraint: the amounts of all mixture components always have to
sum to 100% of the total product. This means that if you know the amounts of flour and
sugar in the mix, the amount of egg can be deduced by subtraction from 100%. In other
words, even if there are three mixture components, only two of them can be varied
independently at any time. The practical consequence is that the mixture region defined by
three ingredients is not a three-dimensional region! It is contained in a two-dimensional
surface called a simplex.
A simplex is a generalization of a triangle in possibly higher dimensions. If there are N
mixture components, the dimensionality of the simplex is N-1. For instance, for 4 mixture
components, the simplex is a tetrahedron. There is a special class of designs called mixture
designs which are based on regular simplexes.
Designs based on a simplex
Since the region defined by the three mixture components in the previous example is a two-
dimensional surface, we cannot use a factorial design to analyze the design region. Rather,
the design region is given below.
The pancake mix simplex
This simplex contains all possible combinations of the three ingredients flour, sugar and egg.
One can see that it is completely symmetrical. One could substitute egg for flour, sugar for
egg and flour for sugar in the figure, and still get exactly the same shape.
Classical mixture designs, first introduced by Scheffé, 1958, take advantage of this symmetry.
They include a varying number of experimental points, depending on the purposes of the
investigation. But whatever this purpose and whatever the total number of experiments,
these points are always symmetrically distributed, so that all mixture variables play equally
important roles.
These designs thus ensure that the effects of all investigated mixture variables will be
studied with the same precision. This property is equivalent to the properties of factorial,
central composite or Box-Behnken designs for non-constrained situations.
The figure below shows two examples of classical mixture designs.
Two classical designs for three mixture components

The first design is very simple. It contains three vertices (pure mixture components), three
edge centers (binary mixtures), and a single ternary mixture: the centroid. The second
design contains more points, spanning the mixture region regularly in a triangular lattice
pattern. It contains all possible combinations (within the mixture constraint) of five levels of
each ingredient. It is similar to a five-level full factorial design - except that many
combinations, such as “25%, 25%, 25%” or “50%, 75%, 100%”, are excluded because they
are outside the simplex.
Simplex with different boundaries
This example, taken from John A. Cornell’s reference book “Experiments With Mixtures”
(Cornell, 1990), illustrates how additional constraints are sometimes useful in practical
situations.
A fruit punch is to be prepared by blending three types of fruit juice: watermelon, pineapple
and orange. The purpose of the manufacturer is to use their large supplies of watermelons
by introducing watermelon juice, of little value by itself, into a blend of fruit juices.
Therefore, the fruit punch should contain at least 30% of watermelon juice. Pineapple and
orange have been selected as the other components of the mixture.
The manufacturer decides to use design of experiments to find the combination of fruit
juices that scores highest in a consumer preference survey. The ranges of variation selected
for the experiment are as follows:
Ranges of variation for the fruit punch design
Ingredient Low High Centroid

Watermelon 30% 100% 54%

Pineapple 0% 70% 23%

Orange 0% 70% 23%


The resulting experimental design has a number of features that makes it very different from
a factorial or central composite design.
First, the ranges of variation of the three variables are not independent. Since watermelon
has a lower level of 30%, the high level of pineapple cannot exceed 100 - 30 = 70% (in which
case the orange content would be 0%). The same holds true for orange.
The second feature concerns the levels of the three variables for the point called the
“centroid”: these levels are not halfway between “low” and “high”, they are closer to the
low level. The reason is, once again, that the blend has to add up to a total of 100%.
Since the concentrations of the ingredients cannot vary independently of each other, these
variables cannot be handled in the same way as the design variables encountered in a
factorial design. Whenever the ranges of the mixture components result in a simplex design
region, a selection of classical mixture designs are available instead. One example of a
mixture design for the optimization of Cornell’s fruit punch is shown below. It is seen that
the design region remains a simplex even though the lower boundary of watermelon juice
has been increased.
Design for the optimization of fruit punch

Axial designs: Screening of mixture components


In a screening situation, the primary objective is to study the main effects of each of the
mixture components. The main effect of an input variable is the change occurring in the
response variable when the input varies from low to high, all experimental conditions being
otherwise comparable.
In a factorial design, the levels of the design variables are combined in a balanced way, so
that one can follow what happens to the response value when a particular design variable
goes from low to high. It is possible to compute the main effect of that design variable
without regard to the remaining factors, because its low and high levels have been
combined with the same levels of all the other design variables.
In a mixture situation, this is no longer possible, as demonstrated in the previous figure.
While 30% watermelon can be combined with e.g. (70% P, 0% O) or (0% P, 70% O), 100%
watermelon can only be combined with (0% P, 0% O).
To find a solution to this problem the concept of “otherwise comparable conditions” must
be adapted to the constrained mixture situation. To screen what happens when watermelon
varies from 30% to 100%, this variation must be compensated in such a way that the mixture
still adds up to 100%, without disturbing the balance of the other mixture components. This
is achieved by moving along an axis where the proportions of the other mixture components
remain constant. In practice such mixtures are easily achieved by starting with the low level
of the component in question while having equal proportions of the remaining
components. Subsequent addition of the first component to the mix would correspond to
moving up the axis. This is illustrated for the watermelon example in the figure below.
Studying variations in the proportion of watermelon

Mixture designs with points along the axes of the simplex are called axial designs. They are
best suited for screening purposes because they capture the main effect of each mixture
component in a simple and economical way.
An axial design in four components is represented in the next figure. It can be seen that
several points are located inside the simplex: they are mixtures of all four components. Only
the four corners, or vertices (containing the maximum concentration of an individual
component) are located on the surface of the experimental region.
A four-component axial design

Each axial point is placed halfway between the overall centroid of the simplex (25%, 25%,
25%, 25%) and a specific vertex. Thus the path leading from the centroid (“neutral”
situation) to a vertex (100% of a single component) is well described with the help of the
axial point.
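This placement is simple to compute; a small sketch for the four-component case (compositions in percent, numpy used for convenience):

import numpy as np

n = 4                              # mixture components, each varying 0 - 100%
centroid = np.full(n, 100.0 / n)   # overall centroid: (25, 25, 25, 25)

for i in range(n):
    vertex = np.zeros(n)
    vertex[i] = 100.0                    # pure component i
    axial = (centroid + vertex) / 2.0    # halfway between centroid and vertex
    print(axial, "sum =", axial.sum())   # always 100: the mixture constraint holds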
In addition, end points can be included; they are located on the surface of the simplex,
opposite a vertex (they are marked by crosses on the figure). They contain the minimum
concentration of a specific component. When end points are included in an axial design, the
whole path leading from minimum to maximum concentration is studied. The figure above,
Design for the optimization of fruit punch, is an example of a three-component axial design
where end points have been included.

Simplex-centroid designs: Optimization of mixtures


For the optimization of the concentrations of several mixture components, one needs a
design that enables a highly accurate prediction for any mixture - whether it involves all
components or only a subset.
Peculiar behavior may occur when the concentration of a mixture component drops down to
zero. For instance, to prepare the base for a Dijon mayonnaise, one needs to blend Dijon
mustard, egg and vegetable oil. But what happens when the egg is removed from the
recipe? The resulting dressing will have a different appearance and texture. This illustrates
the importance of interactions (e.g. between egg and oil) in mixture applications.
Thus, an optimization design for mixtures will include a large number of blends of only two,
three, or more generally, a subset of the components to be studied. The most regular design
including those sub blends is called a simplex-centroid design. It is based on the centroids of
the simplex: balanced blends of a subset of the mixture components of interest. For
instance, to optimize the concentrations of three ingredients, each of them varying between
0 and 100%, the simplex-centroid design will consist of:

 The three vertices: (100,0,0), (0,100,0) and (0,0,100);
 The three edge centers (or centroids of the two-dimensional subsimplexes defining
binary mixtures): (50,50,0), (50,0,50) and (0,50,50);
 The overall centroid: (33,33,33).

A simplex-centroid design for four variables is illustrated in the figure below.


A 4-component simplex-centroid design

In general terms, if N mixture components vary from 0 to 100%, the blends forming the
simplex-centroid design are as follows:
 The vertices are pure components;
 The second order centroids (edge centers) are binary mixtures with equal
proportions of two selected components;
 The third order centroids (face centers) are ternary mixtures with equal proportions
of three selected components;
 The Nth order centroids have equal proportions of the selected N components, any
remaining components being zero.
Note: The overall centroid is a mixture where all N components have equal
proportions.
In addition, interior points can be included in the design. They improve the precision of the
results by “anchoring” the design with additional complete mixtures (i.e. mixtures where all
components are present), and they enable computation of cubic terms. The interior points
are located halfway between the overall centroid and each vertex, and they have the same
composition as the axial points in an axial design. When a design includes interior points, it is
said to be augmented. Note that for 3 mixture components, a centroid design augmented
with axial points equals an axial design with end points included (see e.g. fruit punch
example above).
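The centroids are straightforward to enumerate; a sketch for a hypothetical three-component mixture, each component varying from 0 to 100%:

from itertools import combinations

n = 3   # number of mixture components

blends = []
for order in range(1, n + 1):
    for subset in combinations(range(n), order):
        pt = [0.0] * n
        for i in subset:
            pt[i] = 100.0 / order   # equal proportions of the selected components
        blends.append(tuple(pt))

for b in blends:
    print(b)   # 3 vertices, 3 edge centers and 1 overall centroid: 7 blends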

Simplex-lattice designs: Cover the mixture region evenly


Sometimes one may not be specifically interested in a screening or optimization design. One
may be doing exploratory experiments. For example, one may just want to investigate what
would happen if three ingredients that have never been mixed before were combined.
This is one of the cases where the main purpose is to cover the mixture region as evenly and
regularly as possible. Designs that address that purpose are called simplex-lattice designs.
They consist of a network of points located at regular intervals between the vertices of the
simplex. Depending on how thoroughly you want to investigate the mixture region, the
network will be more or less dense, including a varying number of intermediate levels of the
mixture components. As such, it is quite similar to an N-level full factorial design. The figure
below illustrates this similarity.
A fourth degree simplex-lattice design is similar to a five-level full factorial

Simplex-lattice designs have a wide variety of applications, depending on their degree
(number of intervals between points along the edge of the simplex). Here are a few:

 Feasibility study (degree one or two): are the blends feasible at all?
 Optimization: with a lattice of degree three or more, there are enough points to fit a
precise response surface model.
 Search for a special behavior or property which only occurs in an unknown, limited
subregion of the simplex.
 Calibration: prepare a set of blends on which several types of properties will be
measured, in order to fit a regression model to these properties. For instance, one
may wish to relate the texture of a product, as assessed by a sensory panel, to the
parameters measured by a texture analyzer. If it is known that texture is likely to
vary as a function of the composition of the blend, a simplex-lattice design is
probably the best way to generate a representative, balanced calibration data set.
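The lattice points themselves are easy to enumerate: every blend whose proportions are multiples of 100/m percent and that sums to 100%. A short sketch for the fourth-degree, three-component lattice shown in the figure:

from itertools import combinations_with_replacement
from collections import Counter

q, m = 3, 4   # q mixture components, lattice of degree m (m intervals per edge)

blends = []
for combo in combinations_with_replacement(range(q), m):
    counts = Counter(combo)
    blends.append(tuple(100.0 * counts[i] / m for i in range(q)))

print(len(blends))   # C(q+m-1, m) = 15 blends for q = 3, m = 4
for b in blends:
    print(b)         # proportions at 0, 25, 50, 75 or 100%, cf. the five-level analogy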

D-optimal designs
A simple design subject to linear constraints
The following example is used to demonstrate the principles of design constraints.


A manufacturer of prepared foods wants to investigate the impact of several processing
parameters on the sensory properties of cooked, marinated meat. The meat is to be first
immersed in a marinade, then steam-cooked, and finally deep-fried. The steaming and frying
temperatures are fixed; the marinating and cooking times are the process parameters of
interest. The process engineer wants to investigate the effect of the three process variables
within the following ranges of variation:
Ranges of the process variables for the cooked meat design
Process variable Low High

Marinating time 6 hours 18 hours

Steaming time 5 min 15 min

Frying time 5 min 15 min


A full factorial design would give the following factorial (cube) experiments:
The cooked meat full factorial design
Sample Mar. Time Steam. Time Fry. Time

1 6 5 5

2 18 5 5

3 6 15 5

4 18 15 5

5 6 5 15

6 18 5 15

7 6 15 15

8 18 15 15
After carefully analyzing this table, the process engineer expresses strong doubts that
experimental design can be of any help in this situation.
“Why?” asks the statistician in charge. “Well,” replies the engineer, “if the
meat is steamed then fried for 5 minutes each it will not be cooked, and at
15 minutes each it will be overcooked and burned on the surface. In either
case, we won’t get any valid sensory ratings, because the products will be far
beyond the ranges of acceptability.”
After some discussion, the process engineer and the statistician agree that an additional
condition should be included:
“In order for the meat to be suitably cooked, the sum of the two cooking
times should remain between 16 and 24 minutes for all experiments”.
This type of restriction is called a multilinear constraint. In the current case, it can be written
in mathematical form as two inequalities, as follows:

Steam + Fry ≥ 16 and Steam + Fry ≤ 24


The impact of these constraints on the shape of the experimental region is shown in the two
figures below:
The cooked meat experimental region - no constraints

The cooked meat experimental region - multilinear constraints

The constrained experimental region is no longer a cube! It follows that a full factorial design
poorly explores that region.
The design that best spans the new region is given in the table below:
The cooked meat constrained design
Sample Mar. Time Steam. Time Fry. Time

1 6 5 11

2 6 5 15

3 6 9 15

4 6 11 5

5 6 15 5

6 6 15 9

7 18 5 11

8 18 5 15

9 18 9 15

10 18 11 5

11 18 15 5

12 18 15 9
This design contains all “corners” of the experimental region, in the same way as the full
factorial design does when the experimental region has the shape of a cube.
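The effect of the constraint is easy to check programmatically; the sketch below filters the original cube samples through the agreed condition and shows that half of them are lost, which is why the extra vertex points above are needed:

from itertools import product

# Cube samples of the full factorial: (marinating, steaming, frying) times.
cube = list(product([6, 18], [5, 15], [5, 15]))

# The agreed multilinear constraint: 16 <= steaming + frying <= 24.
feasible = [(m, s, f) for (m, s, f) in cube if 16 <= s + f <= 24]
infeasible = [pt for pt in cube if pt not in feasible]

print(feasible)     # only the (5, 15) and (15, 5) cooking-time combinations survive
print(infeasible)   # (5, 5) is undercooked, (15, 15) overcooked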
Depending on the number and complexity of multilinear constraints, the shape of the
experimental region can be more or less complex. In the worst cases, it may be almost
impossible to imagine! Therefore, building a design to screen or optimize variables linked by
multilinear constraints requires special methods. The following section will introduce a
special class of designs beneficial for these situations. More complex examples and further
ways to build constrained designs are given in the section Advanced topics for constrained
situations.
Introduction to the D-optimal principle
Those familiar with factorial designs are most likely aware that one of their most important
characteristics is their ability to study all effects independently of each other. This property,
called orthogonality, is important for relating variations in responses to variations in the
design variables. Without orthogonality, the estimated effects may become unreliable.
As soon as multilinear constraints are introduced among the design variables, it is no longer
possible to build an orthogonal design. Considering that the effect of a variable is estimated
on the premise that all other influences are held constant, it may not come as a surprise that
associations between design variables make the interpretations more difficult. In the more
severe cases of dependencies between variables, the effects will become indistinguishable
or the numerical calculations will fail. As soon as the variations in one of the design variables
are linked to those of another design variable, orthogonality cannot be achieved.
The D-optimal principle ensures that, based on a set of candidate points, the selected design
matrix X has columns as close to orthogonal as possible. Mathematically, this is achieved by
maximizing the determinant of the information matrix X'X, which is known as the
D-optimality criterion (the apostrophe meaning ‘transposed’). The volume of the joint
confidence region of the resulting regression coefficients is thereby minimized, i.e. the
precision of the model parameter estimates will be maximized. An example of a design
matrix X could be the cooked meat constrained design table above, including some or all of
the available design points (rows) as well as any center points or replicates. Also, any
interaction or higher order terms would be included as additional columns in X.
Because the determinant of X'X tends to increase as more experimental runs are included in
the design, the D-optimality criterion is not well suited for comparing designs of different
sizes. The related D-efficiency is independent of the number of runs:

D-efficiency = 100 × |X'X|^(1/p) / n

Here, n is the number of experimental runs and p is the number of model terms. The
D-efficiency ranges from 0 to 100%, where a factorial design without center points has a
D-efficiency of 100%. While a large design will tend to have a larger value of |X'X| and yield
a smaller confidence region for the parameters, the average point precision as estimated by
the D-efficiency will be comparable for differently sized designs.
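The computation is a one-liner given a model matrix; a small sketch (numpy; the example matrix is a hypothetical 2^2 factorial with intercept and interaction columns):

import numpy as np

def d_efficiency(X):
    # D-efficiency (in %) of a model matrix X with n runs and p model terms.
    n, p = X.shape
    return 100.0 * np.linalg.det(X.T @ X) ** (1.0 / p) / n

# A 2^2 factorial with intercept and interaction columns (coded units):
X = np.array([[1, -1, -1,  1],
              [1,  1, -1, -1],
              [1, -1,  1, -1],
              [1,  1,  1,  1]], dtype=float)
print(d_efficiency(X))   # 100.0 for an orthogonal factorial design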
Candidate design points
A point exchange algorithm is used to find the D-optimal design points in The Unscrambler®.
These points may optionally be augmented with a number of space filling points to ensure
good coverage also inside the experimental region. Both these procedures require a set of
candidate points as input. These points are set up in such a manner that they span the
maximum allowed design region as well as the interior region. The candidate points are the
following:
All extreme vertices. These are the outer corners of the design region:
The extreme vertices of a square design region
All edge centers. These are defined as the midpoint between any two vertices constituting
an outer edge of the design region:
The edge centers of a square design region

All face centers. These are defined as the center point on any outer surface of the design
region as spanned by three or more edges:
The face centers of a square design region

The overall centroid. This is the center point of the design. For a design with only two design
variables, the overall centroid overlaps with the single face center.
All axial check blends. These are defined as the midpoint on any axis spanned by the overall
centroid and the extreme vertices. These do not improve the coverage of the outer design
region but can be very useful space filling points for more robust models:
The axial check blends of a square design region

Point exchange algorithm


A D-optimal design containing a specified number of D-optimal points is found based on
the Fast Fedorov Exchange Algorithm (FFEA) (Nguyen and Piepel, 2005). Partially random
starting designs are used, in which a smaller subset of points is selected randomly and then
points are added one by one to maximize the D-efficiency. When the pre-specified number
of design points has been included, the design is optimized using the FFEA. The best
D-optimal design is finally selected from several such partially random starts. This ensures
that a good design is found and that it is less likely to result from a local maximum.
The points are selected from the candidate list without replacement. This means that the
algorithm itself will never return replicates of the selected points, and the maximum number
of points is bounded by the number of candidate points in each case. The number of
additional center points (overall centroids) as well as the number of replicates for the entire
design is specified separately. This enables a higher level of user control over the
replications, and it favors a better spread of points over the design region compared to
selection with replacement. On the other hand, the D-efficiency of the resulting design may
be slightly lower than if replication had been allowed. For practical use we believe the
benefits of a good spread in design points far outweigh a small reduction in D-efficiency
(see next section).
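As an illustration of the exchange idea (a toy greedy search under simplifying assumptions, not the exact FFEA used by The Unscrambler®), consider the following sketch:

import numpy as np

def det_xtx(X):
    return np.linalg.det(X.T @ X)

def exchange_d_optimal(cands, n_pts, n_starts=5, seed=0):
    # Toy point-exchange search: several partially random starts, then swap
    # design points for candidates whenever the swap increases |X'X|.
    rng = np.random.default_rng(seed)
    best, best_det = None, -np.inf
    for _ in range(n_starts):
        idx = list(rng.choice(len(cands), n_pts, replace=False))
        improved = True
        while improved:
            improved = False
            for pos in range(n_pts):
                for c in range(len(cands)):
                    if c in idx:
                        continue
                    trial = idx.copy()
                    trial[pos] = c
                    if det_xtx(cands[trial]) > det_xtx(cands[idx]):
                        idx, improved = trial, True
        if det_xtx(cands[idx]) > best_det:
            best, best_det = idx, det_xtx(cands[idx])
    return sorted(best)

# Hypothetical candidate points (intercept + two coded design variables):
cands = np.array([[1, -1, -1], [1, 1, -1], [1, -1, 1], [1, 0.5, 0.5],
                  [1, 0, 0], [1, -1, 0], [1, 0, -1]], dtype=float)
print(exchange_d_optimal(cands, 4))   # indices of the selected design points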
Addition of space filling points
The list of D-optimal points returned from the FFEA is optionally used as a starting point for
a subsequent Kennard-Stone selection process (Kennard and Stone, 1969). During this
process, the design is augmented with a specified number of space filling points in order to
span the entire design region as evenly as possible. These points are taken from the
remaining candidate list, i.e. the selection is based on candidate points that have not already
been selected in the point exchange algorithm.
While D-optimal designs provide precise model terms and good predictions of training data,
they tend to focus on the outer regions of the design space. It has been shown that designs
with samples spread evenly across the entire design region tend to be more robust in many
cases (Naes and Isaksson, 1989). Inclusion of space filling points by Kennard-Stone enables
better modeling of the interior design region and may therefore give more accurate
response surfaces and stable predictions when applying the model on new data. Also space
filling points tend to make the design less dependent on which model terms are included.
This is beneficial because the exact model equation is usually not known in advance.
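A minimal sketch of the Kennard-Stone idea (illustrative only, not The Unscrambler's exact implementation): start from the pair of points farthest apart, or from an already selected set, then repeatedly add the candidate that is farthest from its nearest selected neighbor:

import numpy as np

def kennard_stone(points, k, seed=None):
    # 'seed' may hold indices that were already selected (e.g. D-optimally);
    # the algorithm then augments that set with space filling points.
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    chosen = list(seed) if seed else list(np.unravel_index(D.argmax(), D.shape))
    while len(chosen) < k:
        rest = [i for i in range(len(points)) if i not in chosen]
        # Next point: the candidate farthest from its nearest chosen neighbor.
        chosen.append(max(rest, key=lambda i: D[i, chosen].min()))
    return chosen

pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [0.2, 0.8]])
print(kennard_stone(pts, 4))               # covers the corners first
print(kennard_stone(pts, 6, seed=[0, 3]))  # augments an existing selection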
The condition number (C.N.)
In order to minimize the negative consequences of a deviation from the ideal orthogonal
case, one needs a measure of the “lack of orthogonality” of a design. This measure is
provided by the condition number (C.N.) (Golub, 1996):
C.N. = largest eigenvalue / smallest eigenvalue of the matrix X'X
It indicates the degree of multicollinearity in the design matrix as follows:
 C.N. = 1: no multicollinearity, i.e. orthogonal
 C.N. < 100: multicollinearity not a serious problem
 100 < C.N. < 1000: moderate to severe multicollinearity
 C.N. > 1000: severe multicollinearity
It is also linked to the elongation or degree of “non-sphericity” of the region actually
explored by the design. The smaller the condition number, the more spherical the region,
and the closer a design is to being orthogonal.
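In code, the measure is immediate; the sketch below (numpy, with hypothetical coded design matrices) contrasts an orthogonal factorial with a constrained variant of it:

import numpy as np

def condition_number(X):
    # C.N. = largest / smallest eigenvalue of X'X for a design matrix X.
    eig = np.linalg.eigvalsh(X.T @ X)
    return eig.max() / eig.min()

# An orthogonal 2^2 factorial in coded units has C.N. = 1 ...
X_ortho = np.array([[-1., -1.], [1., -1.], [-1., 1.], [1., 1.]])
# ... while pulling one corner inward to satisfy a constraint distorts the region:
X_constr = np.array([[-1., -1.], [1., -1.], [-1., 1.], [0.5, 0.5]])

print(condition_number(X_ortho))    # 1.0
print(condition_number(X_constr))   # 1.6: no longer orthogonal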
Another important property of an experimental design is its ability to explore the whole
region spanned by the design variables. It can be shown that once the shape of the
experimental region has been determined by the constraints, the design with the smallest
condition number is the one that encloses maximal volume. It follows that if all extreme
vertices are included in the design, it has the smallest attainable condition number. If that
solution is too expensive, however, one needs to select a smaller number of points. The
consequence is that the condition number will increase and the enclosed volume will
decrease.
How good is the calculated design?
The condition number of an orthogonal design such as a non-modified factorial design is
exactly 1. Such a design has optimal properties in terms of interpretation, mathematical
robustness and economical considerations. The condition number of a non-orthogonal
(constrained) design will always be larger than one, and the larger the deviation, the less
favorable is the design. In general, caution should be exercised when analyzing a non-
orthogonal design using classical DoE analysis (ANOVA/MLR). The Unscrambler® suggests
analysis by Partial Least Squares Regression for D-optimal designs, as correlated effects are
handled much better by this method and misinterpretations will be rare.
If the design has a condition number much larger than, say, 100, this is an indication that the
experimental region is heavily constrained. In such a case any of several design factors
may have an influence on the response, but it is impossible to find out which (ANOVA might
suggest one of them arbitrarily; PLSR will correctly reveal that they are all correlated with
the response). This may occur when there is insufficient individual variation in the design levels
compared to the noise level of the experiment. To ensure sufficient orthogonal variation for
each effect, it is recommended that all of the design variables and constraints be critically re-
examined. One should search for ways to simplify the problem (see the section on Advanced
Topics for Constrained Situations); otherwise there is the risk of starting an expensive series
of experiments which will not give any useful information.

Designs with simple linear constraints


We will use the marinated meat example above to illustrate a design with multilinear
constraints. For simplification, we can focus on the “Steaming time” and “Frying time” and
take into account only one constraint:

Steaming time + Frying time ≤ 24.


The figure below shows the impact of the constraint on the variations of the two design
variables.
The constraint cuts off one corner of the “cube”

A full factorial design applied to this situation would result in a sub-optimal solution that left
one half of the experimental region unexplored (i.e. the triangle spanned by the remaining 3
points). So where should we place the 4th point in order to span the experimental region as
well as possible?
We could imagine two candidate points where the dashed line of the linear constraint
crosses the factorial design region in the above figure. Two alternative solutions for selecting
4 design points are illustrated below.
Designs with four points leaving out a portion of the experimental region
Design II in the figure seems to be a better option than design I, because the excluded region
is smaller. A design using points (1, 3, 4, 5) would be equivalent to (I), and a design using
points (1, 2, 4, 5) would be equivalent to (II). The worst solution of all would be a design with
points (2, 3, 4, 5): this would leave out the whole corner defined by points 1, 2 and 5.
It follows that if the whole experimental region was to be explored, more than four points
would be needed. The above example shows that a minimum of five points (1, 2, 3, 4, 5) are
necessary. These five crucial points are the extreme vertices of the constrained experimental
region. They have the following property: if a sheet of paper was wrapped around those
points, the shape of the experimental region would appear, revealed by the wrapping.
If there are more than two design variables or multiple constraints, it might not be
straightforward to find the best set of design points. The D-optimal criterion is commonly used to
find the best design in these situations.

Non-simplex mixture designs


D-optimal designs may also be used for analyzing mixtures. This is useful if there are upper
constraints on some of the mixture components such that the design region is non-simplex
(refer to the section, Is the Mixture Region a Simplex?). While the regular mixture designs
cannot handle these cases, a D-optimal design can be used by including a constraint that all
mixture components should sum to 100%. Additional upper or lower levels on any of the
mixture components will then have to be added as separate multilinear constraints.
Note: Classical mixture designs have much better properties than D-optimal
designs. Remember this before establishing additional constraints on mixture
components.

Process/mixture designs
Sometimes the product properties of interest depend on a combination of a mixture recipe
with specific process settings. In such cases, it is useful to investigate mixture and process
variables together. The process variables and the mixture variables are then combined using
the pattern of subfactorial designs and a D-optimal design can be generated.

8.2.8 Types of samples in experimental design


This section presents an overview of the various types of samples to be found in
experimental designs, along with their properties.
Factorial (cube) samples


Factorial samples can be found in factorial designs and their extensions. They are a
combination of high and low levels of the design variables in experimental plans based on
two levels of each variable. This forms a square for 2 variables or a (multidimensional) cube
for 3 (or more) variables. These samples are therefore sometimes referred to as cube
samples.
The same factorial design points are also found among other samples in central composite
designs. In Box-Behnken designs, all samples found on the factorial cube are also called
factorial samples (even though these design points are positioned on the edges rather than
the vertices of the cube).
All combinations of levels of the design variables in N-level full factorials are also called
factorial samples.
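As a minimal illustration, the cube samples of a two-level full factorial can be generated in
Python with a few lines (the function name is ours; -1 and +1 are the coded low and high
levels):

    from itertools import product

    def full_factorial(k):
        """All 2**k cube samples of a two-level full factorial design,
        in coded units (-1 = low level, +1 = high level)."""
        return list(product([-1, 1], repeat=k))

    # Three design variables -> 8 factorial (cube) samples
    for run in full_factorial(3):
        print(run)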
Center samples
Center samples are samples for which each design variable is set at its mid-level. When all
variables are continuous, the center points are located at the exact center of the
experimental region.
Center samples are not defined for categorical factors. When there is a combination of
continuous and category variables in the design, center points corresponding to the mid-
level of all continuous factors can be added for each unique combination of levels for up to 4
category variables.
For instance, if the number of two-level category variables in the design is (1, 2, 3, 4), this
results in (2, 4, 8, 16) single replicate center points, respectively. If two replicates of center
points are required, this doubles the total number of center points in the design. If we have
a three variable full factorial design with two two-level categorical variables, there are four
unique center points corresponding to the different level combinations of the categorical
factors. If 2 replicates of the center points are required, this results in 8 center points in
total.
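The counting rule above can be expressed as a small helper function, a sketch assuming one
center point per unique combination of category levels (the function name is hypothetical):

    from math import prod

    def n_center_points(category_levels, replicates=1):
        """One center point per unique combination of category levels,
        multiplied by the number of replicates."""
        return replicates * prod(category_levels)

    print(n_center_points([2, 2]))                 # 4, as in the example above
    print(n_center_points([2, 2], replicates=2))   # 8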
The higher the number of levels for the categorical variables, and the more replication
required, the more quickly the number of center points grows. It is suggested that when
either the number of categorical variables or their number of levels becomes larger than 2,
design replication may be a better option.
Center samples in screening designs. In screening designs, center samples are used for
curvature checking: Since the underlying model in such a design assumes that all main
effects are linear, it is useful to have at least one design point with an intermediate level for
all factors. Thus, when all experiments have been performed, one can check whether the
intermediate value of the response fits with the global linear pattern, or whether there are
signs of deviation from the straight line fit.
In the case of high curvature, one will have to build a new design which accepts a quadratic
model. The Unscrambler® provides an option to calculate curvature in a design when all
variables are continuous and at least one center point is present.
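As a hedged illustration of the idea, and not necessarily the exact computation performed by
The Unscrambler®, a common textbook curvature check compares the mean response of the
factorial points with the mean of the center replicates, using the pure error estimated from
those replicates (all data below are hypothetical):

    import numpy as np
    from scipy import stats

    # Hypothetical responses: 4 factorial runs and 3 center replicates
    y_fact   = np.array([12.0, 15.5, 14.2, 18.1])
    y_center = np.array([17.0, 17.4, 16.8])

    nf, nc = len(y_fact), len(y_center)
    curvature = y_fact.mean() - y_center.mean()

    # Pure error estimated from the replicated center samples
    s2 = y_center.var(ddof=1)
    se = np.sqrt(s2 * (1 / nf + 1 / nc))
    t = curvature / se
    p = 2 * stats.t.sf(abs(t), df=nc - 1)
    print(f"curvature = {curvature:.3f}, t = {t:.2f}, p = {p:.3f}")

A small p-value suggests that the center response deviates from the plane fitted through the
factorial points, i.e. a curvature effect.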
If at least 2 center samples are present (preferably 3), the model will also be tested for lack
of fit (LOF). This is a test comparing the variation of the measured responses within center
samples with the overall variation between measured and fitted (i.e. predicted) response
values. A significant LOF indicates that the model might benefit from additional terms.
In screening designs, center samples are optional; however, it is recommended that at least
three are included if possible. See the section on replicates for more details.
Center samples in optimization designs. In optimization designs, center samples are
important also for fitting higher order models. It is therefore recommended that 5 or more
are included in the design. In particular for Box-Behnken designs, ample center samples are
needed to fit a precise response surface.


Axial (Star) samples


Axial samples are used in Central Composite designs. Their coordinate along one axis often
exceeds the low or high level defined for the variable in question, while all other variables
are at the mid-level. The additional levels are beneficial for fitting a quadratic or cubic
model to the data.
Axial samples in a Central Composite design with two design variables

Axial samples can lie on centers of cube faces or they can lie outside the cube, at a given
distance from the center of the cube. This distance can be tuned, but it is recommended to
use the default distance (for the given design) whenever possible.
Three cases can be considered:
 The default axial to center point distance ensures that all design samples have
exactly the same leverage, i.e. the same influence on the model. Such a design is
said to be “rotatable”. If the number of design variables is two or four, this distance
also ensures that all factorial and axial points lie at the same distance from the
center, giving a “spherical” design region. For other numbers of factors, rotatability
almost, but not quite, corresponds with a spherical design (see the sketch after this list);
 The axial to center point distance can be tuned down to 1. In that case, the star
samples will be located at the centers of the faces of the cube. This ensures that a
Central Composite design can be built even if levels lower than “low cube” or higher
than “high cube” are impossible. However, the design is no longer rotatable;
 Any intermediate value for the star distance to center is also possible. The design
will not be rotatable.
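As a minimal sketch of the two distances discussed in the list above, assuming a Central
Composite design with a full 2^k factorial core (these are the standard textbook formulas;
the function names are ours):

    def rotatable_alpha(k):
        """Axial distance giving a rotatable CCD with a full 2**k
        factorial core: alpha = (number of factorial points) ** (1/4)."""
        return (2 ** k) ** 0.25

    def spherical_alpha(k):
        """Axial distance placing the star points on the same sphere
        as the factorial points: alpha = sqrt(k)."""
        return k ** 0.5

    for k in (2, 3, 4, 5):
        print(k, round(rotatable_alpha(k), 3), round(spherical_alpha(k), 3))

The two distances coincide for k = 2 (1.414) and k = 4 (2.0), which is why rotatability gives
an exactly spherical design region only in those cases.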
Sample types in mixture designs
An overview of the various sample types used in mixture designs is provided below:

 Axial design: vertex and axial samples, optionally end points and overall centroids;
 Simplex-centroid design: vertex samples, centroids of various orders, optional
interior (axial) points;
 Simplex-lattice designs: samples positioned in a regular grid (similar to multi-level
factorial samples), overall centroid.

Each type of point is described in more detail below.


Axial point
In a simplex design, an Axial point is positioned on the axis of one of the mixture
variables, half-way between the overall centroid and the vertex for that component.
Used in Axial designs and augmented Simplex-Centroid designs.
Centroid point
A Centroid point is calculated as the mean of the extreme vertices on a given
surface. Edge centers, Face centers and Overall Centroids are all examples of
centroid points. The number of mixture components involved in the centroid is
called the centroid order. For instance, in a four-component mixture, the overall
centroid is the fourth order centroid. Edge centers, or second order centroids, are
positioned in the center of the edges of the simplex. In The Unscrambler® the overall
centroid is denoted ‘Centroid’ while lower order centroids are referred to as ‘Blend’
points in Simplex-Centroid designs.
End point
In an axial design, ‘End’ points are optionally positioned at the bottom of the axis of
one of the mixture variables, and are thus on the opposite side to the axial point.
These are second order centroids and are referred to as Blend points in Simplex-
Centroid designs.
Face center
The face centers are positioned in the center of the faces of a simplex. They are also
referred to as third order centroids.
Interior point
An interior point is not located on the surface of a design, but inside the
experimental region. For example, an axial point is a particular kind of interior point.
Overall centroid
The overall centroid is calculated as the mean of all extreme vertices. It is the
mixture equivalent of a center sample.
Vertex sample
A vertex is a point where two lines meet to form an angle. Vertex samples are the
“corners” of the simplex corresponding to pure components.
Reference samples
Reference samples do not belong to a standard design, but are included for various
purposes.
Here are a few classical cases where reference samples are often used:

 When trying to improve an existing product or process, the current recipe or process
settings may be used as a reference.
 When trying to copy an existing product, for which the recipe is not known, one
might still include that product as reference and measure the responses on that
sample as well as on the others, in order to know how close the experimental
samples have come to that product.
 To check curvature in the case where some of the design variables are category
variables, one can include one reference sample with center levels of all continuous
variables for each level (or combination of levels) of the category variable(s).

Note: For reference samples, only response values can be taken automatically into
account in the Analysis of Effects and Response Surface analyses. Values of the
design variables may, however, be entered manually after converting to a non-
designed data table; a PLS analysis can then be run on the resulting table.
Replicates
Replicates are experiments performed several times under reproduced conditions. They
should not be confused with repeated measurements, where the samples are only prepared
once but the measurements are performed several times on each.
Why include replicates?


Replicates are included in a design in order to estimate the experimental error associated
with the system. This is doubly useful as it:

 Gives information about the average experimental error in itself;


 Enables a comparison of the response variation due to controlled causes (i.e. due to
variation in the design variables) with uncontrolled response variation. If the
“explainable” variation in a response is no larger than its random variation, the
variations of this response cannot be related to the investigated design variables.

How to include replicates


The usual strategy is to specify several replicates of the center sample. This has the
advantage of both being rather economical, and providing an estimation of the experimental
error under “average” conditions.
When no center sample can be defined (because the design includes category variables only
or variables with more than two levels), one may repeat the entire set of experimental
points instead. This also provides a better estimation of the experimental error across the
design region. If it is known that there is a lot of uncontrolled or unexplained variability in
the experiments, it might be wise to replicate the whole design.

8.2.9 Sample order in a design


The purpose of experimental design usually is to find out how variations in design variables
influence response variations. However, no matter how well the conditions of an
experimental setup are controlled, random variations still occur. The next sections describe
what can be done to limit the effect of random variations on the interpretation of the final
results.
Randomization
Randomization means that the experiments are performed in random order, as opposed to
the standard order which is sorted according to the levels of the design variables.
Most often, the experimental conditions are likely to “drift” during the course of the
investigation, such as when temperature and humidity vary according to external
meteorological conditions, or when the experiments are carried out by a new employee who
is better trained at the end of the investigation than at the beginning. It is crucial not to risk
confusing the effect of a change over time with the effect of one of the investigated
variables. To avoid such misinterpretation, the order in which the experimental runs are to
be performed is usually randomized.
Incomplete randomization
There may be circumstances which prevent the use of full randomization. For instance, one
of the design variables may be a parameter that is particularly difficult to tune, so that the
experiments will be performed much more efficiently if that parameter only needs to be
tuned a few times. Another case for incomplete randomization is blocking.
The Unscrambler® enables one to leave some variables out of the randomization. As a result,
the experimental runs will be sorted according to the non-randomized variable(s). This will
generate groups of samples with a constant value for those variables. Within these groups,
the samples will be randomized according to the remaining variables.
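A small pandas sketch of this behavior (the variable names and levels are hypothetical):
shuffling all runs and then re-sorting stably on the non-randomized variable groups the runs
by that variable while keeping a random order within each group.

    import pandas as pd

    # Hypothetical design table; 'Temperature' is hard to tune and is
    # therefore left out of the randomization
    design = pd.DataFrame({
        "Temperature": [20, 20, 20, 20, 80, 80, 80, 80],
        "Time":        [1, 2, 1, 2, 1, 2, 1, 2],
        "pH":          [5, 5, 7, 7, 5, 5, 7, 7],
    })

    # Shuffle all runs, then re-sort (stably) on the non-randomized
    # variable: runs are grouped by Temperature but appear in random
    # order within each group
    run_order = (design.sample(frac=1, random_state=42)
                       .sort_values("Temperature", kind="stable"))
    print(run_order)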

8.2.10 Blocking
In some situations it may not be possible to run all experiments under the exact same
conditions, or there may be other reasons to split the full set of runs into blocks that are
performed independently from the others in some sense. A common scenario is that raw
material comes from different batches, because there is not enough material in a single batch
to accommodate the full set of experiments. Often screening designs are extended into
factor influence studies, or factor influence studies are extended into optimization studies. If
this is performed in a planned manner, it will often be possible to re-use previous
measurements and supplement them with new ones. For instance, a low resolution
fractional factorial can be extended into a high resolution or full factorial design, which again
can be extended into a circumscribed or faced central composite design (see section
Extending a design below). Because these blocks of experiments are necessarily performed
in different points of time, there is a higher risk that non-controllable or unknown factors
differ between blocks. Whether such variation has an unwanted effect on the response
should always be investigated.
Any blocked experiment should be tested for unequal block means. For experiments where
measurements are divided into two distinct blocks, the response(s) can be tested using a
Student’s t-test for equality of means. A low p-value, or equivalently a large difference
between the plotted quantiles, indicates that there is a significant blocking effect. Any effect
confounded with blocks cannot be trusted if this is the case. Careful planning of the
experiment is required to avoid that effects of interest are confounded with, or non-
distinguishable from, blocks.
For any number of blocks the responses can be plotted in a quantiles plot, where the block
means and variances can be compared using the sample grouping option. If the distributions
of response values are similar across blocks, there is no evidence that block effects have had
an influence on the response.
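A sketch of the two-block test described above, using SciPy (the response values are
hypothetical):

    import numpy as np
    from scipy import stats

    # Hypothetical response values from two blocks (e.g. two raw-material batches)
    block1 = np.array([45.1, 47.3, 44.8, 46.0])
    block2 = np.array([49.2, 50.1, 48.7, 49.9])

    t, p = stats.ttest_ind(block1, block2)
    print(f"t = {t:.2f}, p = {p:.4f}")
    # A small p-value suggests unequal block means, i.e. a blocking effect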
Incomplete blocking of full factorial designs
If the full experiment is replicated, one should strive to include the full set of unique design
points in each block. This will ensure that any blocking effect is confounded with replicates
only, and all effects will be free of confounding with blocks. When all the treatment
combinations are included in each block, the design is referred to as a complete block design
and block effects should be tested as described above.
If this is not possible some effects will always be confounded with blocks, and the estimated
effects in question will include the block contribution as well. This is referred to as an
incomplete block design, and the efficiency of such a design depends on which effects are
confounded with blocks. Of course one would not want to create a design where any of the
main effects were confounded with blocks, as these main effects would be indistinguishable
from the block effects. Preferably the blocks should be set up such that they are confounded
with high order interactions only.
The Unscrambler® supports blocking of most full factorial experiments into 2ᵖ blocks, p being
smaller than the number of design variables. A full factorial design with three 2-level factors
may be divided into two or four blocks. A full factorial design with 3-7 2-level factors may be
split into two, four or eight blocks. The blocking generators are selected to ensure that as
many low-order interactions as possible can be estimated without confounding with blocks.
For instance, in a six-variable design divided into two blocks, the blocking effect will be
confounded with the six-variable interaction only.
In the ANOVA, all interactions confounded with blocks will be summarized in a separate
sums of squares for blocks. These individual interaction effects will not be given or tested in
the ANOVA, as they are indistinguishable from the blocking effects.


8.2.11 Extending a design


After a series of designed experiments has been performed, the results analyzed and
conclusions drawn from them, two situations may occur:

 The experiments have provided all the information needed, which means that the
project is completed.
 The experiments have given valuable information which can be used to build a new
series of experiments that will lead closer to the experimental objective.

In the latter case, the new series of experiments can sometimes be designed as a
complement to, or an extension of, the previous design. This allows one to minimize the
number of new experimental runs, and the whole set of results from the two series of runs
can be analyzed together.
Why extend a design?
In principle, one should make use of the extension feature whenever possible, because it
enables progression to the next stage of an investigation using a minimum of additional
experimental runs.
Extending an existing design is also a convenient way of building a new, similar design that
can be analyzed together with the original one. For example, if a chemical reaction has been
investigated using a specific type of catalyst, one might want to investigate another type of
catalyst under the same conditions as the first reaction, in order to compare their
performances. This can be achieved by adding a new design variable, namely type of
catalyst, to the existing design.
Design extensions can also be used as a basis for an efficient sequential experimental
strategy. That strategy consists in breaking the initial problem into a series of smaller,
intermediate problems and investing in a small number of experiments to achieve each of
the intermediate objectives. Thus, if something goes wrong at one stage, the losses are cut;
and if all goes well, one may end up solving the initial problem at a lower cost than if a huge
design had been used initially.
When and how to extend a design
The following text briefly describes the most common extension cases:

 Add levels: Used whenever one is interested in investigating more levels of already
included design variables, especially for category variables.
 Add a design variable: Used whenever a parameter that has been kept constant is
suspected to have a potential influence on the responses, as well as when one
wishes to duplicate an existing design in order to apply it to new conditions that
differ by the values of one specific variable (continuous or category), and analyze the
results together. For instance, if a chemical reaction using a specific catalyst has
been investigated, and now another similar catalyst for the same reaction will be
studied to compare its performance to the first one’s, the first design can be
extended by adding a new variable: type of catalyst.
 Delete a design variable: If the analysis of effects has established one or a few of the
variables in the original design to be clearly insignificant, the power of the
conclusions can be increased by deleting the variable(s) and reanalyzing the
design. Deleting a design variable can also be a first step before extending a
screening design into an optimization design. This option should be exercised with
caution if the effect of the removed variable is close to significance. Also be sure
that the variable to be removed does not participate in any significant interactions.
 Add more replicates: If the first series of experiments shows that the experimental
error is unexpectedly high, replicating all experiments might make the results
clearer.
 Add more center samples: In order to get a better estimation of the experimental
error, adding a few center samples is a good and inexpensive solution.
 Add more reference samples whenever new references are of interest. More
replicates of existing reference samples may be used in order to get a better
estimation of the experimental error.
 Extend to higher resolution: Use this option for fractional factorial designs where
some of the effects of interest are confounded with each other. This option can be
used whenever some of the confounded interactions are significant and one needs
to find out exactly which ones. This is only possible if there is a higher resolution
fractional factorial design available. Otherwise, one can extend to a full factorial
design instead.
 Extend to full factorial: This applies to fractional factorial designs where some of the
effects of interest are confounded with each other and no higher resolution
fractional factorial designs are available.
 Extend to central composite: This option completes a full factorial design by adding
star samples and (optionally) a few more center samples. Fractional factorial designs
can also be completed this way, by adding the necessary cube samples as well. This
should be used only when the number of design variables is small; an intermediate
step may be to delete a few variables first.

Caution! Whenever extending a design, remember that all the experimental
conditions not represented in the design variables must be the same for the new
experimental runs as for the previous runs.
How to ensure representative new samples
As the new experiments will be exploring a new area of the design space, it is important to
be sure that there has been no drift since the first experiments were performed.
To do so, include at least two or three new center samples. Once the experiments are
performed, run a t-test to compare the average of the first series of center samples with
that of the second. See the section on t-tests (Introduction to statistical tests) or blocking
for more details.

8.2.12 Building an efficient experimental strategy


How should experimental design be used in practice? Is it more efficient to build one global
design that tries to achieve the main goal, or would it be better to break it down into a
sequence of more modest objectives, each with its own design?
It is strongly advised to use the latter, sequential approach, even if the initial number of
design variables to be investigated is rather small. This has at least four advantages:

 Each step of the strategy consists of a design involving a reasonably small number of
experiments. Thus, the mere size of each subproject is more easily manageable.
 A smaller number of experiments also means that the underlying conditions can
more easily be kept constant for the whole design, which will make the effects of
the design variables appear more clearly.


 If something goes wrong at a given step, the damage is restricted to that particular
step.
 If all goes well, the global cost is usually smaller than with one huge design, and the
final objective is achieved all the same.

Example of an experimental strategy


The following example illustrates a possible experimental strategy. The objective is to
optimize a process that relies on six parameters: A, B, C, D, E, F. As it is not known which of
these parameters are influential, one must start at the screening stage.
The most straightforward approach would be to try an optimization at once, by building a
CCD with six design variables. This is possible, but costly (at least 77 samples required) and
also risky: if a wrong initial assumption were made, such as a wrong choice of the ranges of
variation, all experiments might be lost.
An alternative approach is described below:

 First, build a fractional factorial design 2⁶⁻² (resolution IV), with two center samples,
and perform the corresponding 18 experiments.
 After analyzing the results, it turns out that only variables A, B, C and E have
significant main effects and/or interactions. But those interactions are confounded,
so the design needs to be extended in order to know which are really significant.
 The first design is extended by deleting variables D and F and extending the
remaining part (which is now a 2⁴⁻¹, resolution IV design) to a full factorial design
with one more center sample. Additional cost: nine experiments.
 After analyzing the new design, the significant interactions which are not
confounded only involve A, B and C. The effect of E is clear and goes in the same
direction for all responses. But since the center samples show some curvature, one
must proceed to the optimization stage for the remaining variables.
 Thus, variable E is kept constant at its most interesting level, and after deleting that
variable from the design, the remaining 2³ full factorial design is extended to a CCD
with six center samples. Additional cost: nine experiments.
 Analysis of the final results yielded a desired optimum point. Final cost: 18+9+9=36
experiments, which is less than half of the initial estimate.

8.2.13 Analyze results from designed experiments


Simple data checks and graphical analysis
Any data analysis should start with simple data checks: use descriptive statistics, check
variable distributions, detect out-of-range values, etc.
For designed data, this is particularly important: one would not want to base a test of the
significance of the effects on erroneous data!
The good news is that data checks are even easier to perform when experimental design has
been used to generate the data. The reason for this is twofold:

 If the design variables have any effect at all, the experimental design structure
should be reflected in some way or other in the response data; graphical analysis
and PCA will visualize this structure and help one detect abnormal features.
 The Unscrambler® includes automatic features that take advantage of the design
structure (grouping according to levels of design variables when computing
descriptive statistics or viewing a PCA scores plot). When the structure of the design
shows in the plots (e.g. as subgroups in a box-plot, or with different colors on a
scores plot), it is easy to spot any sample or variable with an illogical behavior.

Analysis Of Variance (ANOVA)


The ANOVA table is a powerful tool to assess how well the model fits individual responses. It
has a Summary section that provides information about the overall significance of the
model. This is followed by a Variables section providing information about the importance of
the different design variables and their interactions. A Model Check section divides the total
variance into variability explained by terms of different order. For factorial and lower order
CCD models, all effects are orthogonal, meaning that e.g. the effect of linear terms equals
the sum of individual contributions.
Mixture designs are not orthogonal, and variances are therefore no longer additive. For
these designs, the Variables section provides the so-called marginal (Type III) sums of
squares (SS), reflecting the difference in SS between the full model and a model with the
effect in question left out. In contrast, the model check section provides the sequential
(Type I) SS, reflecting the increase in model SS when higher order terms are added to the
design. The model check section can be used to decide the optimal complexity of the
mixture model. Higher order terms should not be included unless they contribute
significantly to the model fit.
There is a Lack of Fit section that compares the experimental uncertainty (pure error) with
the residual variability due to inadequate modeling of the data (lack of fit). The pure error is
estimated based on replicated measurements of center samples. A significant lack of fit is an
indication that additional terms may improve the model. At the bottom of the ANOVA table,
there is a section with different model quality estimates such as calibration and prediction
R², prediction error sums of squares (PRESS), etc. The PRESS value reflects the error variance
when each observation is left out from the calibration model once and subsequently
predicted. It reflects the predictive ability of the model and is therefore a conservative
estimate of how good the model is. A PRESS value close to (or higher than) the corrected
total SS means very low predictive ability and will give an ‘R-square prediction’ value close to
zero (or negative). R-square prediction closer to 1.0 means that the predictive ability is good
and the PRESS value is correspondingly small.
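For MLR models, PRESS can be computed without refitting, because the leave-one-out residual
equals the ordinary residual divided by (1 - leverage). A minimal sketch, assuming a
hypothetical 2x2 factorial with one center point (the function name is ours):

    import numpy as np

    def press_and_r2pred(X, y):
        """PRESS via the OLS leverage identity (LOO residual =
        e_i / (1 - h_ii)); R-square prediction = 1 - PRESS / SS_tot."""
        X1 = np.column_stack([np.ones(len(X)), X])   # add intercept
        H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T     # hat matrix
        loo = (y - H @ y) / (1 - np.diag(H))         # LOO residuals
        press = np.sum(loo ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return press, 1 - press / ss_tot

    # Hypothetical 2**2 factorial with one center point, one response
    X = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1], [0, 0]], float)
    y = np.array([8.2, 12.1, 9.5, 15.3, 11.0])
    press, r2pred = press_and_r2pred(X, y)
    print(f"PRESS = {press:.2f}, R-square prediction = {r2pred:.3f}")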
The analysis sequence is then to first look at the model p-value and R². A p-value below 5%
indicates a significant model, and an R² close to 1 indicates a good correlation between the
predicted response value and the actual response value. Consideration must then be given
to the value of the individual effects or model terms and their sign. Consideration should
also be given to the corresponding p-values. Each effect with a p-value < 5% is considered
significant; if the p-value is < 1% it is highly significant. A p-value between 5 and 10%
indicates a marginally significant effect. A p-value > 10% indicates that an effect is not
considered to be significant.
ANOVA table

              Sum of Squares (SS)   Degrees of Freedom (DF)   Mean Square   F-ratio   p-value

  Summary
  Model       1.750e+03             3                         583.333       194.444   0.0001
  Error       12                    4                         3
  Total       1.762e+03             7                         251.714

  Variables
  A           50.000                1                         50.000        16.667    0.0151
  B           1.250e+03             1                         1.250e+03     416.667   0.0000
  AB          450.000               1                         450.000       150.000   0.0003


In this example the model is valid (p-value=0.0001) and all effects are significant (p-values <
0.05). The most significant effect is B as it has the smallest p-value.
Note: A saturated design is a design in which the number of experimental runs
equals the number of model terms (including the offset if necessary). This type of
design uses all the degrees of freedom to calculate the model terms; the error SS is
zero and p-values will not be available.
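To make the link between the table entries concrete, the F-ratio and p-value for effect A can
be reproduced from the mean squares with SciPy (the numbers are taken from the table above):

    from scipy import stats

    # Mean squares and degrees of freedom for effect A and the error term
    ms_a, ms_error = 50.0, 3.0
    df_a, df_error = 1, 4

    f_ratio = ms_a / ms_error                      # 16.667
    p_value = stats.f.sf(f_ratio, df_a, df_error)  # upper tail of F(1, 4)
    print(f"F = {f_ratio:.3f}, p = {p_value:.4f}") # ~0.0151, as in the table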

Checking the adequacy of the model


Some assumptions underlying the ANOVA need to be verified before the test results can be
fully trusted. The first assumption is that the observations are adequately described by the
model. The model is defined by the included effects, and the best way to validate the model
is to apply it on left-out observations and see how well the predicted and measured
responses correspond with each other. A low PRESS value, or correspondingly an ‘R-square
prediction’ close to one, is an indication that the first assumption holds.
Also, the errors should be normally and independently distributed with mean zero and
constant but unknown variance. An important step of the analysis is therefore to plot the
residuals in different representations. In short, no obvious structures or patterns should be
found in the residuals when these assumptions are met.
The normality assumption is checked by looking at the residual histogram or normal
probability plot. The first should ideally look like the bell-shaped probability density of the
normal distribution centered at zero. Samples displaying strong deviation from the normal
distribution will be detected as deviating from a straight line in the normal probability plot of
residuals. This plot can therefore also be used as an outlier detection tool. Note that if the
number of observations is small, even perfectly random residuals will deviate somewhat
from the ideal bell-shaped density function. Luckily, the significance tests are robust to
moderate departures from normality.
The independence assumption can be verified by plotting the Y-residuals in experimental
order. The reason for randomizing the experimental order of runs is to prevent time-
dependent variations from influencing the estimation of effects. Correlation between
residuals, however, indicates that the runs have not been independently measured, which
may seriously affect the validity of the results. Also the Y-residuals vs. Y-predicted plot
should be studied to see whether any obvious patterns are found. Independent residuals will
appear as random variations in these plots.
Both the Y-residuals in experimental order and the Y-residuals vs. Y-predicted plots can also
be studied to check the constant variance assumption. Use these plots to see whether the
spread of observations is larger in one end compared to the other. A funnel or cone shape of
the experimental points indicates that some measurements are more precise than others, or
equivalently that some measurements have a larger influence on the model than others. If
the variance is strongly associated with the magnitude of the response, a variance-stabilizing
transform such as log(Y), Y^(1/2), or 1/Y might be considered (Tip: histograms can be used to
assess the influence of different transforms on the response). If the precision of runs improves
somewhat in the course of the experiment, a model based on randomized runs will most
likely be robust to these changes.
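The three residual checks described above can be sketched with matplotlib and SciPy (the
residuals and predicted values below are hypothetical):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Hypothetical residuals and fitted values from a DoE model, in run order
    residuals = np.array([0.3, -0.5, 0.1, 0.8, -0.2, -0.6, 0.4, -0.3])
    predicted = np.array([10.1, 12.4, 11.0, 14.8, 9.7, 13.2, 12.9, 10.5])

    fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
    stats.probplot(residuals, dist="norm", plot=axes[0])   # normality check
    axes[0].set_title("Normal probability plot")
    axes[1].plot(residuals, "o-")                          # independence check
    axes[1].set_title("Residuals in run order")
    axes[2].plot(predicted, residuals, "o")                # constant variance check
    axes[2].set_title("Residuals vs. predicted")
    plt.tight_layout()
    plt.show()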
Note that if there are very few residual degrees of freedom left after estimating all the
effects in the model, artificial structure in the residuals can be expected simply due to lack of
information in the data. In the extreme case that the residual degrees of freedom is zero, all
the residuals will be zero as well. If a little more than the minimum number of experiments
can be afforded, this will benefit the interpretation of results.
Analysis of effects using classical methods
An analysis of the effects is usually performed for screening and factor influence designs:
Plackett-Burman, Fractional Factorial, and Full Factorial designs. These designs allow
estimation of main effects, and some of them also of 2-3 variable interactions.
The classical DoE analysis method for studying effects is based on the ANOVA-table. Main
effects or interactions found to be important in the ANOVA table can be investigated further
in an effects visualization plot. This will reveal the direction and magnitude of the individual
effects. It is important to note that even if a main effect seems to be irrelevant, the factor
can still have a large impact on the model if it takes part in a significant interaction effect.
Other checks that can be applied after analyzing the ANOVA table include the detection of
curvature effects. These can be found by examining the main effects plot. If a nonlinear trend
is detected when checking the position of the center sample, one may consider a possible
curvature effect and include the square term of the effect in the model.
Main effect plot with curvature

When a variable is categorical, it is necessary to check which effects are significant and also
if they are significantly different. The multiple comparison test provides this type of
information. It is based on a comparison of the averages of the response variable at the
different levels. If the difference between two averages is greater than the critical limit the
two levels are significantly different; if not, they have a similar effect. If no level has an
effect, all levels will have a statistically similar effect, and the averages for the response
variables at the different levels will not be significantly different.


In The Unscrambler®, there are three specific outputs for the multiple comparison test:

 A table of distances, that gives the two-by-two distance between the levels.
 A group table, that indicates the different grouping between the levels.
 A plot displaying the levels in their group.

More information can be found in the plot (Interpreting design analysis plots) section.


Response surface analysis using classical methods
A response surface analysis is very useful when the experimental objective is optimization.
This is often the case for Central Composite and Box-Behnken designs as well as Mixture
designs.
The classical DoE method of analysis for studying a response surface is to fit a quadratic (or
even a cubic) model by MLR. For mixture designs, a special type of MLR models called
Scheffé models are used, which do not include an offset parameter.
The ANOVA table is still the main tool to assess the significance of effects. The significance of
individual effects as well as two-variable and three-variable interactions, square and cubic
terms must be assessed, depending on the terms included in the analysis.
The available models for BB designs are:

 Main effects
 Main effects + interactions (2-variable)
 Main effects + interactions (2-variable) + quadratic terms

The available models for CCD designs are:

 Main effects + interactions (2-variable) + quadratic terms


 Main effects + interactions (2-variable) + quadratic + cubic terms
 Main effects + interactions (2- and 3-variable) + quadratic terms
 Main effects + interactions (2- and 3-variable) + quadratic + cubic terms

The models for mixture designs are:

 First order (linear),


 Second order (quadratic),
 Special cubic. This is similar to main effects + interactions (2- and 3-variable).
However as the model has a closure constraint quadratic terms are partially
included.
 Full cubic. This is similar to main effects + interactions (2- and 3-variable) + quadratic
terms.

The above lists correspond with pre-defined alternatives, and it is possible to remove terms
from any of these models in a hierarchical manner (except linear mixture terms, which
cannot be removed).
The response surface can be used to find optimal design settings. For CCD and BB designs,
the fitted response is plotted for the entire area spanned by two design variables, with any
remaining variables held constant at their minimum levels. Maxima, minima, saddle points or
stable regions can be detected by changing which variables to plot while varying the levels of
the remaining variables. For mixture designs, the plotted design region consists of three
mixture components forming a simplex/triangle.
More information on how to vary the conditions can be found in the RS table section in the
plot interpretation page.
Response surface

Limitations of ANOVA
Analyses based on MLR/ANOVA are very useful for orthogonal designs or mixture designs
where one or two (non-related) responses have been measured accurately following the
experimental conditions. ANOVA has some important shortcomings, however:
 The underlying MLR is based on the assumption that all variables can be measured
independently of all other variables in the model. This is always the case for
orthogonal designs such as the factorial designs. For some designs, such as
optimization designs including quadratic terms, mixture designs, D-optimal designs
or for any design where some experimental measurements are missing, some of the
model terms (effects) will become more or less correlated. If two correlated terms
both have an influence on the response, one of these will often (arbitrarily) come
out as significant at the expense of the other. While the ANOVA will automatically
handle standard designs such as mixture designs of simplex shape, a bilinear method
such as PLSR can take into account any number of correlated variables.
 If several responses are modeled, the MLR will fit a model to each response
independently. If all responses are orthogonal, one can then assess the ANOVA table
for each response without taking the remaining responses into account. The
problem is that real data are seldom or never orthogonal. For any two sufficiently
correlated responses, it is sub-optimal to try to assess the effects on one
independently from the other, and trying to find the main conclusions from several
ANOVA tables together is difficult in itself. A bilinear method such as PLSR can take
into account any number of correlated responses, and any relationships between
responses and descriptors will be easily detected.
 The reliability of the p-value estimates in the ANOVA table highly depends on the
residual degrees of freedom (DF) in the data after estimating all the parameters of
the model. If the error DF is low, the reliability of the estimated p-values is low as
well. This also limits the ability to check the assumptions of the model. When
several, correlated effects are estimated, the MLR consumes more DF than the true
number of underlying, independent effects. In contrast, with the bilinear methods
such as PLSR, the user estimates the optimal model rank based on the predictive
ability of the model.
 In the ANOVA table, the predictive ability of the model is given by the ‘PRESS’ and
‘R-square prediction’ values. These are based on leverage corrected residuals, which
in the case of MLR is identical to residuals obtained from a leave-one-out (LOO)
cross-validation. This reflects the ability of the model to predict each measurement
based on models fitted using all samples except the one in question. If some
samples are replicated, the LOO procedure will be overly optimistic. If there are for
instance 3 center samples in total, these will be predicted based on models where
the 2 remaining center samples have been accounted for. The prediction error will
therefore be smaller than if all center samples were kept out in the same step. In
general, all replicated measurements of any experimental point should be kept out
in a single cross-validation segment to ensure conservative error estimates.
 Non-controllable variables, i.e. variables that are believed to have an effect on the
responses but that are difficult to control at the required level of precision, are
currently not included in the ANOVA. In general, an attempt to include many of
these variables in an MLR model will have a high expense in terms of residual DF,
and the above considerations about correlation between terms would also have to
be taken into account. In PLSR any number of non-controllable variables can be
included, and they can optionally be downweighted in order to discover their
influence on the data without actually allowing them to influence the model. If e.g.
the run order was mixed up in the experiment, a passive descriptor giving the run
order or time-points of the individual measurements will reveal if any effects are
aliased with a time effect.
Analysis with PLS Regression
If some or all of the considerations above make analysis by ANOVA difficult, PLSR can always
be used as a powerful alternative. To get a refresher on the theory of PLSR, see the chapter
on Partial Least Squares regression.
Include all design variables, including any interactions, quadratic or cubic effects of
interest, in the descriptor (X) matrix. Any additional non-controllable variable, background
information about the samples, experimental details such as time of measurement, batch, or
change of instruments can be included here as well. Include all response variables. Weight
all variables with 1/SDev, or optionally downweight some of the descriptors.
Validate with cross-validation. The level of validation depends on the cross-validation
segments. If e.g. all experimental runs are replicated once, the replication error can be
assessed by leaving out a full set of experimental runs in two cross-validation segments.
Note that this will not tell you how well the model will predict new samples but rather it will
reflect the experimental error in the experiment. In order to estimate how well the model
predicts new measurements (when level combinations are allowed to vary within the design
region), keep out all replicates of each point once. This will be a more conservative and
correct estimate for the predictive power of the model.
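A sketch of this replicate-aware cross-validation using scikit-learn rather than The
Unscrambler® itself (the data, group labels and the choice of 2 components are hypothetical):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import GroupKFold, cross_val_predict

    # Hypothetical coded design matrix and one response; 'groups' assigns
    # all replicates of a design point to the same segment
    X = np.array([[-1, -1], [-1, -1], [1, -1], [1, -1],
                  [-1, 1], [-1, 1], [1, 1], [1, 1], [0, 0], [0, 0]], float)
    y = np.array([8.1, 8.4, 12.0, 11.7, 9.6, 9.4, 15.2, 15.5, 11.1, 10.8])
    groups = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]

    pls = PLSRegression(n_components=2, scale=True)  # scale=True ~ 1/SDev weighting
    y_cv = cross_val_predict(pls, X, y, groups=groups, cv=GroupKFold(n_splits=5))
    press = np.sum((y - y_cv.ravel()) ** 2)
    print(f"PRESS with replicates kept out together = {press:.3f}")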
Include the uncertainty test to get an estimate of the significance of the effects. The
following are important tools to interpret the model and make conclusions:
Weighted Beta coefficients with their uncertainty limit
The weighted B-coefficients are used to determine which effects are the most
important and their direction of influence. Effects with high positive or negative
regression coefficients have a larger influence on the response in question.
The uncertainty test shows which effects are significantly non-zero, averaged over
responses. Coefficients with high absolute values and little variation across cross-
validation segments will point to significant effects.
Estimated p-values
The uncertainty test will estimate p-values for all effects and interactions included in
the PLSR model. These are based on the size and stability of the PLSR regression
coefficients in the cross-validation.
Explained variance
This plot will reveal the optimal number of components in the model, its fit (blue
line) and predictive ability (red line). The optimal number of components
corresponds with the number of independent phenomena in the data that exceeds
the noise level of the measurements.
Correlation loadings
The loadings or loading weights will reveal the main dependencies between
descriptors and responses in two dimensions. Often these dimensions will capture
the majority of the co-variation between descriptors and responses.
The correlation between the factors and each original variable is captured by the
distance from the origin in the correlation loadings plot. Even downweighted
variables are easily mapped in these plots.
Outlier detection
The sample outlier or influence plots can reveal erroneous measurements or typos
that should be mended or removed.
Predicted vs. Reference
Used to assess the model’s goodness of fit (blue points) and predictive ability (red
points) for each response variable, to look for deviating runs, and to assess
prediction statistics.
When data are missing or experimental conditions have not been reached
In a real-life situation it is not always possible to reach the target experimental conditions,
or an experiment may not go as planned. In such cases one cannot apply the classical DoE
analysis methods, but one can use a PLS fitting method instead. The validation procedure of
the PLS by jack-knifing will provide approximate p-values for the B-coefficients; see the
above section on Analysis with PLS regression.
More information on PLS regression can be found in the chapter on Partial Least Squares
regression.

8.2.14 Advanced topics for unconstrained situations


In the following section, a few tips that might come in handy when building a design or
analyzing designed data are presented.

How to select design variables


Choosing which variables to investigate is the first step in designing experiments. That
problem is best tackled during a brainstorming session in which all people involved in the
project should participate, reducing the likelihood of overlooking an important aspect of the
investigation.


For a more extensive screening, variables that are known not to interact with other variables
can be left out. If those variables have a negligible linear effect, one can choose a constant
level for them (e.g. the least expensive). If those variables have a significant linear effect,
they should be fixed at the level most likely to give the desired effect on the response.
The previous rule also applies to optimization designs, if it is known that the variables in
question have no quadratic effect. If it is suspected that a variable can have a nonlinear
effect, it should be included in the optimization stage.

How to select ranges of variation


Once the variables to be investigated have been defined, appropriate ranges of variation
remain to be established.
For screening designs, one is generally interested in covering the largest possible region. On
the other hand, no information is available in the regions between the levels of the
experimental factors unless it is assumed that the response behaves smoothly enough as a
function of the design variables. Selecting the adequate levels is a trade-off between these
two aspects.
Thus a rule of thumb can be applied: Make the range large enough to give an effect and
small enough to be realistic. If it is suspected that two of the designed experimental runs will
give extreme, opposite results, perform those first. If the two results are indeed different
from each other, this means that enough variation has been generated. If they are too far
apart, and too much variation has been generated, the ranges should be decreased somewhat.
If they are too close, try a center sample, as there might just be a very strong curvature!
Since optimization designs are usually built after some kind of screening, one should already
know roughly in what area the optimum lies. So unless a CCD is being built as an extension
of a previous factorial design, one should try to select a smaller range of variation. This way
a quadratic model will be more likely to approximate the true response surface correctly.

The importance of having measurements for all design samples


Analysis of effects and response surface modeling are specially tailored for orthogonally
designed data sets, and are ideally run when response values are available for all
the designed samples. The reason is that those methods need balanced data to be fully
applicable. As a consequence, one should exercise great care when collecting response
values for all experiments. If a measurement is lost, for instance due to some instrument
failure, it might be advisable to redo the experiment later to collect the missing values.
If, for some reason, some response values simply cannot be measured, one can still use
the standard multivariate methods available in The Unscrambler®: PCA on the responses,
and PCR or PLSR to relate response variation to the design variables.

8.2.15 Advanced topics for constrained situations


This section focuses on more technical or “tricky” issues related to the computation of
constrained designs.

Is the mixture region a simplex?


In a mixture situation where all concentrations vary from 0 to 100%, it was shown in the
mixture design section that the experimental region has the shape of a simplex. This shape
reflects the mixture constraint (sum of all concentrations = 100%).


Note: If some of the ingredients do not vary in concentration, these are left out
from the mixture equation such that the ‘total amount’ refers to the sum of the
remaining mixture components. For instance if one wishes to prepare a fruit punch
by blending varying amounts of watermelon, pineapple and orange juice, with a
fixed 10% of sugar, the mixture components sum to 90% of the juice blend but to
100% of the ‘total amount’ (mixture sum). This ensures that the three mixture
components will span a 2-dimensional simplex that can be modeled by a regular
mixture design.
Whenever the mixture components are further constrained, like in the example shown
below, the mixture region is usually not a simplex.
With a multilinear constraint, the mixture region is not a simplex

In the absence of multilinear constraints, the shape of the mixture region depends on the
relationship between the lower and upper bounds of the mixture components. It is a simplex
if, for each mixture component, the upper bound + the sum of the lower bounds for the
remaining components equals (or exceeds) 100% (the total amount).
The figure below illustrates one case where the mixture region is a simplex and one case
where it is not.
Changing the upper bound of watermelon affects the shape of the mixture region

In the leftmost figure, the upper bound of watermelon is 100% - (17% + 17%) = 66%, and the
mixture region is a simplex. If the upper bound of watermelon is shifted to 55% as in figure
to the right, this value will be smaller than 100% - (17% + 17%) and the mixture region is no
longer a simplex.
Note: When the mixture components only have lower bounds, the mixture region is
always a simplex.
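The rule can be captured in a small helper function, a sketch following the condition stated
above (the function name is ours; the numbers reproduce the fruit punch example, assuming
lower bounds of 0/17/17 and no explicit upper bounds on the other two juices):

    def is_simplex(lower, upper, total=100.0):
        """The region stays a simplex as long as no explicit upper bound
        is tighter than the bound implied by the other components'
        lower bounds (total - sum of the other lower bounds)."""
        for i in range(len(lower)):
            implied = total - (sum(lower) - lower[i])
            if upper[i] < implied:
                return False
        return True

    # Watermelon upper bound 66% vs. 55% (lower bounds 0, 17, 17)
    print(is_simplex([0, 17, 17], [66, 100, 100]))  # True  -> simplex
    print(is_simplex([0, 17, 17], [55, 100, 100]))  # False -> not a simplex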

How to deal with small proportions


In a mixture situation, it is important to notice that variations in the major constituents are
only marginally influenced by changes in the minor constituents. For instance, an ingredient
varying between 0.02% and 0.05% will not noticeably disturb the mixture total; thus it can
be considered to vary independently from the other constituents of the blend.
This means that ingredients that are represented in the mixture with a very small proportion
can in a way “escape” from the mixture constraint.


So whenever one of the minor constituents of a mixture plays an important role in the
product properties, one can investigate its effects by treating it as a process variable.

Is a mixture design necessary?


A special case occurs when all the ingredients of interest have small proportions. Consider
the following example: a water-based soft drink consists of about 98% of water, an artificial
sweetener, coloring agent, and plant extracts. Even if the sum of the “non-water”
ingredients varies from 0 to 3%, the impact on the proportion of water will be negligible.
It does not make any sense to treat such a situation as a true mixture; it is better addressed
by building a classical orthogonal design (full or fractional factorial, central composite, Box-
Behnken, depending on the design objectives).

How to select reasonable constraints


There are various types of constraints on the levels of design variables. At least three
different situations can be considered.
 Some combinations of variable levels are physically impossible. For example: a
mixture with a total of 110%, or a negative concentration.
 Although the combinations are feasible, they are not relevant, or they will result in
difficult situations. Examples: some of the product properties cannot be measured,
or there may be discontinuities in the product properties.
 Some of the combinations that are physically possible and would not lead to any
complications are not desired, for example the cost of the ingredients may be
prohibitive.
During the define stage of a new design, give careful attention to any constraint that may be
introduced. An unnecessary constraint will not help solve the problem faster; on the
contrary, it will make the design more complex, and may lead to more experiments or
poorer results.
Design constraints
The first two cases mentioned above can be referred to as design constraints
because they should be included in the design itself. They cannot be disregarded
because if they are, one will end up with missing values in some of the experiments,
or uninterpretable results.
Optimization constraints
The third case can be referred to as an optimization constraint. Whenever
considering introducing such a constraint, examine the impact it will have on the
form of the design. If it turns a perfectly symmetrical situation, which can be solved
with a classical design (factorial or classical mixture), into a complex problem
requiring a D-optimal algorithm, it may be better to disregard the constraint.
For the third situation, build a standard (orthogonal or mixture) design and take the
optimization constraint into account afterwards, at the result interpretation stage. For
instance, a constraint on one or multiple design or response variables can be added to a
response surface plot, and the optimum solution selected within the constrained region.
This also applies to upper bounds on mixture components. As mentioned in the section Is
the Mixture Region a Simplex?, if all mixture components have only lower bounds, the
mixture region will automatically be a simplex. It is important to keep this in mind, so as to
avoid imposing an unnecessary upper bound on a constituent that plays a similar role to the
others. The expense of a material (and the wish to limit its usage to a minimum) should not
be turned into a design constraint for an important study. It can instead be handled at the
interpretation stage, where the mixture that gives the desired properties with the smallest
amount of that constituent is chosen.

8.3. Insert – Create design…


A new design is created by using the menu Insert – Create design…, which will open the
Design Experiment Wizard. This dialog contains a sequence of tabs, where the next tab
content often depends on the input in the previous tab.

 General buttons
 Start
 Define Variables
 Choose the Design
 Design Details
 Plackett-Burman designs
 Fractional factorial designs
 Full factorial designs
 Full factorial designs without blocking
 Full factorial designs with incomplete blocking
 D-optimal designs
 D-optimal designs including mixture constraints
 Central Composite and Box-Behnken designs
 Mixture designs
 Simplex mixture designs
 Non-simplex mixture designs and process+mixture designs
 Additional Experiments
 Randomization
 Summary
 Design Table

8.3.1 General buttons


Cancel
At any time it is possible to exit the Design Experiment Wizard and go back to The
Unscrambler® main interface by pressing the Cancel button.
Finish
At the bottom of each tab, the Finish button is located. Initially this button is disabled.
When sufficient information has been entered into the tab, the Finish button is made active.
By pressing this button all tasks in the design wizard are ended and the design is created in
The Unscrambler® navigator.

8.3.2 Start
The first tab in the sequence is divided into four sections:

 Name
 Goal
 Description
 History

Start tab

Name
By default the design will be named “MyDesign”. You may change this to the name you
would like the design to have in the project navigator later.

Goal
Select the most appropriate goal of the experiment. Based on this selection and the
number/type of design variables, the wizard will propose a suitable design.
Screening
In a screening experiment the goal is to isolate design variables that have a
significant main effect on the response variable(s).
When selecting this goal, the Design Experiment Wizard will favour either a Plackett-
Burman design or a low resolution Fractional Factorial design, provided the design
variables are not under any constraints. For mixtures an Axial design will be
suggested, and a low number of samples will be suggested if a D-optimal design is
selected.
Screening with interaction
In a screening with interaction experiment (often referred to as a factor influence
study) the goal is to assess both the main effects and the interactions of the design
variables on the response variable(s).
When selecting this goal, the Design Experiment Wizard will favour either a higher
resolution (IV or V) Fractional Factorial or a Full Factorial design, provided the
design variables are not under any constraints. For mixtures a Simplex Lattice
design will be suggested, and the default terms and number of samples for a D-
optimal design will be adjusted accordingly.
Optimization
When choosing optimization as the goal, the design investigates main effects,
interactions and square terms on the response variable(s).
By choosing optimization as the goal, the Design Experiment Wizard will favour
either a Central Composite or Box-Behnken design, provided the design variables
are not under any constraints. The suggested mixture design will be a Simplex
Centroid design, and the number of terms and samples for a D-optimal design will
be higher.

Note: In Optimization no category variables can be optimized. If there are category
variables to be investigated, it is necessary to break down the design strategy into
two stages:
 Find the optimum levels for the category variables (include any non-category
variables that can interact with them).
 Find the optimum for the non-category variables using the optimized values
for the category variables.

Description
Edit the blank section to store information on the design and specific details about the
experiments.

History
This part contains information on the history of the design such as the creator, the date of
creation and possible revisions. It is auto-generated by the Design Experiment Wizard.

8.3.3 Define Variables


In this tab, define the design space as well as other variables such as the response variables
and the non-controllable variables.
It is divided into two sections:
 Variable table, which displays the defined variables.
 Variable editor, which allows the addition of new variables or the deletion/editing of
previously defined variables.
Define variables tab


Variable table
This table contains information on all the variables to be included in the experiment. The
variables are ordered as follows:

 Design variables (factors, components)


 Response variables
 Non-controllable variables

The variables can be re-ordered within their category by using Ctrl+arrow up or down.
To edit a variable, highlight the corresponding row, modify the information in the variable
editor, and click OK.
To delete a variable, highlight the corresponding row and click the Delete button.

Variable editor
Click the Add button to add a new variable.
Specify the characteristics of the new variable as follows:
ID
The identity of the variable will be auto-generated. Design variables will have upper
case IDs (A-Z, except reserved letter I), response variables will have integer IDs, and
non-controllable variables will have lower case IDs (a-z, except i). Design variables
no. 26 and onwards are denoted A1, B1, etc.
Name
Enter a descriptive name in the Name field. If nothing is added here, the ID will be
used as name.
Type
Select the variable type from the following list using the radio buttons:

 Design: Design variables (factors) submitted to experimentation.
 Response: Measured variables assumed to depend on the levels of the design variables.
 Non-controllable: Variables not submitted to experimentation, but which may have an effect on the design. They can be measured for the purpose of including them in a regression model.

Constraints
Select the appropriate constraint setting for the variable (by default no constraints):

 Linear: If at least two variables are submitted to a common constraint (for
example, a constraint of the form A + B ≤ 10), they should be defined as having
linear constraints.
 Mixture: If at least three variables are part of a mixture, they may be defined
as having a mixture constraint. This implies that the sum of all mixture
components equals the Mixture Sum (100%).

Type of levels
The levels are either continuous or category:

 Use Continuous if the variable is measured on a continuous scale. This
means that it is possible, and that it makes sense, to rank the levels with
respect to each other: for example, the high level is larger than the low level,
and values in between the upper and lower levels exist. Only two levels are
specified for continuous variables.

 Use Category if the variable can change between 2 or more distinct levels or
groups, but where one group/level cannot be ranked on a numerical scale in
relation to the others. For instance the level ‘apple’ cannot be ranked as
higher/lower/better/worse than level ‘pear’. Similarly it is not possible to
calculate an average level between category groups. Two or more levels can
be defined for category variables (max. 20). If category variables of more
than two levels are included, the only available design will be the Full
Factorial (without blocking).

Note: Never define a numeric variable as category in order to enable more levels in
the design. These are interpreted differently and the analysis will be wrong. For
optimization designs that require more than two levels to fit a response surface,
additional levels will be added later based on the defined high and low levels.

Level range / Levels

 For continuous variables: place the bounds of the design space with the low
and the high values in the Level range field. By default the levels are -1 and
1 (or 0 and 100 for mixture variables).
 For category variables: the Levels section makes it possible to edit the
number and names of the levels. The default values are “Level 1” and “Level
2”.

Units
Specify any unit for the variable in question. For mixture variables the default unit is
’%’.


Mixture Sum
(Available for mixture variables only.) This is the sum of all mixture components in
the blend. The default value is 100 (%), but any positive value is allowed.

8.3.4 Choose the Design

Different types of experimental design


Different designs can be created depending on the:

 Number of variables
 Constraints on the variables
 Goal of the experiment.

The Unscrambler® suggests the most appropriate design following some rules. Use the
radio-buttons to select a different design than the suggested one. Note that there are
limitations on which designs can be selected based on the number and type of design
variables, however the goal of the experiment can be overridden by the user. The suggested
design remains displayed in bold.
When a full factorial design is selected, a check-box is used to enable (incomplete) blocking.
Select blocking in cases where groups of experimental runs have to be performed under
different settings. For instance if one batch of raw material is insufficient for the full
experiment, different batches will have to be used for different runs. Blocking ensures that
any potential batch effect will not be confounded with other important effects such as main
effects.

Beginner and expert mode


In Beginner mode, the design description is intuitive for those not experienced with DoE. In
Expert mode, select the design by choosing the actual design name.
It is possible to change the view by using the Beginner/Expert cursor.
Choose the design tab in Beginner mode


Information
The information box provides information on the selected design.

Design selection criteria used by the design wizard


The Design Experiment Wizard will always suggest a design taking into account 3 pre-defined
criteria:

 Goal
 Number of variables
 Constraints on the variables

The rules are as follows (see the sketch after this list):


 In situations where no constraints are applied:
If the goal is Screening and # of variables ≥ 11, then a Plackett-Burman design is
selected.
If the goal is Screening and # of variables > 2 and < 7, then a fractional factorial
design of resolution III is selected.
If the goal is Screening with interaction and # of variables > 4, then a fractional
factorial design is selected. Make sure to select a resolution IV design or higher.
If the goal is Screening with interaction and # of variables ≤ 4, then a full factorial
design is selected.
If the goal is Optimization and # of variables ≤ 6, then a Central composite design is
selected.
If the goal is Optimization and # of variables > 6, this is not possible, as too many
experiments are required to be practically feasible. The optimization should be
performed in steps.
 In the situation where Mixture constraints are applied:
At least 3 mixture variables have to be defined. If the experiment contains mixture
variables only, a mixture design will be suggested by default. Depending on the
defined goal: Screening selects an axial design, Screening with interaction selects a
Simplex-Lattice design and Optimization selects a Simplex-centroid design.
If additional constraints on the mixture components are imposed, the design region
might be non-simplex. Also, if process (i.e. non-mixture) variables are included
together with the mixture components, regular mixture designs cannot be used. The
appropriate choice for these setups is a D-optimal design.
 In the situation where linear constraints are applied, for non-simplex mixture
designs, or for designs containing both process and mixture variables:
The appropriate choice is a D-optimal design. Such designs require at least two
process variables or at least three mixture variables.
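
To make these rules concrete, here is a minimal Python sketch of the selection logic for the unconstrained case. It is illustrative only: the function name suggest_design and the returned labels are hypothetical, and the wizard may apply additional checks not stated here.

def suggest_design(goal, n_vars):
    # Illustrative rendering of the unconstrained selection rules above.
    if goal == "Screening":
        if n_vars >= 11:
            return "Plackett-Burman"
        if 2 < n_vars < 7:
            return "Fractional factorial, resolution III"
    elif goal == "Screening with interaction":
        if n_vars > 4:
            return "Fractional factorial, resolution IV or higher"
        return "Full factorial"
    elif goal == "Optimization":
        if n_vars <= 6:
            return "Central composite"
        raise ValueError("Too many variables; optimize in steps")
    return "No rule stated for this combination"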

8.3.5 Design Details


This tab allows a user to define the details of the various designs.
Plackett-Burman designs
When a Plackett-Burman design is selected, the Design Details tab displays a list of design
variables and a summary of the size of the design.
Design Details: Plackett-Burman

Fractional factorial designs


For a fractional factorial design there may be several possible resolutions, corresponding to
the available confounding patterns.
To change the resolution and the confounding pattern, there are two options:

 Use the drop-down box to select among the available number of design points
 Change the resolution with the radio buttons.

Design Details: Fractional factorial design


The confounding patterns for the selected design are displayed in a separate box. They can be
visualized using the variable IDs, in the form A + BC, or using the names of the variables. To
see the variable names, tick the box Show names.
After finishing a fractional factorial design, the resolution and confounding patterns will be
given in the Info box below the project navigator.
Full factorial designs
The Design Details tab looks different depending on whether blocking was selected in the
previous tab.

Full factorial designs without blocking


Details about the design variables and number of experiments are shown.
Design Details: Full factorial without blocking


Full factorial designs with incomplete blocking


When blocking is selected, the available number of blocks (per design replicate) is selected
in the Number of blocks drop-down box.
Depending on the number of blocks, the Block Generators are displayed in a separate
frame. These are given capital letter IDs similar to the design variables, but they are dummy
variables used for blocking only. They are named Generator_1, Generator_2, etc.
Design Details: Full factorial with blocking

The blocking generators, as well as all their confounding interactions, will be treated
separately from the remaining effects in the subsequent ANOVA. This means that no results
will be returned for any effects confounded with blocks. The Patterns frame allows
identification of the effects confounded with blocks.
After finishing a full factorial design with incomplete blocking, the block confounding
patterns will be given in the Info box below the project navigator.
D-optimal designs
This design type corresponds to variables with constraints applied, such as:

 Multilinear constraints on some variables


 Mixture variables with upper bounds that result in a non-simplex design region
 A combination of mixture and process variables.

This tab is used to:

 Set the constraints


 Set interactions and squares
 Edit the design settings
 Generate the design

Design Details: D-optimal design


Note:

 Adding variables with linear constraints automatically leads to a D-optimal design.
 Defining both mixture and process variables automatically leads to a D-optimal design.
 Multilinear constraints cannot be defined to include category variables.

Set the constraints


The Multilinear constraints frame includes a window where all the design constraints are
displayed, as well as an Edit button. Clicking this button will open a dialog where multilinear
constraints can be added, edited or removed.
Editing multilinear constraints

To add a new constraint, use the button Click to add new constraint. A list of all design
variables that are defined to have either Linear or Mixture constraints will be available for
editing. Select a multiple of each constrained variable, or set a variable to 0 if it is not part of
the current constraint.
The operator to be used in the multilinear constraint is selected from the drop-down list:

The ’<’ and ’>’ operators are convenience functions only. On setting up the candidate points,
the ’<=’ and ’>=’ operators will be used instead, with the target value modified down or up by
0.01 compared to the specified target. After specifying the target value, the new constraint will be
added to the Current constraints box.
Repeat the above procedure for adding additional constraints, or edit an existing constraint
by clicking on the relevant box in Current constraints.
If mixture variables are included in the design, a constraint that they sum to 100% (as given
by the Mixture sum), is added automatically. This constraint cannot be edited or removed.
To delete a constraint select it in the Current constraints table and click on the Delete
button.
Click OK when all of the desired constraints have been added. The constraints will then be
tested to check that they are both active and consistent.
An inactive constraint is one that is superfluous because it does not constrain the design
region as specified by the variable levels. If, for instance, the ranges of A and B are both [0,
10], a constraint that A+B>=0 will be inactive.
Inactive constraint warning

An inconsistent constraint is a constraint that is impossible to satisfy based on the design
variable levels. A constraint that A+B>=30 for the above design will be inconsistent, because
the sum of A and B at their maximum levels is 20.
Inconsistent constraint warning

If a constraint is found to be inactive or inconsistent, it should be reviewed carefully. When
all constraints are valid, click OK again to close the dialog. All specified constraints will then
be listed in the main dialog window.
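
The inactive/inconsistent tests can be illustrated with a short Python sketch. This is not the software's implementation; it simply evaluates the minimum and maximum of a '>=' constraint's left-hand side over the box defined by the variable level ranges, which reproduces the two warnings above.

def classify_constraint(coeffs, bounds, target):
    # Classify a 'sum(c_i * x_i) >= target' constraint against the
    # level ranges. bounds is a list of (low, high) tuples.
    lo = sum(c * (l if c > 0 else h) for c, (l, h) in zip(coeffs, bounds))
    hi = sum(c * (h if c > 0 else l) for c, (l, h) in zip(coeffs, bounds))
    if lo >= target:
        return "inactive"      # satisfied everywhere in the region
    if hi < target:
        return "inconsistent"  # impossible anywhere in the region
    return "active"

# The two examples above, with A and B both ranging over [0, 10]:
print(classify_constraint([1, 1], [(0, 10), (0, 10)], 0))   # inactive
print(classify_constraint([1, 1], [(0, 10), (0, 10)], 30))  # inconsistent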


Set interactions and squares


Any D-optimal design will include the main effects of all design variables as a minimum. In
addition some types of interaction and square terms are available depending on the type of
design variables included. These are

 Second order mixture: These are all 2-variable interaction terms between the
mixture components;
 Process interactions: These are all 2-variable interaction terms between the process
variables;
 Process squares: These are all quadratic terms of the process variables;
 Mixture and process interactions: These are all interactions of the first order mixture
terms with any first or second order process term.

Check the appropriate boxes to pre-select any of these groups of terms. For designs with
process (non-mixture) variables only, use the following guidelines:

 Screening: the model to study is a linear model. There is no need to add interaction or
square terms.
 Screening with interaction: the model should include the process interaction terms.
 Optimization: the model should include the process interactions as well as the
process squares.

For mixture designs, include second order mixture terms if the goal is Screening with
interaction or Optimization.
For process/mixture designs it may be useful to optimize either the process or mixture
variables, while sampling for the main effects only of the remaining group. It is also possible
to include the second order terms for both types of variables while not including interactions
between the two. By assuming that there are no interactions between the process and
mixture variables, the number of experiments can be greatly reduced.
For a more specific selection of model terms click the Modify button. This will bring up a
dialog listing all higher order terms available for selection. The selected effects are listed in
the left box and the non-selected effects are listed in the right box. All main effect terms
(and offset if non-mixture design) are included by default and will not be listed. Any second
order mixture, process interaction and process square terms will be available for selection.
Any mixture and process interaction terms will be available for selection only if this box is
checked in the Model terms frame.
Dialog for selection of interaction and square terms


The Add and Remove buttons can be used to move highlighted terms from
one box to the other. The Add All and Remove All buttons do the same for all available
terms. The Add Int button adds all second order mixture as well as process interaction terms
to the model, whereas Add Square moves all process square terms to the Selected Effects
box. Click OK to keep the changes or Cancel to discard them. If some but not all of the terms
of a given order are selected, the corresponding check-box will be in a filled state
(intermediate between the checked and empty states).
Edit the design settings
The total number of design points is divided between a number of D-optimal design points,
space filling points and additional center points. The default sum of D-optimal and space
filling points is given by the number of model terms and the Goal of the experiment. An
offset is included in the model terms only if no mixture components are specified.

 If Goal=Screening, three points more than the number of model terms are suggested,
plus three additional center points.
 If Goal=Screening with interaction, six points more than the number of model terms
are suggested, plus four additional center points.
 If Goal=Optimization, nine points more than the number of model terms are
suggested, plus five additional center points.
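
As a quick reference, these defaults can be written out as a small Python helper. The function name suggested_points is hypothetical; the code simply restates the rules above.

def suggested_points(goal, n_model_terms):
    # (extra design points, additional center points) per goal
    extra, centers = {
        "Screening": (3, 3),
        "Screening with interaction": (6, 4),
        "Optimization": (9, 5),
    }[goal]
    return n_model_terms + extra, centers

print(suggested_points("Optimization", 10))  # (19, 5)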

The minimum number of design points is the same as the minimum number of D-optimal
points. These are limited by the number of model terms.
The maximum number of design points is the same as the maximum number of D-optimal
points, which is limited by the number of candidate points. As the candidate points are
generated only when the Generate button is pressed, a warning will be given if too many
design points are specified.
The minimum number of space filling and additional center points is zero. Note that the
candidate points list will contain one center point which might be added even though the
number of additional center points is set to zero.
Change the default number of center points in the Additional Experiments tab. Note that the
center sample coordinates will be calculated (or re-calculated) only when the Generate
button is pressed.


An Advanced Design Settings dialog opens when clicking the More button. Three settings are
tuned in this window:
 Number of initial tries: There is no guarantee that a single run of the D-optimal
algorithm will return the globally optimal set of design points. To avoid getting stuck
in local optima the algorithm can be run multiple times using different starting
conditions. Only the result with highest D-optimality is returned. The default number
of initial tries is 5, and this value can be changed between 1 and 1000.
 Random points in the initial sets: To speed up the algorithm the starting set is not
completely random. Rather a smaller random set is used and points are added
sequentially to maximize the D-optimality of the starting design. The number of
random points in the initial sets can be tuned between the number of model
terms and the specified number of D-optimal points.
 Max number of iterations: Here you can set an upper limit on the number of point
exchange operations that will be performed. The default limit is 100, the lower limit
is 10 and the upper limit is 1000 iterations. You may try to increase the number if
you experience convergence problems.
The Advanced Design Settings dialog

Click OK to keep the changes or Cancel to discard them.


Generate the design
A sequence of operations is performed when the Generate button is pressed. First the
candidate point list is generated based on the constraints. The number of candidate points is
the effective upper limit on the number of design points, and a warning will be given if too
many design points have been specified. Also the center point coordinates are generated
and will be displayed in the Additional Experiments tab. Then the specified number of D-
optimal points is found by the exchange algorithm, before these points are supplemented
with the specified number of space filling points and finally with the number of additional
center points.
The resulting design matrix is returned and the condition number is displayed in the Design
Experiment Wizard. The condition number is an indication of the orthogonality of a design:
the lower the condition number, the better.
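
In its usual definition, the condition number is the ratio of the largest to the smallest singular value of the model matrix; whether the software applies exactly this definition is an assumption, but it is the standard one. A minimal NumPy illustration:

import numpy as np

# A 2^2 full factorial with an intercept column: perfectly orthogonal.
X = np.array([[1, -1, -1],
              [1, -1,  1],
              [1,  1, -1],
              [1,  1,  1]], dtype=float)
print(np.linalg.cond(X))  # 1.0 for an orthogonal design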

D-optimal designs including mixture constraints


If three or more variables are defined to have Mixture constraints, a D-optimal design can be
generated. If there is a combination of process and mixture variables, a D-optimal design is
the only available option. Also if the upper level of one or more of the mixture components
is lower than the Mixture Sum, or if additional constraints are imposed on them, the design
region may have a non-simplex shape. D-optimal designs should be used for non-simplex
design regions as the standard mixture designs will not work.
Such a design is set up in a similar manner to a D-optimal design without mixture
components. The main difference is that a mixture constraint including all mixture
components is added automatically. These are required to sum to 100%.

Note: Currently classical ANOVA and response surface plots are not available for
non-simplex and process/mixture designs. In order to take advantage of these
features, you might consider if a regular mixture design could be an alternative.

Central Composite and Box-Behnken designs


Available optimization designs are:

 Circumscribed Central Composite (CCC)


 Inscribed Central Composite (ICC)
 Faced Central Composite (FCC)
 Box-Behnken (BB)

Use the radio buttons to select the most appropriate design. For more information on these
designs please refer to the Theory section.
Design Details: Central Composite and Box-Behnken designs

The star point distance is the distance from the origin to the axial points in normalized units
(i.e. given that upper and lower levels of factorial points are 1 and -1, respectively). The
default star point distance for CCC designs ensures rotatable designs. For ICC designs, the
inverted value is used, which by default gives rotatable designs for ICC designs as well.
The star point distance for FCC designs is always 1 (non-rotatable).
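
The manual does not state the formula for the default star point distance, but the standard textbook result for a CCC whose factorial core is a full two-level design in k factors is alpha = (2^k)^(1/4), which is what the sketch below assumes:

# Rotatable star point distance for a CCC with a full 2^k factorial core
# (textbook result; assumed here, not taken from the manual).
def star_distance(k):
    return (2 ** k) ** 0.25

print(star_distance(2))  # ~1.414
print(star_distance(3))  # ~1.682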
The following table is given as a guide to find the most appropriate design:

Design          Number of levels   Uses points outside high/low levels   Accuracy of estimates
Circumscribed   5                  Yes                                   Good over entire design space
Inscribed       5                  No                                    Good over central subset of the design space
Faced           3                  No                                    Fair over entire design space, poor for pure quadratic coefficients
Box-Behnken     3                  No                                    Good over entire design space, more uncertainty on the edge of the design area

Mixture designs

Simplex mixture designs


Whenever three or more variables with Mixture constraints are defined, and there are no
other variables in the design, the mixture design tab is accessible.
Design Details: Mixture design

Axial
In an axial design all points lie on axes that go from each vertex through the overall
centroid, ending up at the opposite surface or edge. At these end points the
component in question is zero and the remaining components have equal
concentrations.
The end points allow the study of blending processes where each component may
be reduced to zero concentration. These can optionally be left out from the
experiment by un-checking the Include end points box.
Simplex lattice
A simplex lattice design is the mixture equivalent of a full-factorial design where the
number of levels can be tuned. It can be used for both screening and optimization
purposes, according to the lattice degree of the design.
The Lattice degree equals the number of segments into which each edge is divided.
This corresponds to the maximal order that can be calculated for the subsequent
model. Edit the degree by changing the default value.
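
The size of a simplex lattice grows combinatorially with the number of components q and the lattice degree m; the standard count is C(q + m - 1, m) runs. A small Python check (illustrative; the helper name lattice_runs is hypothetical):

from math import comb

def lattice_runs(q, m):
    # Number of points in a {q, m} simplex-lattice design.
    return comb(q + m - 1, m)

print(lattice_runs(3, 2))  # 6 runs for 3 components, degree 2
print(lattice_runs(4, 3))  # 20 runs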
Simplex centroid
A Simplex centroid design consists of extreme vertices, center points of all “sub-
simplexes”, and the overall centroid. A “sub-simplex” is a simplex defined by a
subset of the design variables.
Simplex centroid designs are well suited for optimization purposes. If Augmented
design is checked, axial check blends are added to the design. These are the same as
the Axial points in an Axial design.
Adjust mixture levels
There are certain limitations on which ranges are allowed for the components in a
mixture design:
1) The design levels must be consistent. This has to do with the mixture constraint
that all component concentrations must sum to the Mixture Sum (100%). If for
instance the lower level of one component is constrained to 20%, the upper level of
the remaining components cannot exceed 80% (see image below).
2) Any (consistent) design region has to be of simplex shape, i.e. it must form a
triangle for 3 components, a tetrahedron for 4 components, etc. Imposing upper
limit constraints on some of the mixture components will often lead to a non-
simplex design region.
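One plausible way to implement the consistency test in condition 1) is sketched below in Python, under the assumption that "consistent" means every stated bound is actually attainable when the components must sum to the Mixture Sum. This is an illustration, not the software's algorithm.

def levels_consistent(lows, highs, mix_sum=100.0):
    # Every bound must be attainable given sum(components) == mix_sum.
    if sum(lows) > mix_sum or sum(highs) < mix_sum:
        return False
    for i in range(len(lows)):
        others_low = sum(lows) - lows[i]
        others_high = sum(highs) - highs[i]
        if highs[i] > mix_sum - others_low:    # upper bound unattainable
            return False
        if lows[i] < mix_sum - others_high:    # lower bound unattainable
            return False
    return True

# The example above: a 20% lower bound on one component caps the others at 80%.
print(levels_consistent([20, 0, 0], [100, 80, 80]))  # True
print(levels_consistent([20, 0, 0], [100, 90, 80]))  # False
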
A mixture design is automatically tested for condition 1) above, and if the design is
consistent it is tested for condition 2). If either test fails, a warning is given and an
Adjust mixture levels button is activated. Clicking this button will open an adjust
mixture levels dialog with several options.
Adjust Mixture Levels

 Make levels consistent: Active whenever the test for consistency fails. The
bounds will be adjusted for consistency with the mixture constraint.
 Reset to user specified levels: Active whenever modifications have been done to the
constraints within the dialog. Reverts any modifications to those originally defined.

 Adjust with normalized levels: Active whenever any range differs from the
default [0, 100%]. All mixture bounds will be adjusted to their maximum
range as bounded by 0 and the Mixture Sum.

 Switch to d-optimal: Active whenever the design is consistent but non-simplex.
Applies any changes to the constraints, closes the dialog and switches to the tab for
D-optimal designs.

 Adjust to simplex: Active whenever the design is consistent but non-simplex.
Applies a general adjustment to turn the experimental region into a simplex shape.
The pre-defined upper and lower levels may be exceeded.

On pressing OK, the upper and lower levels of the components are updated with the
new values. If Cancel is pressed, the dialog is closed without taking any changes into account.
Only when the mixture design is both consistent and of simplex shape will the Finish
button be activated in the Design Experiment Wizard.

Non-simplex mixture designs and process+mixture designs


In the situations where imposed upper bounds or multilinear constraints lead to a non-simplex
design region, or where a combination of mixture and process variables is to be
analysed, a D-optimal design is required.

8.3.6 Additional Experiments


This tab allows one to manage the replication of the design, as well as to add center points
and reference samples.
It includes four sections:

 Design variables
 Replicated samples
 Center samples
 Reference samples

Additional experiment tab


Design variables
The design variables table provides a running summary of the design variables’ levels and
constraints.

Replicated samples
The number of replicated samples indicates the number of times the base design
experiments are run. Replication is used to measure the experimental error. Usually this is
done on center samples, however increasing the number of replicates in the design
improves the precision estimates of the design, by measuring replicates over the entire
design space. It is suggested to use at least two replicates of the design if the experimental
results are likely to vary significantly during the running of the experiment.
Note: Replicates (or replicated samples) are not the same as repeated
measurements. Replicates require a new experiment to be run using the same
settings for the design variables with a new experimental setup, while repeated
measurements are measurements performed on the same samples numerous times in a
short time period.

Center samples
Center samples are used as a test for curvature and as a source for error variance
estimation. In the latter case, use at least two (preferably three or more) center samples as
this improves the precision of any estimates. By default the Design Experiment Wizard
suggests a number of center samples. These can be modified by using the spin box next to
Number of center samples.
The center samples are experimental runs at the mid-level of the design variable ranges
when all design variables are continuous. This corresponds to the average (mean) of the
different variables in the design.
If 1-4 variables in the design are categorical and at least one is continuous, center points can
still be defined, however these are only defined for the continuous variables in the design.
Then a specified number of center points will be given for all combinations of categorical
levels. This ensures that the resulting design remains orthogonal.
An example is shown below for the simplest 2 factor factorial design at two levels, with one
category and for the 3 factor case with one center point defined.
Center point configurations of two factorial designs with one category variable

For the above designs it can be seen that two center points are required when there is one
categorical variable in the design. The center point is located at the mid-point of the
remaining continuous variables. The diagram below shows the 3 factor design with two
categorical variables, in which case 2² = 4 center points are needed.

In the situations described above, one replicate of center points was defined. In this case,
pure error cannot be calculated, as the center points are all unique. In order to calculate pure
error, replicates of these center points are required. For the 2 factor design, two replicates of
center points yield 4 center points in total. Each center point then provides 1 degree of
freedom per categorical level, i.e. 2 degrees of freedom in total for pure error.
For the 3 factor example with two categorical variables, two replicates of center points
result in 8 runs for center points alone. In this case, there are 4 unique center points,
therefore this situation provides 4 degrees of freedom for pure error. The more categorical
variables, the more center points are required, i.e. 2 center points minimum per categorical
variable. If replication is required, the number of center points can increase rapidly, to the
point where the number of center points exceeds the number of design points. In these
cases, the experimenter should assess if design replication is a better choice, or a
combination of a design replicate and a single replicate of center points. This depends on the
goal of the design and the budget one has for the experimentation. Also, refer to the section
below on modification of center points which describes how to modify and delete specific
center points.
Note: For designs with more than 1-2 categorical variables, it is usually both more
informative and more economical to replicate the entire experiment than to add
center points.

Modification of center points


It is possible to modify center points by double-clicking on the sample, which will open a
dialog box for editing.
Modify center sample

In the example presented here, variable D is categorical. Its value can be changed using the
drop-down list. It is also possible to delete this specific center sample by clicking on the
Delete button. When the level values for the category variables have been specified, click
OK.

Reference samples
In the field reference samples, it is possible to define samples which are incorporated for
comparison. A typical reference sample is a target sample, a competitor’s sample or a
sample produced after changes to a given recipe. The values of the design variables are not
entered and are set as missing; it can be modified later in The Unscrambler®.

8.3.7 Randomization
This tab allows a user to randomize the order of the experiments.
Randomization tab


Randomization is used to avoid bias induced by sequential experimentation. However, it is
sometimes necessary to perform some experiments in sequence, for example when a
parameter is difficult to change (such as the temperature of a blast furnace). In such
cases, it may be more practical to make all experiments with the same temperature at the
same time. In the Randomization tab, it is possible to specify blocks of similar samples to be
kept together during randomization.
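
The idea of keeping blocks of similar samples together during randomization can be illustrated with a few lines of Python. This is a generic sketch of the principle, not the software's routine: the order of blocks and the order of runs inside each block are shuffled, but runs sharing a hard-to-change setting stay adjacent.

import random

def randomize_keeping_blocks(runs, key):
    # Group runs by the hard-to-change setting, shuffle the block order,
    # then shuffle the runs inside each block.
    blocks = {}
    for run in runs:
        blocks.setdefault(key(run), []).append(run)
    order = list(blocks.values())
    random.shuffle(order)
    for block in order:
        random.shuffle(block)
    return [run for block in order for run in block]

runs = [("low T", i) for i in range(4)] + [("high T", i) for i in range(4)]
print(randomize_keeping_blocks(runs, key=lambda r: r[0]))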
Designed variables to randomize
This table displays the randomization pattern of the designed variables. It is possible
to edit the randomization pattern of the variables by clicking on the Detailed
randomization button.
By clicking on this button a new window opens. The selected variables (including
center points) will be randomized. When the desired pattern has been achieved,
click OK.
Define randomization

Randomized experiments
This table shows the sequence of experiments to run.


Re-randomize
If for any reason it is necessary to change the order of the samples, select the Re-
randomize button, and a new sequence of experiments will be generated.

8.3.8 Summary
This tab gives a summary of the complete design set-up, as well as the ability to calculate the
power of the design to detect small changes in the individual responses. A small change
means that the effect should be significant at a 5% level.
Summary tab

In order to calculate the power of the design:


 Enter the following parameters into the respective fields:
Delta
the required difference to detect in the response for successful experimentation
Std. dev. (also called Sigma)
the estimated standard deviation on the reference method used to obtain the
response
The signal-to-noise ratio (S/N) is provided as an indication.
 Click the Recalculate power button. The power for each response variable will be
displayed in the Power field.
The power of the design is its estimated ability to detect small but real changes in the
response values. Traditionally a power of 80% is regarded as good, which would imply a
20% probability of overlooking small effects. If the power of a design is low, one risks
performing expensive and time-consuming experiments that will not provide any answers.
Increase the power by adding additional experiments to the design, e.g. perform an
additional replication.
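
The manual does not specify the exact power formula used, but a textbook calculation for a two-level factorial illustrates the roles of Delta, the standard deviation and the design size. The sketch below assumes a two-sided t-test at the 5% level on a main effect whose standard error is 2*sigma/sqrt(N); treat it as an approximation only.

from scipy import stats

def effect_power(delta, sigma, n_runs, n_terms, alpha=0.05):
    # Approximate power to detect an effect of size delta in a
    # two-level factorial with n_runs runs and n_terms model terms.
    df = n_runs - n_terms
    se = 2 * sigma / n_runs ** 0.5       # std. error of a main effect
    nc = delta / se                      # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

print(effect_power(delta=2.0, sigma=1.0, n_runs=8, n_terms=4))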

8.3.9 Design Table


This tab shows the list of experiments to perform.
Design table tab


Different visualization options are available:


Randomized or Standard sequence
Randomized sequence is the sequence defined in the Randomization section, which
corresponds to the run order. Standard sequence is an ordered sequence
convenient for display.
Display order

Actual values or design levels


Actual values (or Actuals) are the levels as specified in the Define Variables tab,
these are the original units of the design variables.
Design levels are the levels in normalized units, i.e. [-1, 1] for factorial (process)
variables and [0, 1] for mixture components. Also called Level indices or Reals.
Display values

Select the options to be used with the available radio buttons.


After selecting the Finish button, the design matrices will be generated in The Unscrambler®
project navigator.

8.4. Tools – Modify/Extend Design…


To modify or extend a design, use the menu option Tools - Modify/Extend Design….
Modify/Extend Design menu


A dialog box will appear where one can select the appropriate design matrix to modify in the
field Choose design.
Modify/Extend Design dialog box

When the design is selected, click the OK button.


The Design Experiment Wizard will open. The History field of the Start tab will be modified,
and all the variables will be loaded with their previous settings.
Modified History field

Give the new design a unique name, modify any settings and click Finish when satisfied. This
will create a new design table in the project navigator.
All response values will be set to zero in the modified design.
Check the Insert – Create design… section to get more information about the design wizard.

8.4.1 To remember
When extending a design where some experiments have already been run, it is
recommended to add some extra center samples to check for bias over time in the
analysis.
Refer to the theory-section Extending a design for more details.


8.5. Tasks – Analyze – Analyze Design Matrix…


After clicking on Finish in the Create Design dialog, the design table is displayed in The
Unscrambler® project navigator. The design table contains all design variables (with
interactions), followed by the response variables and non-controllable variables (when
applicable).
The Design table is divided into sets (column ranges) depending on the model complexity:
Designs not containing mixture variables contain some or all of the sets:

 Design
 Response
 Non-controllable
 Main effects
 Main effects + Interactions (2-var)
 Main effects + Interactions (2- and 3-var)
 Main effects + Interactions (2-var) + quadratic
 Main effects + Interactions (2-var) + quadratic + cubic
 Main effects + Interactions (2 and 3-var) + quadratic + cubic

Designs containing mixture variables contain some or all of the sets:

 Design
 Response
 Non-controllable
 First order (Linear)
 Second order (Quadratic)
 Special cubic
 Full cubic
 Main effects + Responses

The tables are also divided into three to five sample sets (row ranges):

 All samples
 All design samples
 Center samples
 Design and center samples
 Reference samples

Data sets generated in The Unscrambler®


8.5.1 Order of the runs


There are two ways in which to order the samples:

 Standard: This is the accepted standard order for design variables. In particular,
factorial designs adopt the standard (1), a, b, ab, … notation.
 Randomized: This order is the one generated after randomization, it provides the
experimental sequence the runs should be performed in.

Standard and randomized order view

The order can be changed by clicking on one of the two columns, then selecting Edit –
Sort and choosing Ascending or Descending.
Sort menu

8.5.2 Level values


There are two ways to view the design levels in the table: either as actual values or as level
indices.
Change between these views by ticking or unticking the Level indices option available in the
View menu.

8.6. DoE analysis


Go to Tasks - Analyze - Analyze Design Matrix… to open the Design Analysis dialog. The first
tab is the Model Inputs tab where the input data are specified along with which interactions
or higher order terms to include in the model. The Method tab suggests alternative analysis
strategies based on the input data and allows you to select the preferred method.


Model Inputs
 Select the Predictors and Responses to analyze. Only data tables created using the
Design Experiment Wizard (Insert–Create Design…) are accepted as input.
 Usually the predefined column sets Design and Response should be selected in the
Cols box of the Predictors and Responses, respectively. Select All rows. Note that
selecting less or more data may alter desirable properties of the design.
 Select the Effects to include in the model. The model can include more or fewer
terms; try a simpler model first.
 In subsequent analysis, terms can be removed or added to the model. Select the
relevant effects and use the Move button to add/remove them from the analysis.
 For factorial designs with no category variables and at least one centre point, there
is an option to calculate Curvature. A Curvature term can be found in the Not
Estimated box and is calculated by moving it to the Estimated box. Curvature
removes one degree of freedom from Lack of Fit calculations and is used to
determine whether the model is linear or not. Note that even if the curvature term
is added in the ANOVA, the final model (i.e. regression coefficients and predicted
responses) does not include the curvature term. Because the residual degrees of
freedom is reduced when testing for curvature, avoid using it indiscriminately.

Note: The test for curvature will also remove some variation from the error term. In
some cases this may result in a low p-value for the model even though the model
itself does not include the curvature term. Therefore you should always verify your
final model by recalculating without curvature.

The Model Inputs tab


Method
Most designs may be analyzed using Classical DoE Analysis, which performs individual
ANOVAs for each response. If the design is heavily constrained or if multiple correlated
responses should be analyzed together, Partial Least Squares Regression may be a better
option. Other changes to a design such as modified factor levels or missing values might also
favour PLSR over ANOVA in some cases. Please refer to the theory section for a discussion
on the limitations of ANOVA.
The Method tab displays some useful properties of the design to make it easier to decide on
the best analysis method.

 Design: This is the name of the design.


 Design Type: This is the type of the design.
 Modified: If at least one of the design level values has been modified in the past,
this value will be set to Yes. Depending on the magnitude of the change, this may
have a high or low impact on the orthogonality properties of the design.
 Kept-out samples: While all samples may be very important in a design, especially
non-replicated ones, things may happen during the experiment or data collection
that lead to missing response values for some samples. This may severely reduce the
quality of the design. The number of kept-out or missing samples in the data table is
given here.
 Max. R2 Responses: If multiple, correlated responses are selected, attempting to
interpret them under the assumption that they are independent is a difficult (and
risky) endeavor. This value is the highest of all pairwise, squared correlations
between responses. If the value is higher than 0.5, PLSR is suggested by default.
 Condition Number: Constrained (D-optimal) designs and designs with modified
levels or missing runs will be non-orthogonal. As valid interpretation of an ANOVA
model relies on independent design parameters, highly non-orthogonal designs
should be analyzed using Partial Least Squares Regression rather than Classical DoE.
An orthogonal design has a condition number of 1, and for any non-mixture design
with a condition number larger than 100, Partial Least Squares Regression will be
selected by default. If the value is larger than 1000, Classical DoE will be disabled.
 D-efficiency: This property of the design is closely related to the D-optimality
criterion. A factorial design without center points has a D-efficiency of 100%. This
value decreases if additional points are added that do not contribute to making the
design more orthogonal, or if constraints are added to the design. Useful for
assessing the quality of D-optimal designs.

Note: Modify design levels with caution, as such changes to the design matrix
cannot currently be undone (change back manually or use Tools–Modify/Extend
design if needed).

Note: Mixture designs are by definition non-orthogonal and can have both large
condition numbers and small D-efficiencies. These designs can still be analyzed using
Classical DoE.

Select the preferred analysis method using the radio buttons and click OK to perform
analysis.
Analysis with ANOVA


8.7. Analysis results


A message will appear asking whether you want to display the model plots. Click on Yes or
No and the model will be added to the project navigator named “DOE Analysis”. Each model
contains the nodes Raw data and Results, and, if you decided to display it, Plots. There will
always be an option to right click on the model node in order to show or hide plots.
DOE Analysis results from a classical analysis in project navigator


For further information on how to interpret the plots that are generated, please refer to the
section on interpreting DoE plots.

8.8. Interpreting design analysis plots


Depending on the method selected to analyze the design data, different results will be
plotted. Select one of the following methods to see the appropriate plot interpretation.

 Accessing plots
 Available plots for Classical DoE Analysis (Scheffe and MLR)
 ANOVA overview
 ANOVA table
 Summary
 Variables
 Model check
 Lack of fit
 Diagnostics
 Effect visualization
 Effect summary
 Effect and B-coefficient overview
 Regression coefficients and their confidence interval
 B-coefficient table
 Effect visualization
 Effect summary
 Residuals overview
 Normal probability of Y-residuals
 Y-residuals vs. Y-predicted
 Histogram of Y-residuals
 Y-residuals in experimental order
 ANOVA table
 Diagnostics
 B-coefficients
 Regression coefficients and their confidence interval
 B-coefficient table
 Effect visualization
 Effect visualization
 Effect summary
 Cube plot
 Error table
 Predicted vs. Reference
 Response surface
 Response surface plot
 Response surface table
 Multiple comparison
 Multiple comparison plot
 Group table
 Distance table
 B-coefficient table
 Available plots for Partial Least Squares Regression (DoE PLS)
 Overview
 Weighted regression coefficients
 Explained Variance
 PLSR ANOVA p-values
 Predicted vs. Reference
 PLS-ANOVA Summary table
 Normal probability plot
 X- and Y-Loadings

8.8.1 Accessing plots


On finishing the calculation of a DoE model, the user is asked whether to view the plots or
not. Answering Yes will generate a sub-branch of the model called Plots in the project
navigator. This branch contains a number of readily accessible plot nodes.
Project navigator plot nodes

The availability of these plots is toggled by the options ‘Show plots’/’Hide plots’, accessible
from right clicking on the DoE model in the project navigator. This will add or remove the
Plots branch to the model. The plots are also available from the toolbar or from right-
clicking in any of the plot windows.

8.8.2 Available plots for Classical DoE Analysis (Scheffe and MLR)
ANOVA overview
The ANOVA overview plot node contains four plots. The plots described below are given for
all Plackett-Burman, Fractional Factorial and Full Factorial designs (unless otherwise noted).
For Optimization and Mixture designs, the Effect visualization and Effect summary plots are
replaced with a Response surface plot and table.

ANOVA table
The ANOVA table contains all sources of variation included in the model.
Sums of squares (SS)
This is an unscaled measure of the dispersion or variability of the data table. It is the
sum of squares of the distance from the samples to the average point. It increases
with the number of samples.
All calculations are based on coded levels, i.e. the variable ranges are scaled
between [-1, 1] for process variables and between [0, 1] for mixture variables.
Degrees of freedom (DF)
The number of degrees of freedom of a phenomenon is the number of independent
ways this phenomenon can be varied. In the model there is one DF for each
independent parameter estimated.
Mean squares (MS)
This is the ratio of SS over the degrees of freedom. It estimates the variance, or
spread, of the observations of the different sources in a comparable unit.
F-ratio
This is the ratio between explained variance (associated to a given predictor) and
residual variance. F-ratios are not immediately interpretable, since their significance
depends on the number of degrees of freedom. However, they can be used as a
visual diagnostic: effects with high F-ratios are more likely to be significant than
effects with small F-ratios.
p-value
A small value (for instance less than 0.05 or 0.01) indicates that the effect is
significantly different from zero, i.e. that there is little chance that the observed
effect is due to mere random variation.
There are several types of sources of variations grouped in different parts of the table:

 Summary
 Variables
 Model check
 Lack of fit

In addition, some Quality values are found at the end of the table, including:
Method used
This refers to the type of samples used to calculate the error values. It can take three
values:

 Design: the design is not saturated so the error values can be calculated on
the residual degree of freedom from the model.

 Center: the design is saturated so the error is calculated on additional
experiments: the center samples.
 References: the design is saturated so the error is calculated on additional
experiments: the reference samples.

R-square
Coefficient of multiple determination. A value close to 1 indicates a good fit, while a
value close to 0 indicates a poor fit.

 R-square = 1 - SS(Error) / SS(Total)
Adjusted R-square
Coefficient of multiple determination adjusted for the DF. While R-square will
increase towards 1 as more parameters (effects) are added to the model, this
statistic will favour additional terms only if the increase in SS is sufficiently high.

 Adjusted R-square = 1 - MS(Error) / [SS(Total)/(n-1)], n being the number of
design experiments

R-square prediction
R-square on the predicted values, which is the most conservative of the three
R-squares and says something about the predictive ability of the model.

 R-square prediction = 1- PRESS / SS(Total)

S
Estimate for standard deviation (Root Mean Squared Error of Calibration; RMSEC)
Mean
Average value of the reference Y values on samples taking part in the analysis.
C.V. in %
Coefficient of variation is a normalized measure of dispersion of a probability
distribution. The standard deviation expressed as a percentage of the mean.

 C.V. in % = 100 * S / Mean

PRESS
PRediction Error Sum of Squares is an estimate of the dispersion of leverage
corrected residuals. It accounts for the predictive ability of the model in the sense
that each residual value is estimated as if the sample was left out from the model
calibration. The magnitude of this statistic can be compared with the corrected total
SS (the smaller the better).
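
The quality values above can be reproduced from a vector of measured responses, fitted values and leverage-corrected predictions. The sketch below follows the formulas as stated, assuming MS(Error) uses the residual degrees of freedom n - p; it is an illustration, not the software's code, and the function name quality_stats is hypothetical.

import numpy as np

def quality_stats(y, y_fit, y_loo, n_params):
    # y_loo holds leverage-corrected (leave-one-out style) predictions.
    y, y_fit, y_loo = (np.asarray(a, float) for a in (y, y_fit, y_loo))
    n = len(y)
    ss_total = np.sum((y - y.mean()) ** 2)
    ss_error = np.sum((y - y_fit) ** 2)
    press = np.sum((y - y_loo) ** 2)
    r2 = 1 - ss_error / ss_total
    r2_adj = 1 - (ss_error / (n - n_params)) / (ss_total / (n - 1))
    r2_pred = 1 - press / ss_total
    s = np.sqrt(ss_error / (n - n_params))    # RMSEC
    cv = 100 * s / y.mean()                   # C.V. in %
    return r2, r2_adj, r2_pred, s, cv, press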
ANOVA table


Summary
The first part of the ANOVA table tests the significance of the model when all specified
effects are included. If the model p-value is small (e.g. smaller than 0.05), it means that the
model explains more of the variation in the response variable than could be expected from
random phenomena. In other words, the model is significant at the 5% level. The smaller the
p-value, the more significant (and useful) the model is.
Variables
The second part of the ANOVA table deals with each individual effect (main effects,
optionally also interactions and square terms). If the p-value for an effect is small, it explains
more of the variations of the response variable than could be expected from random
phenomena. The effect is significant at the 5% level if the p-value is smaller than 0.05. The
smaller the p-value, the more significant the effect is.
There are different ways to calculate sums of squares (SS), however for orthogonal designs
such as factorial designs they all give the same results. For non-orthogonal designs such as
D-optimal and mixture designs, this section tests the so-called Marginal (Type III) SS. This
corrects for the contribution of all other terms in the model irrespective of order, however
the individual contributions may not sum to the Model SS.


Model check
The model check tests whether it is beneficial to add terms of successively higher order to
the model. For orthogonal designs such as factorial designs, the individual contributions of
the terms of a particular order sum to the model check SS. If the p-value for a group of
effects is large it means that these terms do not contribute much to the model and that a
simpler model should be considered.
For D-optimal and mixture designs, the so-called sequential (Type I) SS is given in the Model
check section. Also higher order terms than the ones actually included in the model are
given here when relevant. This section will indicate the optimal complexity of the model
when adding terms in a hierarchical manner (i.e. lower order terms added before higher
order terms). If all tested terms are included in the model, the sum of contributions will
equal the Model SS.
Lack of fit
The lack of fit part tests whether the error in response prediction is mostly due to
experimental variability or to an inadequate shape of the model. If the p-value for lack of fit
is smaller than 0.05, it means that the model does not describe the true shape of the
response surface. In such cases, it may be helpful to apply a transformation to the response
variable.

Note:
 For screening designs, the model can be saturated. In such cases, one
cannot use the design samples for significance testing; the center samples
or reference samples are used.
 If the design has design variables with more than two levels, use the
Multiple Comparison plot and B-coefficient table in order to see which
levels of a given variable differ significantly from each other.
 Lack of fit can only be tested if the replicated center samples do not all
have the same response values (which may sometimes happen by
accident).

Diagnostics
This plot presents several values for assessing the quality of the fit of the model to each
individual response.
Standard Order
The standard order is the non-randomized order from the experiment generator.
Actual Value
These are the measured response values as given in the design table.
Predicted Value
This is the fitted response value as calculated from the model.
Compare this value to the actual value; the closer the two are, the better the fit of the model.
Residual
This is the difference between the actual and the predicted value.
Study all the values; the smaller they are, the better the model fits. Note that this says nothing about the predictive ability of the model when applied to new samples.
Leverage


The leverage is the distance of a projected sample from the center of the model. A sample with high leverage is an influential sample or a possible outlier. Note that for saturated models, the leverage is 1 for all samples and there are no residual degrees of freedom to estimate the error in the model.
Student Residual
A studentized residual is a residual divided by an estimate of its sample-dependent standard deviation. The values presented are the so-called internally studentized residuals, meaning that all samples have been included in the estimation of the standard deviation. This statistic can be used for the detection of outliers. For any reasonably sized experiment (e.g. n > 30), 95% of normally distributed studentized residuals will fall in the interval [-2, 2].
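A minimal sketch of this computation in Python (NumPy assumed; names are illustrative):

    import numpy as np

    def internally_studentized(residuals, leverages, resid_dof):
        # Internally studentized residuals: the residual variance s2 is
        # estimated from all samples, and each residual is scaled by its
        # own standard deviation, which depends on the sample leverage.
        e = np.asarray(residuals, dtype=float)
        h = np.asarray(leverages, dtype=float)
        s2 = np.sum(e ** 2) / resid_dof
        return e / np.sqrt(s2 * (1.0 - h))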
Cook’s Distance
The Cook’s distance of an observation is a measure of the global influence of this
observation on all the predicted values. This is done by measuring the effect of
deleting this given observation. Data points with large residuals and/or high leverage
may distort the outcome and accuracy of a regression.
The Cook’s distance gives an actual threshold to judge the samples. Points with a
Cook’s distance of 1 or more are considered to be potential outliers.
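Under the usual regression assumptions, Cook's distance can be sketched as follows (building on the studentized residual sketch above; p is the number of model parameters, and the names are illustrative):

    def cooks_distance(residuals, leverages, p, resid_dof):
        # D_i combines the studentized residual with the leverage:
        # points with both a large residual and a high leverage get a
        # large distance. D_i >= 1 flags a potential outlier.
        r = internally_studentized(residuals, leverages, resid_dof)
        h = np.asarray(leverages, dtype=float)
        return (r ** 2) * h / (p * (1.0 - h))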
Run Order
The run order is the (randomized) order of experimentation. There should not be a
run-order dependent trend in the above diagnostic tools.
Diagnostics

Effect visualization
This plot displays one effect at a time for a given response. To change the displayed effect
and the response click on the arrows or on one of the cells of the “Summary of the
effects” table.
It is useful to study the magnitude of the effects (change in the response value when the
design variable increases from Low to High) and the interactions.
There are two types of effects that can be visualized.
Main Effects
The plot shows the average response value for a specific response variable at the
Low and High levels of the design variable. If there are center samples, the average
response value for the center samples is also displayed. It is useful to study the
magnitude of the main effect (change in the response value when the design
variable increases from Low to High). If there are center samples, one can also
detect a curvature visually. For category variables with more than two levels, the
average response value for each category level is given.
Main effects with curvature


Interaction effects
The plot shows the average change in response values for a design variable
depending on the level of the other variable in a two-factor interaction. One line is
given for the Low level of the second design variable, and one line is given for the
High level of the second design variable.
It is possible to study the magnitude of the interaction effect (1/2 * change in the
effect of the first design variable when the second design variable changes from Low
to High).

 For a positive interaction, the slope of the effect for “High” is larger than for
“Low”;
 For a negative interaction, the slope of the effect for “High” is smaller than
for “Low”;
 For no interaction the curves are parallel.

Interaction Effects: No effect, Positive effect, Negative effect
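As a small numeric illustration of these definitions (the response averages below are made up):

    # Hypothetical averages of a response at the corners of a 2x2 design,
    # for design variables A and B:
    y_AlowBlow, y_AhighBlow = 10.0, 14.0
    y_AlowBhigh, y_AhighBhigh = 11.0, 19.0

    effect_A_at_Blow = y_AhighBlow - y_AlowBlow      # 4.0
    effect_A_at_Bhigh = y_AhighBhigh - y_AlowBhigh   # 8.0

    # Interaction AB = 1/2 * change in the effect of A when B goes Low -> High
    interaction_AB = 0.5 * (effect_A_at_Bhigh - effect_A_at_Blow)  # 2.0, positive

Here the slope of the effect of A is larger at the High level of B, so the interaction is positive.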

Effect summary
This table plot gives an overview of the significance of all effects for all responses. There are
three values per effect and per response:


 Significance: This coded value indicates if the effect is significant for the specific
response. The significance level is also reflected by the color of the row. See the
Significance levels and associated codes table below.
 Effect value: This is the value of the effect for the specific response variable.
 p-value: Result of the test of significance for the effect.

Effect Summary table

The sign and significance level of each effect is given as a code:


Significance levels and associated codes

    P-value limits       Negative effect   Positive effect   Color code
    P > 0.10             NS                NS                red
    0.05 < P <= 0.10     ?                 ?                 yellow
    0.01 < P <= 0.05     –                 +                 light green
    0.005 < P <= 0.01    – –               ++                dark green
    P <= 0.005           – – –             +++               dark green


NS: non-significant. ?: Marginally significant (alpha-level 10%).
Look for rows which contain many "+" or "–" signs and are green: these main effects or interactions are the most important for explaining the variance of the response in question.
If the design contains category variables with 3 levels or more, the effects table is replaced
with a multiple comparison plot in the ANOVA overview.
Effect and B-coefficient overview
This overview is available for all designs that contain continuous or 2-level category variables only. For category variables with 3 levels or more, no single regression coefficient or effect can describe the variable in question, and these plots would be less informative.

Regression coefficients and their confidence interval


This plot shows the value of the regression coefficients with their confidence intervals (CIs)
for one response variable.
The bigger the coefficient the more important the design variable for the response variable.
The smaller the CI the more accurate the coefficient.
Regression coefficients with their CI


Use the arrows to navigate from one response variable to another, or click on the response variable to be plotted in the Regression coefficient table.

B-coefficient table
This table presents the value of the B-coefficient for the associated design variables as well
as B0.
It also gives the 95% confidence interval for the B-coefficients. These values give an idea of
the accuracy of the estimate of the coefficients.
The p- and t-values are computed to test the null hypothesis, H0: the coefficient is equal to
0. Rejection of this hypothesis for a variable means that the variable is important for
describing the response in question. By comparing the t-value with its theoretical
distribution (Student’s T-distribution), the significance level of the studied effect is obtained.
The associated p-value represents the significance of the effect associated with the B-
coefficient. H0 can be rejected if the p-value is smaller than, say 5% (green color). This
implies that the effect in question is important for modelling the response.
B-coefficient table

Effect visualization
This plot is shown for all designs except mixture designs. For more information on this plot,
check the ANOVA overview section.

Effect summary
For more information on this plot, check the ANOVA overview section


Residuals overview
These plots can be used to check the adequacy of the model or look for outliers, provided
that there are ample residual degrees of freedom left to study the residuals. If the model is
close to saturated, i.e. the number of effects is almost as high as the number of
observations, artificially structured residuals will result that cannot be interpreted properly.

Normal probability of Y-residuals


This is a normal probability plot of the residuals of all the modelled effects. If effects are well
modelled, the residuals should contain unstructured noise only. Effects in the upper right or
lower left of the plot that do not approximately follow a straight line going through the rest
of the points, deviate from the normal distribution. This is an indication that the model is not
describing the sample very well – it may be an outlier.
The abd sample in the plot below is a typical example of an outlier. In this particular
example, it was found that the reason was a mis-typed response for that sample. After
correction the residuals of both abd and cef looked more like random noise.
Normal probability of Y-residuals

Y-residuals vs. Y-predicted


This is a plot of Y-residuals against predicted Y values. If the model adequately predicts
variations in Y, any residual variations should be due to noise only, which means that the
residuals should be randomly distributed. If this is not the case, the model is not completely
satisfactory, and appropriate action should be taken. If strong systematic structures (e.g.
curved patterns) are observed, this can be an indication of lack of fit of the regression
model. The figure below shows a situation that strongly indicates lack of fit of the model.
This is typical for a model that would benefit from including quadratic terms.
Structure in the residuals


The presence of an outlier is shown in the example below. The outlying sample has a much
larger residual than the others; however, it does not seem to disturb the model to a large
extent.
A simple outlier has a large residual

The figure below shows the case of an influential outlier: not only does it have a large
residual, it also attracts the whole model so that the remaining residuals show a very clear
trend. Such samples should usually be excluded from the analysis, unless there is an error in
the data table that can be corrected.
An influential outlier changes the structure of the residuals


Small residuals (compared to the variance of Y) which are randomly distributed indicate
adequate models.

Histogram of Y-residuals
This plot shows the distribution of the residuals, optionally with a statistics table displayed.
Histogram of Y-residuals

A symmetric bell-shaped histogram which is evenly distributed around zero indicates that
the normality assumption is likely to be true. This is the case in the above plot. Moderate departures from normality are usually acceptable. Change the resolution of the histogram by
toggling the number of bars in the toolbar.


Y-residuals in experimental order


This plot is a line/bar plot of the Y-residuals in experimental order. It is used to detect whether there is a time-dependent trend in the experimentation. If the Y-residual increases with the time of experimentation, some non-randomized variation is occurring: the experimentation is biased by a factor that varies with time. Try to identify it.
This plot can also detect if the variance/spread of the residuals changes over time, which
might violate the constant variance assumption.
Y-residuals in experimental order: No apparent time-effect (left), clear time-dependent effect

ANOVA table
For more information check the ANOVA overview section
Diagnostics
For more information check the ANOVA overview section
B-coefficients
This plot node is available for all designs except designs with categorical design variables
with three levels or more and for mixture designs.

Regression coefficients and their confidence interval


For more information on this plot, look at the section Effect and B-coefficient overview

B-coefficient table
For more information on this plot, look at the section Effect and B-coefficient overview
Effect visualization
This plot node is available for all designs except designs with categorical design variables
with three levels or more and for mixture designs.

Effect visualization
For more information check the DoE overview section


Effect summary
For more information check the DoE overview section
Cube plot
This plot is available for all factorial designs (incl. Plackett-Burman). It displays the average of
a specified response variable at the experimental points.
Cube plot

The plot is most useful when there are two or three design variables. If there are more than
three design variables it is possible to choose which cube to represent using the arrows for
X, Y and Z.
Error table
The error table is a summary of the quality parameters available for the analysis of design
data. See ANOVA table for a description of the individual terms.
Error table

Predicted vs. Reference


This is a scatter plot of the predicted response values vs. the reference/measured values.
The better the fit, the closer the values will fall on a straight line. See section on calibration
values in Predicted vs. Reference for details.
Predicted vs. Reference plot


Response surface
There are two types of response surface (RS) plots. A square response surface is given for
non-mixture designs and a triangular response surface is given for mixture designs.

Response surface plot


This plot is used to find the levels of the design variables that will give an optimal response,
and to study the general shape of the response surface. It shows the response surface for
one response variable at a time.
Look at this plot as a map which tells how to reach the experimental objective. Two design
variables are studied over their range of variation; the remaining ones are by default held
constant at their mean level. The levels of the non-plotted variables can be tuned in the RS
table. For mixture designs, three components are plotted, and the response surface has a
simplex (triangular) shape.
The response surface is initially viewed from the top, i.e. the axis showing the predicted
response points out from the plot and contour lines indicate where the predicted response
has the same value. Pointing the cursor anywhere in the design region will show the
coordinate values as well as the predicted response value for that point. A color-bar
translates the colors into levels of response values.
Response surface plot


The response surface can also be rotated and viewed in 3D from any angle using the mouse:
Rotated response surface plot


Different representations of the response surface can be seen by selecting the toolbar options Mesh, Floor Contour or Surface Contour.
Response surface right click options
The following options are available from the right click menu in a response surface plot.

From the DOE menu all available analysis plots can be accessed.
Click View to switch between Graphical or Numerical view (also accessible from the toolbar),
or to toggle the colorbar (Legend) on or off.
Copy a bitmap representation to the clipboard for pasting into other applications, or Save
Plot using either of the formats JPEG, PNG, BMP, PNM or TIFF.


The Auto Scale option available from the right click or toolbar menu will return to a default
size 2D-plot.
The following Properties can be tuned from the plot properties dialog:
Appearance

 The contour count: The number of contour lines on the plot.
 FloorContour: Toggle display of contour lines below the response surface on or off. Also accessible as a toolbar check box when the response surface plot is active.
 Mesh: Toggle display of a rectangular grid on the response surface on or off. Also accessible as a toolbar check box when the response surface plot is active.
 SurfaceContour: Toggle display of response surface contour lines on or off. Also accessible as a toolbar check box when the response surface plot is active.

Plot Font

 Bold: Toggle bold font for title, axis, colorbar and tooltip text on and off.

 Italic: Toggle italic font for title, axis, colorbar and tooltip text on and off.

 Name: Switch between font families Arial, Courier and Times for title, axis,
colorbar and tooltip text.

 Size: Set font size as a relative number. The plotting library automatically
attempts to find the best font size for different text. You may increase or
decrease the size of all plot text within the range of 0.1 (very small) and 4.0
(very large).

Response surface table


This table is used to select design and response variables to plot, to set the levels of non
plotted factors and optionally to impose optimization constraints on any of the design or
response variables. The latter is a very useful tool to find the optimum level combinations
for one or more responses. By imposing constraints on multiple responses simultaneously
and overlaying them in the same plot, it can immediately be seen which level combinations
are allowed and which fall outside of the (tuneable) optimization regions.
Design variables
In a response surface for non-mixture designs two design variables are plotted while
the others are fixed. For mixture designs, three mixture components are plotted in a
simplex plot.
To select the variables to plot, tick/uncheck the box in the Display column.
Optimization constraints for design variables can be set using the sliders or manually
enter the values in the Min and Max columns. The area outside of the selected
design region will be grayed out.


To set the level of the non-plotted variables enter the value manually in the column
Current. By default this value is the average value.
For mixture designs the levels of the components cannot vary independently of each other, as the mixture constraint imposes that all components must always sum to the Mixture Sum. Therefore, if a non-plotted variable is tuned, the axes and Max
levels of the plotted variables are updated accordingly. A minimum Max value
corresponding to 3.5% of the total range is enforced for plotted mixture
components.
For mixture designs there is an additional column with Freeze check-boxes. This is
useful for designs with 5 components or more. If the current level of a non-plotted
mixture variable is increased until the plotted variable axes cannot be reduced any
more, the levels of other non-plotted components will be reduced instead. If freeze
is checked for a non-plotted variable, its current value cannot be changed due to a
change in other variables.
For category variables select one of the levels using the drop-down list.
Response variables
Only one response variable can be plotted at a time. Select the response to plot by
ticking the variable of interest.
Optimization constraints for response variables can be set using the sliders or
manually enter the values in the Min and Max columns. Setting optimization
constraints for multiple responses simultaneously is a very useful tool for finding the
optimal design settings.
Response surface table

Multiple comparison
This node is given for non-saturated designs with at least one category variable. It shows
whether the distance between levels is larger than a critical distance, in which case the
levels are considered to belong in different groups. Because the critical distance is calculated
from the data, residual degrees of freedom are required for these plots to be displayed.

Multiple comparison plot


This is a comparison of the average of a given response variable for the different levels of a
design variable. It shows whether any of the levels are associated with a higher or lower
mean response compared to the other levels.
This plot displays one design variable and one response variable at a time. Use the toolbar arrows to switch between category variables and the toolbar drop-down box to change the response variable to display. If there is a significant difference between


one categorical level and the other levels, the average response values are plotted in
different groups along the X-axis.
Multiple Comparisons

 The average response value is displayed as a red square and its value can be read on
the vertical axis or by mouse-over.
 The levels are grouped along the horizontal axis by significantly different groups.
 The names of the different levels can be seen by mouse-over.
 Levels that are not significantly different are linked by blue vertical bars. Each
vertical bar is the size of half the critical distance. Two levels have significantly
different average response values if they are not linked by any bar.
 The critical distance is indicated in the x-axis title.

Group table
The group table shows the levels associated with the different groups. This table takes the
value 1 if the level is part of the specified group and 0 if not. One level can be associated
with several groups.
Group table

Distance table
This table shows, for a specific response variable and a specific category variable, the pairwise distances between the average response values of the levels.
Distance table


B-coefficient table
For more information look at the description in the B-coefficients section. If one of the
categorical variables has three levels or more, an Effect visualization is plotted instead of the
B-coefficient table.

8.8.3 Available plots for Partial Least Squares Regression (DoE PLS)
When PLSR is performed on designed data all the regular PLSR plots are available. In addition, DoE PLS has some plots that are useful for DOE purposes.
Overview

Weighted regression coefficients


This plot displays the weighted regression coefficients for the optimal number of factors
with their uncertainty limits.
Stable weighted B-coefficients show an uncertainty limit that does not cross the 0-line.
Regression coefficients

Explained Variance
This is the total explained variance plot for models of an increasing number of components.
Use the toolbar buttons to switch between X-/Y-variance, calibration/validation variance and explained/residual variance. The validation variance in DoE-PLSR is based on a full (leave-one-out) cross-validation.
Refer to the Explained variance plot in PLSR for more details.

PLSR ANOVA p-values


This plot displays the p-values obtained from the uncertainty test of regression coefficients.
Small p-values indicate model terms that most likely have an important effect on the
response.
Four significance levels are given at 0.01 (dark green), 0.05 (light green), 0.1 (yellow) and 0.2
(red). Terms with p-values lower than one of the lines are significant at the corresponding
level.
PLSR ANOVA p-values

Predicted vs. Reference


By default the Predicted vs. Reference plot shows the results for the first Y-variable. To see the results for other Y-variables, use the arrows next to the response values. In addition, by default the results are shown for a specific number of factors (or PCs), which should reflect the dimensionality of the model. If the number of factors (or PCs) is not satisfactory, it is possible to change it by using the PC icon.


The selected predicted Y-value from the model is plotted against the measured Y-value. This
is a good way to check the quality of the regression model. If the model gives a good fit, the
plot will show points close to a straight line through the origin and with slope close to 1.


Turn on Plot Statistics (using the View menu) to check the slope and offset, RMSEP/RMSEC
and R-squared. Generally all the y-variables should be studied and give good results.
Note: Before interpreting the plot, check whether the plots are displaying
Calibration or Validation results (or both).
Menu option Window - Identification tells whether the plots are displaying Calibration (if
Ordinate is yPredCal) or Validation (yPredVal) results.
Use the buttons to switch Calibration and Validation results off or on.
It is also useful to show the regression line, enabled with its toolbar icon, and compare it with the target line, enabled with the corresponding toolbar icon.

Some statistics giving an idea of the quality of the regression are available from the statistics toolbar icon.


When Calibration and Validation results are displayed together as shown in the figure below,
pay special attention to:
Differences between Cal and Val
If there are large differences, the model cannot be trusted.
R-squared
The first one (in blue) is the raw R-squared of the model, the second one (in red) is
also called adjusted R-squared and tells how good a fit can be expected for future
predictions. R-squared varies between 0 and 1. A value of 0.9 is usually considered
as pretty good but this varies depending on the application and on the number of
samples.
RMSE
The first one (in blue) is the Calibration error RMSEC, the second one (in red) is the
expected Prediction error RMSEP. Both are expressed in the same unit as the
response variable Y.
Predicted vs. Reference plot for Calibration and Validation, with Plot Statistics turned on, as
well as Regression line and Target line.
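The plot statistics can be reproduced outside the software. The sketch below (Python/NumPy assumed; names are illustrative, not part of The Unscrambler®) computes slope, offset, R-squared and RMSE from reference and predicted values; applied to calibration predictions it yields RMSEC, applied to validation predictions, RMSEP:

    import numpy as np

    def fit_statistics(y_ref, y_pred):
        # Slope and offset of the line fitted to predicted vs. reference.
        y_ref = np.asarray(y_ref, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        slope, offset = np.polyfit(y_ref, y_pred, 1)
        rmse = np.sqrt(np.mean((y_pred - y_ref) ** 2))
        # R-squared: fraction of the response variance explained.
        r2 = 1.0 - np.sum((y_ref - y_pred) ** 2) / np.sum((y_ref - y_ref.mean()) ** 2)
        return slope, offset, r2, rmse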

How to detect cases of good fit / poor fit


The figures below show two different situations: one indicating a good fit, the other a poor
fit of the model.
Predicted vs. Reference shows how well the model fits

Left: Good fit. Right: Poor fit.


How to detect outliers
One may also see cases where the majority of the samples lie close to the line while a few of
them are further away. This may indicate good fit of the model to the majority of the data,
but with a few outliers present (see the figure below).
Detecting outliers on a Predicted vs. Reference plot

In the above plot, sample 3 does not follow the regression line whereas all the other
samples do. Sample 3 may be an outlier.
How to detect nonlinearity
In other cases, there may be a nonlinear relationship between the X- and Y-variables, so that
the predictions do not have the same level of accuracy over the whole range of variation of
Y. In such cases, the plot may look like the one shown below. Such nonlinearities should be
corrected if possible (for instance by a suitable transformation), because otherwise there
will be a systematic bias in the predictions depending on the range of the sample.
Predicted vs. Reference shows a nonlinear relationship


PLS-ANOVA Summary table


This table presents the effect values for all variables as well as their significance levels and p-
values.
PLSR-ANOVA Summary table

Significance levels and associated codes

    P-value limits       Negative effect   Positive effect   Color code
    P >= 0.10            NS                NS                red
    0.05 <= P < 0.10     ?                 ?                 yellow
    0.01 <= P < 0.05     –                 +                 light green
    0.005 <= P < 0.01    – –               ++                light green
    P < 0.005            – – –             +++               dark green


NS: non-significant.
?: possible effect at the 10% significance level.


Normal probability plot


This is a normal probability plot of the Y-residuals after a given number of components. As
residuals are supposed to contain little or no structured variation, all the points should
ideally fall close to a straight line. See Normal probability of Y-residuals for more details.
X- and Y-Loadings
A 2-D scatter plot of X- and Y-loadings for two specified components (factors) from PLSR is a
good way to detect important variables and relationships between variables. The plot is
most useful for interpreting component 1 vs. component 2, since they represent the largest
variations in the X-data that explain the largest variation in the Y-data. By default both Y-
and X-variables are displayed but it is possible to modify that by clicking on the X and Y
icons.
Interpret the X-Y relationships
To interpret the relationships between X and Y-variables, start by looking at the response (Y)
variables.

 Predictors (X) projected in roughly the same direction from the center as a response,
are positively linked to that response. In the example below, predictors sweet, red
and color have a positive link with response Pref.
 Predictors projected in the opposite direction have a negative relationship, as
predictor thick in the example below.
 Predictors projected close to the center, as bitter in the example below, are not well
represented in that plot and cannot be interpreted.

Cheese experimentation: six responses (Adhesiveness, Stickiness, Firmness, Shape retention, Glossiness, Meltiness) and four process predictors (Amount of dry matter, Maturity, pH and Addition of recycled dry matter)

Maturity has a negative effect on the adhesiveness of the cheese; they are anti-correlated. The amount of dry matter affects stickiness positively, and glossiness and meltiness negatively. Glossiness and meltiness, two responses, are correlated.


Caution! If the X-variables have been standardized, one should also standardize the
Y-variable so that the X- and Y-loadings have the same scale; otherwise the plot
may be difficult to interpret.
The plot shows the importance of the different variables for the two components specified. It is possible to change the display by using the PC drop-down list. The plot should preferably be used together with the corresponding scores plot: variables with loadings to the right in the loadings plot will be X-variables which usually have high values for samples to the right in the scores plot, etc. This plot can be used to study the relationships among the X-variables and between the X- and Y-variables.
If the Uncertainty test was activated the important variables will be circled. It is also possible to mark them by using the corresponding toolbar icon.
Loadings plot with circled important variables

Note: Downweighted variables are displayed in a different color so as to be easily identified.

Correlation loadings emphasize variable correlations


When a PLSR analysis has been performed and a two-dimensional plot of loadings is displayed on the screen, the Correlation Loadings option (available from the View menu and the toolbar icon) can be used to aid in the discovery of the structure in the data. Correlation loadings are computed for each variable for the displayed factors. In addition, the plot contains two ellipses to help check how much variance is taken into account. The outer ellipse is the unit circle and indicates 100% explained variance. The inner ellipse indicates 50% explained variance. The importance of individual variables is visualized more clearly in the correlation loadings plot than in the standard loadings plot.
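Correlation loadings can be computed as the correlation between each (centered) variable and the score vector of each displayed factor. A minimal sketch, assuming NumPy with variables as columns (names are illustrative):

    import numpy as np

    def correlation_loadings(X, T):
        # Correlation between each centered variable (columns of X) and
        # each score vector (columns of T). The squared value is the
        # fraction of that variable's variance explained by the factor,
        # which is what the 50% and 100% ellipses refer to.
        Xc = X - X.mean(axis=0)
        Tc = T - T.mean(axis=0)
        num = Xc.T @ Tc
        den = np.outer(np.linalg.norm(Xc, axis=0), np.linalg.norm(Tc, axis=0))
        return num / den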


Correlation Loadings of process variables (X) and the quality of the cheese (Y) along (factor
1,factor 2)

Variables close to each other in the loadings plot will have a high positive correlation if the
two components explain a large portion of the variance of X. The same is true for variables in
the same quadrant lying close to a straight line through the origin. Variables in diagonally
opposed quadrants will have a tendency to be negatively correlated. For example, in the
figure above, variables dry matter and stickiness have a high positive correlation on factor 1
and factor 2, and they are negatively correlated to variables meltiness and glossiness.
Variables adhesiveness and stickiness have independent variations. Variables addition of
recycled dry matter and pH are very close to the center, they are not well described by
factor 1 and factor 2.
Note: Variables lying close to the center are poorly explained by the plotted factors
(or PCs). They cannot be interpreted in that plot.

8.9. DOE method reference


This document, which can be downloaded from our web site, details the algorithms used in
The Unscrambler® as well as some statistical measures and formulas.
http://www.camo.com/helpdocs/The_Unscrambler_Method_References.pdf

8.10. Bibliography
R. C. Bose and K. Kishen, On the problem of confounding in the general symmetrical factorial
design, Sankhya, 5, 21, (1940).
J.A. Cornell, Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data,
Second edition, John Wiley and Sons, New York, 1990.
G.H. Golub and C.F. Van Loan, Matrix Computations, Third edition, Johns Hopkins University
Press, 1996.
R.W. Kennard and L.A. Stone, Computer Aided Design of Experiments, Technometrics, 11(1),
137-148, (1969).
G.A. Lewis, D. Mathieu, and R. Phan-Tan-Luu, Pharmaceutical Experimental Design, Marcel Dekker, Inc., New York, 1999.


D.C. Montgomery, Design and Analysis of Experiments, Sixth edition, John Wiley & Sons,
New York, 2004.
R.H. Myers and D.C. Montgomery, Response Surface Methodology: Process and Product
Optimization using Designed Experiments, Second edition, Wiley, New York, 2002.
T. Naes and T. Isaksson, Selection of Samples for Calibration in Near-Infrared Spectroscopy.
Part I: General Principles Illustrated by Example, Appl. Spectrosc., 43(2), 328-335, (1989).
N.-K. Nguyen and G.F. Piepel, Computer-Generated Experimental Designs for Irregular-
Shaped Regions, QTQM, 2(2), 147-160, (2005).
R.E.A.C. Paley, On orthogonal matrices, J. Math. Phys., 12, 311–320, (1933).
R.L. Plackett and J.P. Burman, The Design of Optimum Multifactorial Experiments, Biometrika, 33, 305-325, (1946).
H. Scheffé, Experiments with Mixtures, J. Roy. Stat. Soc. Ser. B, 20, 344-366, (1958).

9. Validation
9.1. Validation
Model validation is performed for PCA or regression models to estimate how useful the
model will be for future observations. It returns the predictive ability of the model as
opposed to the model’s fit to the training data.

 Theory
 Dialog usage: Validation tab
 Dialog usage: Cross validation setup

9.2. Introduction to validation


Validating a model based on empirical data means checking how well the model will perform
on new data of the same kind that was used in developing the model. The validation of a
model estimates the uncertainty of future predictions that may be made with the model. If
the uncertainty is reasonably low, the model can be considered valid. However, regression
methods are also applied for modeling relations between blocks of data without any
objective of implementing the model in a process or in an instrument.
This chapter presents the purposes and principles of model validation in multivariate data
analysis.

 Principles of model validation


 What is validation?
 Test set validation
 How to select a test set
 Cross validation
 Leverage correction
 Validation results
 When to use which validation method
 Uncertainty testing with cross validation
 How does the uncertainty test work?
 Uncertainty of regression coefficients
 Uncertainty of loadings and loading weights
 Stability plots
 Easier to interpret important variables in models with many
components
 Remove non-significant variables for more robust models
 Application areas
 More details about the uncertainty test
 Model validation check list

9.2.1 Principles of model validation


To keep this discussion as general as possible, it is written with focus on the case of a
regression model. However, the same principles apply to PCA and other methods.
For the case of validation of PCA results:


 Disregard any mention of “Y-variables”.


 Disregard the sections on RMSEP.

9.2.2 What is validation?


Validating a model based on empirical data means checking how well the model will perform
on new data.
A regression model is often made to do predictions in the future. The validation of the
model estimates the uncertainty of such future predictions. If the uncertainty is reasonably
low, the model can be considered valid. However, regression methods are also applied for
modeling relations between blocks of data without any objective of implementing the model
in a process or in an instrument.
The same argument applies to a descriptive multivariate analysis such as PCA: If the
objective of the PCA is to extrapolate the correlations observed in the data table to future,
similar data, one should check whether they still apply for new data.
In The Unscrambler® three methods are available to estimate the model stability and
prediction ability: test set validation, cross validation and leverage correction.
Test set validation
Test set validation is based on testing the model on a subset of the available samples, which
will not be present in the computations of the model parameters.
The global data table is split into two subsets:
Calibration set
contains all samples used to compute the model components, using X- and Y-values;
Test set
contains all the remaining samples, for which X-values are fed into the model once a
new component has been computed. Their predicted Y-values are then compared to
the observed Y-values, yielding a prediction residual that can be used to compute a
validation residual variance or an RMSEP.

How to select a test set


A test set should contain 20-40% of the full data table. The calibration and test set should in
principle cover the same population of samples as well as possible. Samples which can be
considered to be replicate measurements should not be present in both the calibration and
test set.
There are several ways to select test sets:
Manual selection
is recommended since it gives one full control over the selection of a test set;
Random selection
is the simplest way to select a test set, but leaves the selection to the computer;
Group selection
makes it possible for the user to specify a set of samples as test set by selecting a
value or values for one of the variables. This should only be used under special
circumstances. An example of such a situation is a case where there are two true
replicates for each data point, and a separate variable indicates which replicate a
sample belongs to. In such a case, one can construct two groups according to this
variable and use one of the sets as test set. The group can be selected from one
chosen level of a category variable.
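For random selection, a sketch of how a 20-40% test set might be drawn (Python/NumPy assumed; names are illustrative, and replicates would need to be kept together before splitting):

    import numpy as np

    def random_split(n_samples, test_fraction=0.3, seed=42):
        # Shuffle the sample indices and set aside a fraction as test set.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(n_samples)
        n_test = int(round(test_fraction * n_samples))
        return idx[n_test:], idx[:n_test]   # calibration indices, test indices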


Cross validation
Though the objective is to have enough samples to put a reasonable amount aside as a test
set, this is not always possible due, for example, to the cost of samples or reference testing.
The best alternative to an independent test set for validation is to apply cross validation.
With cross validation, the same samples are used both for model estimation and testing. A
few samples are left out from the calibration data set and the model is calibrated on the
remaining data points. Then the values for the left-out samples are predicted and the
prediction residuals are computed. The process is repeated with another subset of the
calibration set, and so on until every object has been left out once; then all prediction
residuals are combined to compute the validation residual variance and RMSEP. It is of
utmost importance that the user is aware of which level of cross validation he wants to
validate. For example, if one physical sample is measured three times, and the objective is to
establish a model across samples, the three replicates must be held out in the same cross
validation segment. If the objective is to validate the repeated measurement, keep out one
replicate for all samples and generate three cross validation segments. The calibration
variance is always the same; it is the validation curve that is the important figure of merit
(and the RMSECV for regression models).
Several versions of the cross validation approach can be used:
Full cross validation
leaves out only one sample at a time; it is the original version of the method;
Segmented cross validation
leaves out a whole group of samples at a time. A typical example is when there are
systematic replicated measurements of one physical sample;
Test-set switch
divides the global data set into two subsets, each of which will be used alternatively
as calibration set and as test set;
Category variable
enables the user to validate across levels of category variables. This is useful for
evaluating how robust the model is across season, raw material supplier, location,
operator, etc.
When running a cross validation, one can get prediction diagnostics for the cross validation
segments. These are not available when full cross-validation is used. This option will provide
information on the validation results per each cross validation segment including RMSEP,
SEP, bias, slope, offset and correlation. The CV prediction diagnostics are added as a matrix
in the Validation folder of the PLSR model.
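Outside The Unscrambler®, the same scheme can be sketched with scikit-learn (this is an illustration only, not the product's implementation; X and y are NumPy arrays, and KFold can be replaced with LeaveOneOut for full cross validation):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import KFold

    def rmsecv(X, y, n_components=3, n_segments=10, seed=0):
        # Random segmented cross validation: each segment is left out
        # once, predicted from a model fitted on the remaining samples,
        # and the prediction residuals are pooled into an RMSECV.
        press = 0.0
        for train, test in KFold(n_segments, shuffle=True,
                                 random_state=seed).split(X):
            model = PLSRegression(n_components=n_components)
            model.fit(X[train], y[train])
            press += np.sum((y[test] - model.predict(X[test]).ravel()) ** 2)
        return np.sqrt(press / len(y))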
Leverage correction
Leverage correction is an approximation to cross validation that enables prediction residuals
to be estimated without actually performing any prediction. It is based on an equation that
is valid for MLR, but is only an approximation for PLSR and PCR.
According to this equation, the prediction residual equals

(calibration residual) divided by (1 - sample leverage)


All samples with low leverage (i.e. low influence on the model) will have estimated
prediction residuals very close to their calibration residuals (the leverage being close to
zero). For samples with high leverage, the calibration residual will be divided by a smaller
number, thus giving a much larger estimated prediction residual.
In the earlier days of multivariate modeling, when computer power was a fraction of what it
is today, this method was applied in the initial modeling. Nowadays, the user typically has


the possibility to perform cross validation for most data sets without much computation
time, making the leverage correction more of a relic of the old days.

9.2.3 Validation results


The simplest and most efficient measure of the uncertainty on future predictions is the
RMSEP. This value (one for each response) is a measure of the average uncertainty that can
be expected when predicting Y-values for new samples, expressed in the same units as the
Y-variable. The results of future predictions can then be presented as predicted values ±
2*RMSEP. This measure is valid provided that the new samples are similar to the ones used
for calibration, otherwise, the prediction error might be much higher.
Validation residual and explained variances are also computed in exactly the same way as
calibration variances, except that prediction residuals are used instead of calibration
residuals. Validation variances are used, as in PCA, to find the optimum number of model
components. When validation residual variance is minimal, RMSEP also is, and the model
with an optimal number of components will have the lowest expected prediction error.
RMSEP can be compared with the precision of the reference method. Usually one cannot
expect RMSEP to be lower than twice the precision.

9.2.4 When to use which validation method

Properties of test set validation


Test set validation can be used if there are many samples in the data table, for instance
more than 50.
It is the most “objective” validation method, since the test samples have no influence on the
calibration of the model.

Properties of cross validation


Cross validation represents a more efficient way of utilizing the samples if the number of
samples is small or moderate.
Segmented cross validation is the fast approach, but full cross validation is also often
applied. The suggested rule of thumb is to do random 10-segment cross validation if there is
no reason to divide the samples into subgroups.
When using segmented cross validation, make sure that all segments contain unique
information, i.e. samples which can be considered as replicates of each other should not be
present in different segments.
The major advantage of cross validation is that it allows for the jack-knifing approach on
which an Uncertainty Test is based. This provides significance testing for PLSR results. For
more information, see Uncertainty testing with cross validation.

Properties of leverage correction


Leverage correction for projection methods should only be used in an early stage of the
analysis if it is very important to obtain a quick answer. In general it gives more “optimistic”
results than the other validation methods and can sometimes be highly overoptimistic.
Sometimes, especially for small data tables, leverage correction can give apparently
reasonable results, while cross validation fails completely. In such cases, the “reasonable”
behavior of the leverage correction can be an artifact and cannot be trusted. The reason


why such cases are difficult is that there is too little information for estimation of a model
and each sample is “unique”. Therefore all known validation methods are doomed to fail.
For MLR, leverage correction is strictly equivalent to (and much faster than) full cross
validation.

9.2.5 Uncertainty testing with cross validation


Users of multivariate modeling methods are often uncertain when interpreting models.
Frequently asked questions are:

 Which variables are significant?


 Is the model stable?
 Why is there a problem?

Dr. Harald Martens has (re-)developed a generic method for uncertainty testing, which gives
a safer interpretation of models. The concept for uncertainty testing is based on cross
validation, jack-knifing and stability plots. This section introduces how the Uncertainty Test
works and shows how it can be used in The Unscrambler® through an application.
The following sections will present the method with a non-mathematical approach.
How does the uncertainty test work?
The test works with PLSR or PCA models with cross validation, choosing full or segmented cross validation as appropriate for the data. When the optimal number of components (factors) for PLSR has been chosen, tick Uncertainty test on the validation tab of The Unscrambler® modeling dialog box.
Under cross validation, a number of submodels are created. These submodels are based on all the samples that were not kept out in the cross validation segment. For every submodel, a set of model parameters (B-coefficients, loadings and loading weights) is calculated. The variation over these submodels is estimated so as to assess the stability of the results. In addition a total model is generated, based on all the samples. This is the model that will be used for interpretation.

Uncertainty of regression coefficients


For each variable one can calculate the difference between the B-coefficient Bi in a submodel and Btot for the total model. The Unscrambler® takes the sum of the squares of the differences over all submodels to get an expression of the variance of the Bi estimate for a variable.
With a t-test the significance of the estimate of Bi is calculated. Thus the resulting regression coefficients can be presented with uncertainty limits that correspond to 2 standard deviations under ideal conditions. Variables with uncertainty limits that do not cross the zero line are significant variables.

Uncertainty of loadings and loading weights


The same can be done for the other model parameters, but there is a rotational ambiguity in
the latent variables of bilinear models. To be able to compare all the submodels correctly,
they are rotated back to the main model before the uncertainty is estimated. Therefore one
can also get uncertainty limits for these parameters.


Stability plots
The results of all these calculations can also be visualized as stability plots in scores, loadings,
and loading weights plots. Stability plots can be used to understand the influence of specific
samples and variables on the model, and explain for example why a variable with a large
regression coefficient is not significant. This will be illustrated in the example that follows
(see Application Example).

Easier to interpret important variables in models with many components


Models with many components, three, four or more, may be difficult to interpret, especially
if the first factors (PCs) do not explain much of the variance.
For instance, if each of the first 4-5 PCs explains 15-20% of the variance, the factor 1/factor 2 plot is not enough to identify the most important variables.
In such cases, Martens’ automatic uncertainty test shows the significant variables in the
many-component model and interpretation is far easier.

Remove non-significant variables for more robust models


Variables that are non-significant display non-structured variation, i.e. noise. When these
variables are removed, the resulting model will be more stable and robust (i.e. less sensitive
to noise). Usually the prediction error decreases too.
Therefore, after identifying the significant variables by using the automatic marking based
on Martens’ test, use The Unscrambler® function Recalculate with Marked (right-click the equation node in the project navigator and select Recalculate - With Marked…) to make a new model and check the improvements.
Application areas
Spectroscopic calibration works better if noisy wavelengths are removed.
Some models (not spectroscopic) may be improved by adding interactions and squares of
the variables, and The Unscrambler® has a feature to do this automatically. However, many
of these terms are irrelevant. Apply Martens’ uncertainty test to identify and keep only the
significant ones.

9.2.6 More details about the uncertainty test


One of the critiques towards PLS regression has been the lack of significance of the model
parameters. Many years of experience have given “rules of thumb” of how to find which
variables are significant. However, these “rules of thumb” do not apply in all cases, and the
users still see the need for easy interpretation and guidance in these matters. The data
analysis must give reasonable protection against wishful thinking based on spurious effects
in the data. To be effective, such statistical validation must be easily understood by its user.
The modified jack-knifing method implemented in The Unscrambler® was invented by Harald Martens and published in Food Quality and Preference (Martens, 1999). Its details are presented hereafter.
Note: To understand this chapter requires a basic knowledge about the purposes
and principles of chemometrics. For those who have never worked with
multivariate data analysis before, it is strongly recommended that they begin
reading about it in the chapters about PCA and regression before proceeding with
this chapter.


See tutorial M to learn how to use the Uncertainty Test results in practice.

New assessment of model parameters


The cross validation assessment of the predictive validity is here extended to an uncertainty assessment of the individual model parameters: in each cross validation segment m = 1, 2, …, M a perturbed version of the structure model is obtained.
For more details refer to the method references chapter.
Each perturbed model is based on all the objects except one or more objects which were
kept ‘secret’ in this cross validation segment m.
If a perturbed segment model differs greatly from the common model, based on all the
objects, it means that the object(s) kept ‘secret’ in this cross validation segment have
significantly affected the common model. These left out objects caused some unique pattern
of variation in the model parameters. Thus, a plot of how the model parameters are
perturbed when different objects are kept ‘secret’ in the different cross validation segments
m=1,2,…,M shows the robustness of the common model against peculiarities in the data of
individual objects or segments of objects.
These perturbations may be inspected graphically in order to acquire a general impression of
the stability of the parameter estimates, and to identify dominating sources of model
instability. Furthermore, they may also be summarized to yield estimates of the
variance/covariance of the model parameters.
This is often called “jack-knifing”. It will here be used for two purposes:
 Elimination of useless variables, based on the linear parameters B;
 Stability assessment of the bilinear structure parameters T and P’, Q’.

Rotation of perturbed models


It is also important to be able to assess the bilinear score and loading parameters. However,
the bilinear structure model has a related rotational ambiguity in the latent variables that
needs to be corrected for in the jack-knifing. Only then is it meaningful to assess the
perturbations of scores Tm and loadings Pm and Qm in cross validation model segment # m.
Any invertible matrix Cm (A x A) satisfies the relationship

    Tm [Pm', Qm'] = (Tm Cm)(Cm^-1 [Pm', Qm'])

Therefore, the individual models m = 1, 2, …, M may be rotated, e.g. towards a common model:

    T(m) = Tm Cm,   [P', Q'](m) = Cm^-1 [Pm', Qm']
After rotation, the rotated parameters T(m) and [P’,Q’](m) may be compared to the corresponding parameters from the common model, T and [P’,Q’]. The perturbations may then be written as (T(m) - T)g or ([P’,Q’](m) - [P’,Q’])g for the scores and the loadings, respectively, where g is a scaling factor (here: g=1).
In the implemented code, an orthogonal Procrustes rotation is used. The same rotation
principle is also applied for the loading weights, W, where a separate rotation matrix is
computed for W. The uncertainty estimates for P, Q and W are estimated in the same
manner as for B below.
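A sketch of the rotation step using an orthogonal Procrustes solution (SciPy assumed; this illustrates the principle rather than the exact implementation, and assumes scores for the same objects are available for both models, e.g. after projecting all samples onto the submodel):

    import numpy as np
    from scipy.linalg import orthogonal_procrustes

    def rotate_submodel(T_m, P_m, T_common):
        # Find the orthogonal matrix C that best maps the submodel
        # scores onto the common scores (least squares). Because C is
        # orthogonal, C^-1 = C', so counter-rotating the loadings with
        # the same C leaves the product T P' unchanged.
        C, _ = orthogonal_procrustes(T_m, T_common)
        return T_m @ C, P_m @ C   # rotated scores and loadings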


Eliminating useless variables


On the basis of such jack-knife estimates of the uncertainty of the model parameters,
useless or unreliable X-or Y-variables may be eliminated automatically, in order to simplify
the final model and making it more reliable. The following part describes the cross validation
/ jack-knifing procedure:
When cross validation is applied in regression, the optimal rank A is determined based on
prediction of kept-out objects (samples) from the individual models. The approximate
uncertainty variance of the PCR and PLS regression coefficients B can be estimated by jack-knifing:

    s²(B) = g * Σm (Bm - B)²,   summed over the segments m = 1, 2, …, M

where

 s²(B) (K x J) = estimated uncertainty variance of B
 B (K x J) = the regression coefficients at the cross validated rank A using all the N objects
 Bm (K x J) = the regression coefficients at rank A using all objects except the object(s) left out in cross validation segment m
 g = scaling coefficient (here: g = (M-1)/M, where M is the number of cross validation segments).
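The formula above translates directly into code; a minimal sketch (NumPy assumed, names illustrative):

    import numpy as np

    def jackknife_variance(B, B_sub):
        # B: coefficients from the common model on all samples (K x J).
        # B_sub: stack of M submodel coefficients (M x K x J), one per
        # cross validation segment. Returns s2(B), elementwise.
        B_sub = np.asarray(B_sub, dtype=float)
        M = B_sub.shape[0]
        g = (M - 1) / M                      # scaling coefficient
        return g * np.sum((B_sub - np.asarray(B, dtype=float)) ** 2, axis=0)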

Significance testing
When the variances for B, P, Q, and W have been estimated, they can be utilized to find
significant parameters.
As a rough significance test, a Student’s t-test is performed for each element in B relative to
the square root of its estimated uncertainty variance s²(B), giving the significance level for
each parameter. In addition to the significance for B, which gives the overall significance for
a specific number of components, the significance levels for Q are useful to find in which
components the Y-variables are modeled with statistical relevance.
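The t-test can be sketched with SciPy (an illustration only; the choice of degrees of freedom here, one less than the number of cross validation segments, is an assumption):

    import numpy as np
    from scipy import stats

    def coefficient_p_values(B, s2_B, dof):
        # Two-sided t-test of H0: coefficient = 0, with
        # t = B / sqrt(s2(B)) for each element of B.
        t = np.asarray(B, dtype=float) / np.sqrt(np.asarray(s2_B, dtype=float))
        return 2.0 * stats.t.sf(np.abs(t), dof)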

9.2.7 Model validation check list


In The Unscrambler® validation is always automatically included in model computation.
However, what matters most is the choice of a relevant validation method for the particular
case (data set) being studied, and the configuration of its parameters.
The general validation procedure for PCA and regression is as follows:
Build a first model
Use segmented cross validation or leverage correction — the computations will go
faster. Allow for a large number of factors. Cross validation is recommended as it
also gives the ability to apply Martens’ Uncertainty Test.
Diagnose
the first model with respect to outliers, nonlinearities, any other abnormal behavior.
Take advantage of the variety of diagnostic tools available in The Unscrambler®: variance curves, automatic warnings, scores and loadings, stability plots, influence plot, X-Y relation outliers plot, etc.
Investigate and fix problems


Correct errors, apply transformations, etc.


Check improvements
by building a new model.
For regression only: validate intermediate model with a full cross validation, using
Uncertainty Testing, then do variable selection based on significant regression
coefficients.
Validate final model
with a proper method (test set or full cross validation).
Interpret final model
in terms of sample properties, variable relationships, etc. Check RMSEP for
regression models.

9.3. Validation tab


Menu options, dialogs, plots for validation.

9.3.1 Analysis and validation procedures


Validation is configured via the Validation tab for the respective analysis methods on the
Tasks - Analyze menu where one may choose a validation method and further specify
validation details.

Principal Component Analysis (PCA)


 Multiple Linear Regression (MLR)
 Principal Component Regression (PCR)
 Partial Least Squares regression (PLSR)
 Support Vector Machine Regression (SVMR)
 Support Vector Machine Classification (SVMC)
 Linear Discriminant Analysis (LDA)


9.3.2 Validation methods

The methods available for validation include:


Leverage Correction
A method used as a first-pass model check. It should not be used as a final model validation method, as it is an overly optimistic approximation.
Cross Validation
This method is used when either there are not enough samples available to make a
separate test set, or for simulating the effects of different validation test cases, e.g.
systematically leaving samples out vs. randomly leaving samples out, etc.
See Cross validation setup dialog usage
Test matrix
This is also known as Test Set Validation, and uses independent samples that have
not taken part in the calibration. This allows one to define either a new matrix
with the same number of variables, or a defined range within a single matrix, to be
used as an independent check of model performance. Both X- and Y-matrices need
to be defined in this case. This is the preferred method for validation and should
be aimed for.


Prediction diagnostics for CV segments


When running a cross validation with a PLSR or PCR regression, one can select to also
compute the prediction diagnostics for the cross validation segments by checking this
selection in the dialog. These are not available when full cross-validation is used. This option
will provide information on the validation results for each cross validation segment,
including RMSEP, SEP, bias, slope, offset and correlation. The CV prediction diagnostics are
added as a matrix in the Validation folder of the PLSR model.
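
As an illustration of what these diagnostics measure, the sketch below computes them for the kept-out predictions of one segment. The SEP convention used here (bias-corrected standard deviation of the prediction residuals) is a common chemometric definition and an assumption on our part, as is the function name.

import numpy as np

def cv_segment_diagnostics(y_ref, y_pred):
    # y_ref, y_pred: reference and predicted values for one CV segment.
    resid = y_pred - y_ref
    bias = resid.mean()
    sep = np.sqrt(np.sum((resid - bias) ** 2) / (resid.size - 1))  # bias-corrected
    slope, offset = np.polyfit(y_ref, y_pred, 1)  # predicted vs. reference line
    return {"RMSEP": np.sqrt(np.mean(resid ** 2)), "SEP": sep, "Bias": bias,
            "Slope": slope, "Offset": offset,
            "Correlation": np.corrcoef(y_ref, y_pred)[0, 1]}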

Significance testing
The Uncertainty Test option can be used to estimate the significance of variables when
using cross validation. During cross validation, the differences between the model
parameters estimated from all samples and those estimated with the samples in a particular
cross validation segment kept out are squared and summed. The significance (p-value) is
estimated by a t-test with the model parameter and its standard deviation as input. For
PCA, the p-values for loadings are returned per variable and component. For PLS regression,
p-values are returned for X-loadings, loading weights, Y-loadings and regression coefficients.
This is referred to as Martens’ Uncertainty Test.

Details: Test matrix setup


Multiple Linear Regression Test Matrix Setup

Use the Matrix drop-down list to select the test set, or use the Rows and Columns selector
drop-down lists to define a test set within a selected matrix, for both X and Y.


Discard Residuals option (PCA/PCR and PLSR models only)


In The Unscrambler® X all results from the modeling are stored to have the maximum
flexibility in plotting any result matrix in any way to make the right decision regarding
outliers, interpretation of the model, etc. However, as the size of data matrices becomes
large, the residual matrices use a lot of available memory and disk space, resulting in the size
of the Unscrambler project becoming large and sometimes unmanageable. To enable the
user to reduce the size of models, there is an option for PCA, PCR and PLSR to discard
residuals.
By discarding residuals, the matrices

 X-Residuals
 X-Validated Residuals
 Y-Residuals
 Y-Validated Residuals

are removed from the Validation folder in the analysis. These are 3-dimensional matrices
and use up a lot of memory. As an indication of the reduced size when enabling Discard
Residuals: a PLS regression model with 400 samples, 100 X-variables, 1 Y-variable and 10
factors will take up only 10% of the full model size. As the number of samples, X- and Y-
variables and factors increases, the reduced-size model will be even smaller as a percentage
of the full model.
Note: When the residuals are discarded, some of the plot options will not be available. All
plots where the data are taken from the X-Residuals or Y-Residuals matrices will not be
listed in the plot menus. The Plot - Residuals submenu then only allows Residuals and
Influence (with Q-residuals), and under Plot - Residuals - General only the Influence Plot and
Variance per Sample plots are available.
Plots available in the Residuals menu when Discard Residuals is selected

9.3.3 How to display validation results


First, one should display the PCA or regression results as plots in the Viewer. When the
results plots have been opened in the Viewer one can access the Plot and the View menus
to select the various results to plot and interpret. Alternatively, the plots can be selected
from the Plots folder in the model node in the project navigator.
For more on these plots see the following sections:

 Interpreting PCA plots


 Interpreting PLS regression plots

Details: Review the overview of results


Results - PCA
Display the PCA Overview results. From here additional results plots can be accessed
from the menu.


Results - Regression
Display the PLSR Overview results. From here additional results plots can be
accessed from the menu.
Results - All
Display results for any analysis.

Validation plots and statistics


Plot - Variances and RMSEP
Plot variance curves and estimated Prediction Error (PCA, PCR, PLSR).
Plot - Predicted vs. Reference
Display plot of predicted Y values against actual Y values.
Plot Statistics
Display statistics (including RMSEP) on Predicted vs. Reference plot by using the
toolbar short cut.
Plot - Residuals
Display various types of residual plots.
Validation
Toggle Validation results on/off on current plot.
Calibration
Toggle Calibration results on/off on current plot.
Outlier Warnings
Display general warnings issued during the analysis – among others related to
validation. The Outlier Warnings are in the project navigator under the analysis
node.

9.3.4 How to display uncertainty test results


First, one should display the PCA or regression results. When the results plots have been
opened in the Viewer one can access the Plot and the View menus to select the various
results to plot and interpret. Alternatively, the plots can be selected from the Plots folder in
the model node in the project navigator.
See tutorial M for a guide to uncertainty plots, variable selection and model stability.

Details: How to display uncertainty results


Hotelling’s T² Ellipse
Display Hotelling’s T² ellipse on a scores plot using the toolbar short cut.
Uncertainty Test - Stability Plot
Display stability plot for scores or loadings using the toolbar short cut.
Plot - Important Variables
Display uncertainty limits on regression coefficients plot.
Correlation Loadings
Change a loadings plot to display correlation loadings by using the toolbar short cut.


9.4. Validation tab – Cross validation setup…

The options available for Cross Validation include:


Full
Also known as Leave One Out (LOO) cross validation, this produces as many
calibration submodels as there are samples in the data set.
Random
One can choose the Number of segments the data set is to be divided into; the
cross validation procedure then randomly assigns samples to the segments, with the
number of samples to take defined in the Samples per segment drop-down list. The
number of segments may be adjusted, depending on the size of the sample set and
the number of samples to take per segment.
Custom
Allows the user to manually choose the Number of Segments and to define the
samples for each segment by manual entry or by using the Select button. The Select
button takes one to the Define Range dialog box.
Systematic
The Unscrambler® provides two options for systematic sample selection.
Systematic (112233)
Allows the user to define the Number of segments and the Samples per segment. In
this case, the first N samples are removed for segment 1, the next N for segment 2,
and so on successively for the number of segments defined. This is particularly
useful when replicate
measures exist and are ordered together in the data matrix, allowing one to see the
impact of removing a complete replicate from a data set.
Systematic (123123)
Allows the user to look at the impact of removing a single replicate from a group of
replicate measures to assess the precision of the developed model.


Category variable
Allows for model cross validation by removing samples belonging to defined
categories as a group. This is useful for evaluating how robust the model is across
season, raw material supplier, location, operator etc.
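
To clarify how the two systematic schemes distribute samples, here is a small illustrative sketch of segment-index generation (the function names are ours; this is not the program’s own code):

import numpy as np

def systematic_112233(n_samples, n_segments):
    # Consecutive blocks: replicates stored together stay in one segment.
    block = int(np.ceil(n_samples / n_segments))
    return np.repeat(np.arange(n_segments), block)[:n_samples]

def systematic_123123(n_samples, n_segments):
    # Interleaved: sample i goes to segment i mod n_segments, so one
    # replicate from each group lands in each segment.
    return np.arange(n_samples) % n_segments

# With 6 samples and 3 segments:
# systematic_112233(6, 3) -> [0 0 1 1 2 2]
# systematic_123123(6, 3) -> [0 1 2 0 1 2]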

10. Transform
10.1. Transformations
This section covers transformations available in The Unscrambler®. Transformation (or what
is often referred to as preprocessing) is applied to data to reduce or remove effects in data
which do not carry relevant information for the modeling of the system. Transformations
can reduce the complexity of a model (fewer factors needed) and improve the
interpretability of the data and models. Transformations include the application of
derivatives to spectral data to reduce baseline offset and tilt effects, while accentuating
small spectral differences. Scattering corrections are often applied to diffuse reflectance
spectra to reduce differences such as light scatter and path length effects. These
transforms can only be performed on numerical data. Some of them cannot be performed
when there are missing data (e.g. the Norris gap derivative).
The Unscrambler® provides the following transformations:

 Baseline correction
 Center_and_scale
 Compute general
 COW
 Deresolve
 Derivatives
 Detrending
 MSC/EMSC
 Interaction & Square Effects
 Missing_value_imputation
 Noise
 Normalize
 OSC
 Quantile_Normalize
 Reduce and average
 Smoothing
 Spectroscopic transformations
 SNV
 Transpose
 Weights
 Interpolation

More details regarding transformation methods available in The Unscrambler® are given in
the Method References.

10.2. Baseline Correction


10.2.1 Baseline correction
Baseline corrections are used to adjust the spectral offset by either adjusting the data to the
minimum point in the data, or by making a linear correction based on two user-defined
variables.


 How it works
 How to use it

10.2.2 About baseline corrections


Baseline corrections are used to adjust the spectral offset by either adjusting the data to the
minimum point in the data, or by making a linear correction based on two user-defined
variables. Baseline offset and Linear baseline correction are transformations used to correct
the baseline of samples, and are set in the dialog Tasks - Transform - Baseline. They are
mostly used for spectroscopic purposes. The two transformations can be executed
separately or together. In the combined case the Linear baseline correction will be run first,
then the Baseline offset.
Baseline offset
The formula for the baseline offset correction can be written as follows:

x_{corrected} = x - \min(X)

where x is a variable and X denotes all selected variables for this sample.
For each sample, the value of the lowest point in the spectrum is subtracted from all the
variables. The result of this is that the minimum value is set as 0 and the rest are positive
values. To use this consistently for a set of samples, make sure that the lowest point pertains
to the same variable for all samples.
Linear baseline correction
This transformation converts a sloped baseline into a horizontal baseline. The technique is
to select two variables which will define the new baseline. These are both set to 0, and the
rest of the variables are transformed accordingly by linear interpolation/extrapolation. It is
important to take precautions not to select baseline variables that coincide with
spectroscopic bands. As for the offset correction, make sure that the lowest points
pertain to the same variables for all samples.
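
A minimal sketch of both corrections, assuming spectra are stored row-wise in a NumPy array (the function names are illustrative, not The Unscrambler®’s API):

import numpy as np

def baseline_offset(X):
    # Subtract each spectrum's minimum so its lowest point becomes 0.
    return X - X.min(axis=1, keepdims=True)

def linear_baseline(X, i, j):
    # Force variables i and j to zero by subtracting, from every spectrum,
    # the straight line through (i, X[:, i]) and (j, X[:, j]); points
    # outside [i, j] are extrapolated.
    cols = np.arange(X.shape[1])
    slope = (X[:, j] - X[:, i]) / (j - i)
    line = X[:, [i]] + np.outer(slope, cols - i)
    return X - line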

10.2.3 Tasks – Transform – Baseline


Baseline offset and Linear baseline correction are transformations used to correct the
baseline of samples, and are set in the dialog Tasks - Transform - Baseline. They are mostly
used for spectroscopic purposes. The two transformations can be executed separately or
together, but at least one transformation method must be selected. In the combined case
the Linear baseline correction will be run first, then the Baseline offset.
Baseline correction cannot be carried out with non-numeric data, but can proceed if there
are missing values in the data.
Baseline dialog


Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined. This transform requires that only numerical
data be chosen.
After the range has been selected, select the method of the baseline transformation. A
method must be selected in order to carry out the transform. If Linear baseline correction is
selected, the two variables which define the new baseline must also be defined (Baseline
end variables). The first and last variables are selected by default. The first and last values
must be different for the transform to be performed. By checking the Preview result, one
can see the outcome of the data when the baseline transformation has been applied.
When the baseline transformation is completed, a new matrix is created in the project with
the word Baseline appended to the original matrix name. This name may be changed by
selecting the matrix, right clicking and selecting Rename from the menu.


Method options
Choose between two baseline transforms:
Baseline offset
The value of the lowest point in the spectrum is subtracted from all the
variables.
Linear baseline correction
Transform a sloped baseline into a horizontal baseline.
Do not select basis variables that have spectroscopic bands.
For the offset correction in both methods, make sure that the lowest points pertain to the
same variables for all samples.

10.3. Center and Scale


10.3.1 Center_and_scale
Centering is often the first stage of multivariate modeling. It involves subtracting an average
value from each variable in order to investigate the variation around the average rather than
the absolute values of the observations. Depending on the data and the problem at hand,
other values than the mean may also be subtracted.
Scaling involves division of each variable by its estimated spread, using either the standard
deviation or other measures of variability. Scaling is particularly important if the variables
differ a lot in their relative magnitudes, as variables with larger variance are given more
influence in regression analysis.

 How it works
 How to use it

10.3.2 About centering


Centering using the average value, also called mean centering, ensures that the resulting
data or model may be interpreted in terms of variation around the mean. This is often the
preferred pre-processing method, as it focuses on differences between observations rather
than their absolute values. As a robust alternative to the mean, the median may be used
instead. The median will more likely put the origin in the ‘center of mass’ in cases where
some of the variables may be distributed non-symmetrically.
In some situations, for instance for chromatographic concentrations, it may not make sense
to use negative values at all. Subtraction of the minimum value will ensure non-negativity
for all variables.
The alternative to data centering is to keep the raw data origin for all variables. This is only
advisable in the special case of a regression model where it is known in advance that the
linear relationship between X and Y is expected to pass through zero. In The Unscrambler®
one may apply mean, median, minimum as a pre-processing step, or choose not to center
the data.
Scaling involves dividing the (centered) variables by individual measures of dispersion. Using
the Standard Deviation as the scaling factor sets the variance for each variable to one, and is
usually applied after mean centering. Other scaling options available in The Unscrambler®
are Interquartile Range (IQR), Range, and Scaled Median Absolute Deviation (MAD). All these
are non-parametric methods and are often used in combination with median centering.


The range is the difference between the highest and lowest observation for each variable.
Such scaling results in a range of one for all variables. The presence of outliers in the data
will heavily influence this transformation, however. A safer alternative would be to use the
IQR, which is the difference between the observations at the 25th and 75th percentiles.
(There are several different ways of calculating the IQR, and The Unscrambler® utilizes the
‘Type 7’ algorithm of Hyndman and Fan, 1996.) As extreme observations are not included in
the IQR estimate, it is less likely to be affected by outliers.
The MAD is defined as the median of absolute differences between each observation in the
column and the median observation. This measure of population spread is little affected by
the tail behaviour of the distribution. For instance if a histogram of the data reveals a ‘wide’
peak where many observations fall in the tails, the standard deviation will be grossly inflated
while the MAD will remain a good estimate for the population’s spread. The MAD will
similarly be more robust for data with sharp peaks and long tails. The Scaled MAD is the
MAD multiplied by the factor 1.4826. This makes the estimate similar to the standard
deviation when many observations are collected from a normal distribution.
Centering and/or scaling data may be useful to study the data in various plots, or prior to
running Tasks – Analyze – Descriptive Statistics. It may for example allow one to compare
the distributions of variables of different scales within one plot. In subsequent analysis,
these scaled variables will contribute similarly to the model regardless of measurement unit.
These transformations are all column-oriented: the transformed values are computed as a
function of the values in the same column of the data table.
Notes:
1. Mean centering is included as a default option in the relevant analysis dialogs, and the computations are done as a first stage of the analysis. Scaling using the standard deviation may be applied in the Weights tabs of most analysis dialogs.
2. Centering and scaling are also available as a transformation to be performed manually from the Editor (Tasks – Transform – Center_and_scale). Use this dialog to perform one of the available non-parametric centering and scaling options.
A special type of standardization is the Spherize function (Martinez and Martinez, 2005). It is
the multivariate equivalent of the univariate scaling methods described above. The
transformed variables have a p-dimensional mean of 0 and a covariance matrix given by the
identity matrix. It is also known in some application domains as the whitening
transformation since the resulting matrix has the signal properties of “white noise”.
More details regarding center and scale methods are given in the Method References.
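
For reference, here is a minimal sketch of these column-oriented operations on a samples-by-variables NumPy array; the option names mirror the dialog, but the implementation itself is an illustrative assumption. Note that np.percentile’s default linear interpolation corresponds to the ‘Type 7’ estimate mentioned above.

import numpy as np

def center_and_scale(X, center="mean", scale="sdev"):
    # Column-wise centering and scaling of a samples x variables array.
    centers = {"mean": np.mean, "median": np.median, "min": np.min}
    Xc = X - centers[center](X, axis=0)
    if scale == "sdev":
        s = X.std(axis=0, ddof=1)
    elif scale == "iqr":
        q75, q25 = np.percentile(X, [75, 25], axis=0)
        s = q75 - q25
    elif scale == "range":
        s = X.max(axis=0) - X.min(axis=0)
    elif scale == "mad":  # scaled MAD, comparable to the standard deviation
        s = 1.4826 * np.median(np.abs(X - np.median(X, axis=0)), axis=0)
    else:
        return Xc
    return Xc / s

def spherize(X):
    # Whitening: output has zero mean and an identity covariance matrix.
    Xc = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return (Xc @ eigvec) / np.sqrt(eigval)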

10.3.3 Tasks – Transform – Center and Scale


Centering and/or scaling of data may be useful to study the data in various plots, or prior to
running Tasks - Analyze – Descriptive Statistics. Centering and scaling are widely applied in
order to transform the data to comparable levels and scale units prior to analysis.
These transformations are column-oriented: the transformed values are computed as a
function of the values in the same column of the table. They cannot be applied to non-
numeric data.
Center and Scale


Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. The rows and columns to be included in the computation must be
specified as well. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
In the Transformation frame, three options are available:
Center
within the selected sample and variable scope. This subtracts a value, e.g. the
variable mean, from each observation in each column. There is an option to center
by the mean, median, or minimum value, or not use any centering. Choose the
desired option for centering from the Center drop-down list.
Dialog showing centering options

Scale
within the selected sample and variable scope. This divides each data value by an
estimate of the column spread. Options available are the Standard deviation
(SDev), Interquartile range (IQR), Range, or Scaled median absolute deviation (MAD)
scaling, or not to use any scaling. Choose the desired option for scaling from the
Scale drop-down list as shown below.
Dialog showing scaling options


Spherize
This is a multivariate equivalent of univariate center and scaling, useful in
exploratory data analysis.
The Center and Scaling options can be selected either separately or in combination. Often
mean centering is combined with SDev scaling (autoscaling). Due to their non-parametric
nature, the Range, IQR, and Scaled MAD transformations are often used after median centering.
The type of centering and scaling is selected from the drop-down list.
By checking the Preview result box, a line plot of the observations before and after scaling is
displayed.
Notes:
1. To display the mean and standard deviation of the variables in a data set, use menu option Tasks – Analyze – Descriptive Statistics.
2. The Center and Scale transformations are supported in autopretreatments, meaning they can be automatically applied when new data are analysed (classification, prediction and sample projection analyses), using a model which was developed with this transformation applied. See next note.
3. The principal component analysis (PCA) and Regression dialog boxes include options for centering and scaling variables directly at the analysis stage. It is recommended to perform centering and scaling at the model-building stage, especially if the model will be used for future prediction or classification. The same centering and scaling options will be applied as when the model was built.
4. Centering and/or scaling the data more than once will not affect the structure of the data any further. Consequently, if the Center and Scale transformation has been applied to the data from the Tasks – Transform – Center and Scale dialog, the data may harmlessly be recentered and/or rescaled at the modeling stage (PCA or regression).

10.4. Compute General


10.4.1 Compute general
The transform Compute_General can be used to make general mathematical
transformations to samples and/or variables.

 How it works


 How to use it

10.4.2 About compute general


One can use the transform Compute_General to make computations on selected samples,
variables or a matrix range using basic elementary and trigonometric functions.
Additional functions for computation on the entire data matrix are available with the Matrix
calculator: Tools - Matrix Calculator… has options for linear algebra, matrix operations and
reshaping of data.

10.4.3 Tasks – Transform – Compute_General…


This opens the Compute dialog, where one can perform arithmetic and more advanced
computations on the whole data matrix or on selected rows (samples) or columns
(variables). This option also helps in transforming variables. Computations cannot be
performed with non-numeric data.
Compute_General

Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected.
If new data ranges need to be defined, choose Define to open the Define Range dialog
where new ranges can be defined. One must also define if the selection is for the variables
or samples.
There are three ways of defining the mathematical expression to be applied:

 Type the mathematical expression directly in the Expression box,


 Use the drop-down list, which provides the most recently used expressions (if this is
the first time using the Compute_General dialog, no formerly used expressions will
show in the drop-down list).
 Click on the Build Expression button. This opens the Build Expression dialog wherein
a mathematical expression can be defined using the ready-made functions and
operators allowed in The Unscrambler®.

Syntax
The Expression field accepts a formula of the type: X=LN(ABS(X))-e or S4=(S1*S2)+S3 or
V1=V1/2+SIN(V8/V9) where S stands for sample, V stands for variable, and the number is
the sample or variable number in the Editor. To build general expressions that are not
related to a particular sample or variable, use X. X stands for the whole matrix defined by the
variable and sample set chosen in Scope. RH and CH are row and column headers,
respectively.
Note: The formula cannot contain mixed references to samples (S), variables (V)
and X.

Available functions and operators


The constants, operators, and functions that are allowed in computations are listed below:
Table: Operators, functions and constants allowed in computations
Name Description

+ Addition

- Subtraction

* Multiplication

/ Division

= Equals to

( Left Parenthesis

) Right Parenthesis

ABS(X) Absolute value of X

SQRT(X) Square root of X

POW(X,n) Power of X, with exponent n: X^n

LOG(X) Briggs logarithm (base 10)

LN(X) Natural logarithm (base e)

EXP(X) Exponential(X) = e^X

MIN(X1,X2,…) Minimum value

MAX(X1,X2,…) Maximum value


SIGN(X) -1 if X < 0, 1 if X >= 0

ANINT(X) Nearest integer (rounding)

AINT(X) Integer part of X

COS(X) Cosine

SIN(X) Sine

TAN(X) Tangent

COSH(X) Hyperbolic cosine

SINH(X) Hyperbolic sine

TANH(X) Hyperbolic tangent

ACOS(X) Inverse cosine (radians)

ASIN(X) Inverse sine (radians)

ATAN(X) Inverse tangent (radians)

ATAN2(X1,X2) Four quadrant inverse tangent

PI The constant π (≈ 3.14159)

e The constant e (≈ 2.71828)
”X” can denote both samples and variables in this table.
Function names are case insensitive, meaning that log, Log, and LOG will give the same
result. In the above functions a comma is used as list separator, however this depends on
the regional settings of the computer. Different list separators may be valid for different
countries, e.g. POW(X;n).
Notes: A commonly used expression is X=log(X). This expression generally
transforms skewed variable distributions into more symmetrical ones. Use a
histogram plot or Tasks – Analyze – Descriptive Statistics… in order to check
whether the skewness was improved or deteriorated after applying the
transformation.
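
The effect of the log transform on skewness can also be checked quickly outside the application; a small illustrative example:

import numpy as np
from scipy.stats import skew

values = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed data
print(skew(values))           # clearly positive
print(skew(np.log(values)))   # close to zero after X = log(X)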

Build expression dialog


In the Expression Builder dialog a mathematical expression can be built using the ready-
made functions and operators allowed in The Unscrambler®.
Expression Builder


The upper text field shows the expression as it is being built. In Display, choose whether the
text field should show the sample/variable Numbers or the sample/variable Names. In the
Insert field, choose to insert specific samples, specific variables or (general expression). After
choosing the Sample or the Variable options, the drop-down list is enabled and one can
select the relevant object(s) from the list. The available samples or variables are only those
belonging to the Scope formerly selected in the Compute dialog.
The Arithmetic Functions, Trigonometric Functions, Other Functions, and Numbers fields
offer buttons that are used following the same principle as for a calculator.
Click Clear to clear the expression. Click Undo to undo the latest insertion in the expression
text. Click OK to return to the Compute_General dialog.

10.5. COW
10.5.1 Correlation Optimized Warping (COW)
COW is a method for aligning data where the signals exhibit shifts in their position along the
x axis. COW cannot be performed with non-numeric data, or when there are missing data.
 How it works
 How to use it


10.5.2 About correlation optimized warping


COW is a method for aligning data where the signals exhibit shifts in their position along the
x axis. COW can be used to eliminate shift-related artifacts to measurement data by
correcting a sample vector to a reference. COW has applicability to data where there can be
a poor alignment of the x axis from sample to sample, as can be the case with
chromatographic data, Raman spectra and NMR spectra. One example of such data is
chromatography where peak positions change between samples due to changes in mobile
phase or deterioration of the column. Another example is in NMR spectroscopy where
matrix effects and the chemistry itself induce position changes in the chemical shifts.
The method works by finding the optimal correlation between defined segments of the data
for which there is a shift in position. The result of this procedure is one shift value per
segment. These are then interpolated to give a so-called shift-vector for all data points, and
a mapping function (move-back operator) which moves the samples back to the reference
profile’s position. The present implementation handles data of equal length only. To cope
with various lengths, it is suggested to pad the data table out with zeros before performing
the shift alignment. Alignment is done by allowing small changes in the segment length on
the sample vector, and those segment lengths being shifted (“warped”) to optimize the
correlation between the sample and the reference vector. Slack refers to the maximum
increase or decrease in sample segment length, and provides flexibility in optimizing the
correlation between the samples and reference.
The reference sample is the sample in the data which is used as the reference, and should be
a representative sample with the main peaks present.
Segment length is defined by the user, and is the size of the data segment that data are
divided into before searching for the optimal correlation. It must be smaller than the
number of variables divided by 4.
The slack is the flexibility in adjusting the segment size to give the optimal fit to the
reference data, and is the allowed change in position to be searched for. Slack is <= segment.
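
The segment/shift/interpolation idea can be sketched in code as below. Note that this is a deliberately simplified, rigid-shift illustration: real COW uses dynamic programming and also stretches or compresses segments ("warping") within the slack, which this toy version does not do. All names are illustrative.

import numpy as np

def align_by_segments(sample, reference, seg_len, slack):
    # For each reference segment, find the integer shift (within +/- slack)
    # that maximizes the correlation with the sample, then interpolate the
    # per-segment shifts into a shift vector and map the sample back onto
    # the reference grid.
    n = len(reference)
    centers, shifts = [], []
    for start in range(0, n - seg_len + 1, seg_len):
        ref_seg = reference[start:start + seg_len]
        best_s, best_c = 0, -np.inf
        for s in range(-slack, slack + 1):
            lo, hi = start + s, start + s + seg_len
            if lo < 0 or hi > n:
                continue
            c = np.corrcoef(ref_seg, sample[lo:hi])[0, 1]
            if c > best_c:
                best_c, best_s = c, s
        centers.append(start + seg_len // 2)
        shifts.append(best_s)
    shift_vec = np.interp(np.arange(n), centers, shifts)  # one shift per point
    return np.interp(np.arange(n) + shift_vec, np.arange(n), sample)
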
The figures below illustrate the result of applying the COW preprocessing to chromatograms.
Raw chromatograms


Chromatograms after COW preprocessing (segment = 100, slack = 20)

More details regarding COW are given in the Method References.

10.5.3 Tasks – Transform – Correlation Optimized Warping…


Correlation Optimized Warping (COW) is a row-oriented transformation for aligning data
where the signals exhibit shifts in their position along the x axis. It is applicable to
data sets where alignment differences arise from the


measurement (such as in chromatography retention times, chemical shifts in NMR data, and
Raman spectral x axis alignment).
COW cannot be performed with non-numeric data, or when there are missing data. The
minimum number of variables required to use COW is 20.
COW Dialog

Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
Three inputs must be specified in the dialog:
 Reference Sample: Select which sample in the data table is to act as the reference
profile.
This is a typical sample (e.g. near the origin in a scores plot) with preferably the main
peaks present. If the COW will be applied to new data at some later point in time,
include the reference sample in a new data table as well.
 Segment Size. This is the length of the segment which the data are divided into
before searching for the optimal correlation. It must be smaller than the number of
variables divided by 4.
 Slack: Slack represents the allowed change in position to be searched for and has the
value <= Segment Size.
By selecting the preview result, one can see how the transformed data will look.
COW dialog with preview


When the COW transformation is completed, a new matrix is created in the project with the
word COW appended to the original matrix name. This name may be changed by selecting
the matrix, right clicking and selecting Rename from the menu.

10.6. Deresolve
10.6.1 Deresolve
The Deresolve function can be used to change the apparent resolution of an instrument,
changing a high resolution spectrum to low resolution. It may also be used for noise
reduction.

 How it works
 How to use it


10.6.2 About deresolve


On occasion, one may wish to standardize a lower resolution instrument to a higher
resolution instrument. This may be the case when transferring data from one instrument to
another with the intention of calibration model transfer. In such an instance, it may be more
effective to mathematically lower the resolution of the higher resolution instrument prior to
forming the transfer model. The Deresolve function can be used to change the apparent
resolution of an instrument, changing a high resolution spectrum to a lower resolution by
downsampling the signal. Deresolve may also be used for noise reduction.
Deresolve uses a triangle kernel filter for smoothing to convolve spectra with a resolution
function in order to make it appear as if it had been taken on a lower resolution instrument.
The inputs are the high resolution spectra to be deresolved and the number of channels to
convolve them over. The output is the estimate of the lower resolution spectra with the
original number of variables maintained.
More details regarding the Deresolve method are given in the Method References.
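
A minimal sketch of the idea, assuming spectra stored row-wise; the kernel construction and edge padding are illustrative choices, not necessarily the program’s exact filter:

import numpy as np

def deresolve(X, channels):
    # Smooth each spectrum (row of X) with a normalized triangular kernel
    # of the given width; the number of variables is preserved.
    kernel = np.bartlett(channels + 2)[1:-1]   # triangle without zero ends
    kernel /= kernel.sum()
    pad = kernel.size // 2
    Xp = np.pad(X, ((0, 0), (pad, pad)), mode="edge")
    out = np.apply_along_axis(np.convolve, 1, Xp, kernel, mode="same")
    return out[:, pad:Xp.shape[1] - pad]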

10.6.3 Tasks – Transform – Deresolve


The Deresolve function can be used to change the apparent resolution of an instrument,
changing a high resolution spectrum to low resolution. It may also be used for noise
reduction. It is a row-oriented transformation; that is to say the contents of a cell are likely
to be influenced by its horizontal neighbors. This transformation cannot be applied to non-
numeric data.
A new data matrix with the deresolved data will be created in the project where the original
data matrix resides.
Deresolve


Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined. There must be at least 4 variables to
perform the deresolve transformation.
In the Parameters field, choose the number of channels to use for convolution. The
minimum number of channels that can be used is 2, and the maximum is (#variables/2).
By selecting the preview result, one can see how the transformed data will look.
When the deresolve transformation is completed, a new matrix is created in the project with
the word Deresolve appended to the original matrix name. This name may be changed by
selecting the matrix, right clicking and selecting Rename from the menu.

10.7. Derivatives
10.7.1 Derivatives
Differentiation, i.e. computing derivatives of various orders, is a classical technique widely
used for spectroscopic applications. Some of the information “hidden” in a spectrum may be
more easily revealed when working on a first or second derivative. It is a row-oriented


transformation; that is to say the contents of a cell are likely to be influenced by its
horizontal neighbors.
Derivatives cannot be performed with non-numeric data or where there are missing data.
Like smoothing, this transformation is relevant for variables which are themselves a function
of some underlying variable, e.g. absorbance at various wavelengths. Computing a derivative
is also called differentiation. Derivatives can help to resolve overlapped bands, but also lead
to a lower signal in the transformed data.
The segment parameter of Gap-Segment derivatives is an interval over which data values are
averaged.
In smoothing, X-values are averaged over one segment symmetrically surrounding a data
point. The raw value on this point is replaced by the average over the segment, thus creating
a smoothing effect.
In Gap-Segment derivatives (designed by Karl Norris), X-values are averaged separately over
one segment on each side of the data point. The two segments are separated by a gap. The
raw value on this point is replaced by the difference of the two averages, thus creating an
estimate of the derivative on this point.
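
A schematic sketch of this averaging-across-a-gap idea is shown below; index conventions and normalization differ between implementations, so treat it as illustrative only.

import numpy as np

def gap_segment_derivative(x, gap, seg):
    # First-derivative estimate: average x over a segment of seg points on
    # each side of point i, the two segments separated by gap points, and
    # divide the difference of the averages by the gap.
    half = gap // 2
    d = np.full(x.size, np.nan)
    for i in range(half + seg - 1, x.size - half - seg + 1):
        left = x[i - half - seg + 1 : i - half + 1].mean()
        right = x[i + half : i + half + seg].mean()
        d[i] = (right - left) / gap
    return d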
The Unscrambler® offers three methods for computing derivatives, as described in the
following sections:

 Gap_Derivatives
 Gap-Segment
 Savitzky-Golay

10.7.2 About derivative methods and applications


Derivatives are applied to correct for baseline effects in spectra for the purpose of removing
nonchemical effects and creating robust calibration models. Derivatives may also aid in
resolving overlapped bands which can provide a better understanding of the data,
emphasizing small spectral variations not evident in the raw data.
The first derivative
The first derivative of a spectrum is simply a measure of the slope of the spectral curve at
every point. The slope of the curve is not affected by purely additive baseline offsets in the
spectrum, and thus the first derivative is a very effective method for removing such offsets.
However, peaks in raw spectra usually become zero-crossing points in first derivative
spectra, which can be difficult to interpret.
Example:
To illustrate how derivatives work, Gaussian curves of various offsets and intensities are
used to demonstrate the principles. These curves are shown below.
Gaussian curves of various offsets and intensities


Mathematically, a derivative is the slope of the curve. If a purely additive offset (as in the
curves above) is present, this offset is a constant. Therefore, under derivatization the
constant reduces to zero, meaning that all spectra should have a mean of zero and the
spectral profiles are changed to the slopes of the curves.
The next figure displays the first order derivative for the Gaussian curves.
First derivative of Gaussian curves

There are two points to note here:

 The baseline offset has been removed under derivatization.

 The peak maximum in the raw data has now become a zero point in the derivative.

The zero point can be explained by the fact that at a peak maximum (or minimum), the derivative is
zero.
In complex spectra, there may be many zero points and while it is adequate to transform a
purely linear offset with a first derivative, interpretation of zero points becomes difficult.
The second derivative may be useful in this instance.


The second derivative


The second derivative is a measure of the change in the slope of the curve. In addition to
removing pure additive offset, it is not affected by any linear “tilt” that may exist in the data,
and is therefore a very effective method for removing both the baseline offset and slope
from a spectrum. The second derivative can help resolve nearby peaks and sharpen spectral
features. Peaks in raw spectra change sign and turn to negative peaks with lobes on either
side in the second derivative.
Example:
Returning to the Gaussian curves, the second derivative can be conceptualized as the slope
of the first derivative. Therefore, at the zero point in the first derivative the slope of the
first derivative is at its extreme, which in this case results in the original raw data maxima
becoming minima in the second derivative. The figure below demonstrates this.
Second derivative of Gaussian curves

Another important feature of the second derivative is that the intensities of the original
curves can be seen in the second derivatives in order of intensity. This is an extremely useful
property, especially when performing quantitative analyses such as regression analysis.
Third and fourth derivatives
Third and fourth derivatives are available in The Unscrambler® although they are not as
popular as first and second derivatives. They may reveal phenomena which do not appear
clearly when using lower-order derivatives and can be helpful in understanding the spectral
data. Prudent use of the fourth derivative has been shown to emphasize small variations
caused by temperature changes and compositional changes. Higher-order derivatives do
significantly reduce the signal in the transformed data.
Savitzky-Golay vs. Gap-Segment
The Savitzky-Golay method and the Gap-Segment method use information from a localized
segment of the spectrum to calculate the derivative at a particular wavelength rather than
the difference between adjacent data points. In most cases, this avoids the problem of noise
enhancement from the simple difference method and may actually apply some smoothing to
the data.
The Gap-Segment method requires gap size and smoothing segment size (usually measured
in wavelength span, but sometimes in terms of data points). The Savitzky-Golay method uses
a convolution function, and thus the number of data points (segment) in the function must


be specified. If the segment is too small, the result may be no better than using the simple
difference method. If it is too large, the derivative will not represent the local behavior of
the spectrum (especially in the case of Gap-Segment), and it will smooth out too much of the
important information (especially in the case of Savitzky-Golay). Although there have been
many studies done on the appropriate size of the spectral segment to use, a good general
rule is to use a sufficient number of points to cover the full width at half height of the largest
absorbing band in the spectrum. One can also find optimum segment sizes by checking
model accuracy and robustness under different segment size settings.
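
One can explore this segment-size trade-off outside the application with SciPy’s Savitzky-Golay filter; the synthetic spectrum and the window lengths below are arbitrary examples:

import numpy as np
from scipy.signal import savgol_filter

# Synthetic spectrum: a Gaussian band on a baseline offset, plus noise
grid = np.linspace(0, 100, 500)
x = 0.2 + np.exp(-(grid - 50) ** 2 / 20) + np.random.normal(0, 0.005, grid.size)

# Window length must be odd and larger than the polynomial order
d_small = savgol_filter(x, window_length=3, polyorder=2, deriv=1)   # stays noisy
d_large = savgol_filter(x, window_length=31, polyorder=2, deriv=1)  # may oversmooth
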
Example:
Using data from an FT-NIR spectrometer, the next figure shows what happens when the
selected segment size is too small (Savitzky-Golay derivative, 3-point segment and second
order polynomial). Noisy features remain in the spectra when the segment size is too
small.
Derivatized data with a segment size set too small

In the figure that follows, the selected segment size is too large (Savitzky-Golay derivative,
31-point segment and second order polynomial). One can see that some relevant
information has been smoothed out.
Derivatized data with a segment size set too large


The main disadvantage of using derivative preprocessing is that the resulting spectra can be
difficult to interpret. However, this can also be advantageous, especially when a user is
looking for both specificity and selectivity of particular constituents in complex sample
matrices.
More details regarding Derivative transforms are given in the Method References.

10.7.3 Gap Derivatives


Gap derivative
This is a special case of Gap-Segment Derivative with segment size = 1 and therefore does
not smooth the data. This derivative requires that the data all be numeric and that there are
at least five variables for each sample, and no missing values.
Properties of Gap-segment and Gap derivatives
Karl Norris has developed a powerful approach for the pretreatment of near-infrared
spectral data in which two distinct items are involved. The first is the Gap Derivative, the
second is the “Norris Regression”, which may or may not use the derivatives. The Gap
Derivative is applied to improve the rejection of interfering absorbers. The “Norris
Regression” is a regression procedure to reduce the impact of varying baselines, variable path
lengths, and high stray light among samples due to scatter effects.
In Gap-Segment derivatives (designed by Karl Norris), X-values are averaged separately over
one segment on each side of the data point. The two segments are separated by a gap. The
raw value on this point is replaced by the difference of the two averages, thus creating an
estimate of the derivative on this point.
Tasks – Transform – Derivative – Gap Derivative
This method computes derivatives of up