User Manual

Version 1.0

CAMO SOFTWARE AS

Nedre Vollgate 8, N-0158, Oslo, NORWAY

Tel: (47) 223 963 00

Fax: (47) 223 963 22

E-mail: info@camo.com | www.camo.com

The Unscrambler X v10.3

Copyright

All intellectual property rights in this work belong to CAMO Software AS. The information contained in this work must not be reproduced or distributed to others in any form or by any means, electronic or mechanical, for any purpose, without the express written permission of CAMO Software AS. This document is provided on the understanding that its use will be confined to the officers of the organization (whose name is stated on the front cover of this document) who acquired it and that no part of its contents will be disclosed to third parties without prior written consent of CAMO Software AS.

Copyright © 2014 CAMO Software AS. All Rights Reserved

All other trademarks and copyrights mentioned in the document are acknowledged and belong to their respective owners.

Disclaimer

This document has been reviewed and quality assured for accuracy of content. Succeeding versions of this document are subject to change without notice and will reflect changes made to subsequent software versions.

It is the sole responsibility of the organization using this document to ensure all tests meet the criteria specified in the test scripts. CAMO Software takes no responsibility for the end use of the product as this requires the performance of suitable feasibility trials and performance qualification to ensure the software is fit for purpose for its intended use.

Table of Contents

1. Welcome to The Unscrambler® X ................................................................................. 1

2. Support Resources........................................................................................................ 3

3. Overview ...................................................................................................................... 5

3.1.1 Multivariate analysis simplified ............................................................................................... 5

3.1.2 Make well-designed experimental plans ................................................................................. 5

3.1.3 Reformat, transform and plot data .......................................................................................... 6

3.1.4 Study variations among one group of variables ....................................................................... 6

3.1.5 Study relations between two groups of variables.................................................................... 7

3.1.6 Validate multivariate models with uncertainty testing ............................................................ 7

3.1.7 Estimate new, unknown response values ................................................................................ 8

3.1.8 Classify unknown samples ....................................................................................................... 8

3.1.9 Reveal groups of samples ........................................................................................................ 8

3.2. Principles of classification ......................................................................................... 8

3.2.1 Purposes of classification ......................................................................................................... 9

3.2.2 Classification methods ............................................................................................................. 9

3.2.3 Steps in SIMCA classification .................................................................................................. 11

3.2.4 Classifying new samples......................................................................................................... 11

3.2.5 Outcomes of a classification .................................................................................................. 11

3.2.6 Classification based on a regression model ........................................................................... 12

3.3. How to use help ...................................................................................................... 12

3.3.1 How to open the help documentation................................................................................... 12

3.3.2 Browsing the contents ........................................................................................................... 12

3.3.3 Searching the contents .......................................................................................................... 12

3.3.4 Typographic cues ................................................................................................................... 13

3.4. Principles of regression ........................................................................................... 13

3.4.1 What is regression?................................................................................................................ 13

3.4.2 Multiple Linear Regression (MLR) .......................................................................................... 15

3.4.3 Principal Component Regression (PCR) ................................................................................. 16

3.4.4 Partial Least Squares Regression (PLSR)................................................................................. 16

3.4.5 L-PLS Regression .................................................................................................................... 17

3.4.6 Support Vector Machine Regression (SVMR)......................................................................... 18

3.4.7 Calibration, validation and related samples........................................................................... 18

3.4.8 Main results of regression ..................................................................................................... 19

3.4.9 Making the right choice with regression methods................................................................. 21

3.4.10 How to interpret regression results ....................................................................................... 22

3.4.11 Guidelines for calibration of spectroscopic data ................................................................... 24

4.2. Getting to know the user interface......................................................................... 30

4.2.1 Application window ............................................................................................................... 30

4.2.2 Workspace ............................................................................................................................. 31

4.2.3 Project navigator.................................................................................................................... 32

4.2.4 Project information ................................................................................................................ 32

4.2.5 Page tab bar ........................................................................................................................... 32

4.2.6 The menu bar ......................................................................................................................... 32

4.2.7 The toolbar ............................................................................................................................ 33

4.2.8 The status bar ........................................................................................................................ 33

4.2.9 Dialogs ................................................................................................................................... 33

4.2.10 Setting up the user environment ........................................................................................... 34

4.2.11 Getting help ........................................................................................................................... 34

4.3. Matrix editor basics ................................................................................................ 34

4.3.1 What is a matrix? ................................................................................................................... 35

4.3.2 Adding data matrices ............................................................................................................. 36

4.3.3 Altering data tables ................................................................................................................ 36

4.3.4 Using ranges........................................................................................................................... 37

4.3.5 Data types .............................................................................................................................. 38

4.3.6 Keeping versions of data ........................................................................................................ 39

4.3.7 Saving data ............................................................................................................................. 39

4.4. Using the project navigator .................................................................................... 40

4.4.1 About the project navigator................................................................................................... 40

4.4.2 Create a project ..................................................................................................................... 40

4.4.3 Items in a project ................................................................................................................... 41

4.4.4 Browse a project .................................................................................................................... 41

4.4.5 Managing items in a project .................................................................................................. 41

4.5. Register pretreatment ............................................................................................ 44

4.6. Save model for prediction, classification ................................................................ 44

4.7. Set Alarms ............................................................................................................... 46

4.7.1 Prediction: .............................................................................................................................. 46

4.7.2 Classification: ......................................................................................................................... 47

4.7.3 Projection:.............................................................................................................................. 47

4.7.4 Input: ..................................................................................................................................... 48

4.8. Set Components ...................................................................................................... 49

4.9. Set Bias and Slope ................................................................................................... 49

4.9.1 Algorithm ............................................................................................................................... 50

4.9.2 Menu option .......................................................................................................................... 50

4.9.3 Usage ..................................................................................................................................... 50

4.10.1 Non-Compliance mode .......................................................................................................... 51

4.10.2 Compliance Mode .................................................................................................................. 53

4.11. File ........................................................................................................................... 54

4.11.1 File menu ............................................................................................................................... 54

4.11.2 File – Print… ........................................................................................................................... 55

4.12. Edit .......................................................................................................................... 57

4.12.1 Edit menu ............................................................................................................................... 57

4.12.2 Edit – Change data type – Category… .................................................................................... 65

4.12.3 Edit – Category Property… ..................................................................................................... 70

4.12.4 Edit – Fill................................................................................................................................. 71

4.12.5 Edit – Find and Replace .......................................................................................................... 72

4.12.6 Edit – Go To… ......................................................................................................................... 74

4.12.7 Edit – Insert – Category Variable… ......................................................................................... 75

4.12.8 Edit – Define Range… ............................................................................................................. 77

4.12.9 Edit – Reverse… ...................................................................................................................... 85

4.12.10 Edit – Group rows…................................................................................................................ 85

4.12.11 Edit – Sample grouping… ....................................................................................................... 86

4.12.12 Scalar and Vector ................................................................................................................... 87

4.12.13 Split Text Variable .................................................................................................................. 88

4.13. View ........................................................................................................................ 90

4.13.1 View menu ............................................................................................................................. 90

4.14. Insert ....................................................................................................................... 93

4.14.1 Insert menu ............................................................................................................................ 93

4.14.2 Insert – Duplicate Matrix… ..................................................................................................... 94

4.14.3 Insert – Data Matrix… ............................................................................................................ 95

4.14.4 Insert – Custom Layout… ....................................................................................................... 96

4.14.5 Insert – Data Compiler… ...................................................................................................... 100

4.15. Plot ........................................................................................................................ 103

4.15.1 Plot menu............................................................................................................................. 103

4.16. Tasks...................................................................................................................... 104

4.16.1 Tasks menu .......................................................................................................................... 104

4.17. Tools ...................................................................................................................... 106

4.17.1 Tools menu .......................................................................................................................... 106

4.17.2 Tools – Audit Trail… ............................................................................................................. 107

4.17.3 Tools – Matrix Calculator… .................................................................................................. 108

4.17.4 Tools – Options… ................................................................................................................. 111

4.17.5 Tools – Report… ................................................................................................................... 113

4.18. Help ....................................................................................................................... 115

4.18.1 Help menu ........................................................................................................................... 115

4.18.2 Help – Modify License… ....................................................................................................... 116

4.18.3 Help – User Setup… .............................................................................................................. 117

5.1.1 Supported data formats ....................................................................................................... 119

5.1.2 How to import data.............................................................................................................. 121

5.2. ASCII ...................................................................................................................... 122

5.2.1 ASCII (CSV, text) ................................................................................................................... 122

5.2.2 About ASCII, CSV and tabular text files ................................................................................ 122

5.2.3 File – Import Data – ASCII… .................................................................................................. 123

5.3. BRIMROSE ............................................................................................................. 125

5.3.1 Brimrose............................................................................................................................... 125

5.3.2 About Brimrose data files .................................................................................................... 126

5.3.3 File – Import Data – Brimrose… ........................................................................................... 126

5.4. Bruker.................................................................................................................... 128

5.4.1 OPUS from Bruker ................................................................................................................ 128

5.4.2 About Bruker (OPUS) instrument files ................................................................................. 129

5.4.3 File – Import Data – OPUS… ................................................................................................. 129

5.5. DataBase ............................................................................................................... 132

5.5.1 Databases............................................................................................................................. 132

5.5.2 About supported database interfaces ................................................................................. 133

5.5.3 File – Import Data – Database… ........................................................................................... 133

5.6. DeltaNu ................................................................................................................. 139

5.6.1 DeltaNu ................................................................................................................................ 139

5.6.2 About DeltaNu data files ...................................................................................................... 139

5.6.3 File – Import Data – DeltaNu… ............................................................................................. 139

5.7. Excel ...................................................................................................................... 142

5.7.1 Microsoft Excel spreadsheets .............................................................................................. 142

5.7.2 About Microsoft Excel spreadsheets ................................................................................... 143

5.7.3 File – Import Data – Excel… .................................................................................................. 143

5.8. GRAMS .................................................................................................................. 144

5.8.1 GRAMS from Thermo Scientific ........................................................................................... 144

5.8.2 About the GRAMS data format ............................................................................................ 144

5.8.3 File – Import Data – GRAMS… .............................................................................................. 145

5.9. GuidedWave.......................................................................................................... 148

5.9.1 CLASS-PA & SpectrOn from Guided Wave ........................................................................... 148

5.9.2 About Guided Wave CLASS-PA & SpectrOn data files .......................................................... 149

5.9.3 File – Import Data – CLASS-PA & SpectrOn… ....................................................................... 149

5.10. Import Interpolate ................................................................................................ 152

5.10.1 Interpolate functionality ...................................................................................................... 152

5.11. Indico..................................................................................................................... 155

5.11.1 Indico ................................................................................................................................... 155

5.11.2 About ASD Inc. Indico data files ........................................................................................... 155

5.12. JcampDX ................................................................................................................ 159

5.12.1 JCAMP-DX ............................................................................................................................ 159

5.12.2 About the JCAMP-DX file format.......................................................................................... 160

5.12.3 File – Import Data – JCAMP-DX… ......................................................................................... 160

5.12.4 JCAMP-DX file format reference .......................................................................................... 163

5.13. Konica_Minolta ..................................................................................................... 165

5.13.1 Konica_Minolta .................................................................................................................... 165

5.13.2 About Konica_Minolta data files .......................................................................................... 166

5.13.3 File – Import Data – Konica_Minolta… ................................................................................. 166

5.14. Matlab ................................................................................................................... 167

5.14.1 Matlab.................................................................................................................................. 167

5.14.2 About Matlab data files ....................................................................................................... 168

5.14.3 File – Import Data – Matlab… .............................................................................................. 168

5.15. MyInstrument ....................................................................................................... 169

5.15.1 MyInstrument ...................................................................................................................... 169

5.15.2 About the MyInstrument standard ...................................................................................... 169

5.15.3 File – Import Data – MyInstrument… ................................................................................... 170

5.16. NetCDF .................................................................................................................. 173

5.16.1 NetCDF ................................................................................................................................. 173

5.16.2 About the NetCDF file format .............................................................................................. 173

5.16.3 File – Import Data – NetCDF… .............................................................................................. 173

5.17. NSAS ...................................................................................................................... 174

5.17.1 NSAS..................................................................................................................................... 174

5.17.2 About the NSAS file format .................................................................................................. 174

5.17.3 File – Import Data – NSAS… ................................................................................................. 175

5.17.4 NSAS file format reference .................................................................................................. 177

5.18. Omnic .................................................................................................................... 179

5.18.1 OMNIC ................................................................................................................................. 179

5.18.2 About Thermo OMNIC data files .......................................................................................... 180

5.18.3 File – Import Data – OMNIC… .............................................................................................. 180

5.19. OPC........................................................................................................................ 183

5.19.1 OPC protocol ........................................................................................................................ 183

5.19.2 About the OPC protocol ....................................................................................................... 183

5.19.3 File – Import Data – OPC… ................................................................................................... 184

5.20. OSISoftPI ............................................................................................................... 185

5.20.1 PI .......................................................................................................................................... 185

5.20.2 About supported interfaces ................................................................................................. 185

5.20.3 File – Import Data – PI… ....................................................................................................... 185

5.21. PerkinElmer ........................................................................................................... 189

5.21.1 PerkinElmer.......................................................................................................................... 189

5.21.2 About PerkinElmer instrument files ..................................................................................... 190

5.22. PertenDX ............................................................................................................... 193

5.22.1 Perten-DX ............................................................................................................................. 193

5.22.2 About the Perten Instruments JCAMP-DX file format.......................................................... 194

5.22.3 File – Import Data – Perten-DX… ......................................................................................... 194

5.22.4 Perten-DX file format reference .......................................................................................... 197

5.23. RapID ..................................................................................................................... 199

5.23.1 RapID.................................................................................................................................... 199

5.23.2 About RapID data files ......................................................................................................... 199

5.23.3 File – Import Data – rap-ID… ................................................................................................ 199

5.24. U5Data .................................................................................................................. 202

5.24.1 U5 Data ................................................................................................................................ 202

5.24.2 About Unscrambler® 5.0 data files..................................................................................... 202

5.24.3 File – Import Data – U5 Data… ............................................................................................. 203

5.25. UnscFileReader ..................................................................................................... 204

5.25.1 The Unscrambler® 9.8 .......................................................................................................... 204

5.25.2 About The Unscrambler® 9.8 file formats ............................................................................ 205

5.25.3 File – Import Data – Unscrambler… ..................................................................................... 205

5.25.4 The Unscrambler® 9.x file format reference ........................................................................ 205

5.26. UnscramblerX........................................................................................................ 206

5.26.1 The Unscrambler® X ............................................................................................................. 206

5.26.2 About The Unscrambler® X file format ................................................................................ 207

5.26.3 File – Import Data – Unscrambler X… .................................................................................. 207

5.27. Varian .................................................................................................................... 208

5.27.1 Varian ................................................................................................................................... 208

5.27.2 About Varian data files ........................................................................................................ 208

5.27.3 File – Import Data – Varian… ............................................................................................... 209

5.28. VisioTec ................................................................................................................. 212

5.28.1 VisioTec ................................................................................................................................ 212

5.28.2 About VisioTec data files ...................................................................................................... 213

5.28.3 File – Import Data – VisioTec…............................................................................................. 213

6.1.1 Supported data formats ....................................................................................................... 215

6.1.2 How to export data .............................................................................................................. 215

6.2. AMO ...................................................................................................................... 215

6.2.1 Export models to ASCII......................................................................................................... 215

6.2.2 About the ASCII-MOD file format ........................................................................................ 215

6.2.3 File – Export – ASCII-MOD… ................................................................................................. 215

6.2.4 ASCII-MOD file format reference ......................................................................................... 216

6.3. ASCII ...................................................................................................................... 221

6.3.2 File – Export – ASCII…........................................................................................................... 222

6.4. DeltaNu ................................................................................................................. 223

6.4.1 DeltaNu ................................................................................................................................ 223

6.4.2 File – Export – DeltaNu… ...................................................................................................... 223

6.5. JCampDX ............................................................................................................... 224

6.5.1 JCAMP-DX export ................................................................................................................. 224

6.5.2 File – Export – JCAMP-DX… .................................................................................................. 224

6.6. Matlab ................................................................................................................... 226

6.6.1 Matlab export ...................................................................................................................... 226

6.6.2 File – Export – Matlab… ....................................................................................................... 226

6.7. NetCDF .................................................................................................................. 227

6.7.1 NetCDF export ..................................................................................................................... 227

6.7.2 File – Export – NetCDF…....................................................................................................... 227

6.8. UnscFileWriter ...................................................................................................... 229

6.8.1 Export models to The Unscrambler® v9.8 ............................................................................ 229

6.8.2 About The Unscrambler® file format ................................................................................... 229

6.8.3 File – Export – Unscrambler… .............................................................................................. 230

7. Plots.......................................................................................................................... 231

7.2. Bar plot.................................................................................................................. 232

7.3. Scatter plot............................................................................................................ 234

7.4. 3-D scatter plot ..................................................................................................... 236

7.5. Matrix plot ............................................................................................................ 243

7.6. Histogram plot ...................................................................................................... 247

7.7. Normal probability plot......................................................................................... 248

7.8. Multiple scatter plot ............................................................................................. 250

7.9. Tabular summary plots ......................................................................................... 252

7.10. Special plots .......................................................................................................... 253

7.11. Plotting results from several matrices .................................................................. 255

7.11.1 Why is it useful? ................................................................................................................... 255

7.11.2 How to do it? ....................................................................................................................... 257

7.12. Annotating plots ................................................................................................... 258

7.13. Create Range Menu .............................................................................................. 259

7.14. Plotting: The smart way to display numbers ........................................................ 260

7.14.1 Various plots ........................................................................................................................ 260

7.14.2 Customizing plots ................................................................................................................. 261

7.14.3 Actions on a plot .................................................................................................................. 261

7.14.4 Plots in analysis .................................................................................................................... 261

7.16. Marking ................................................................................................................. 266

7.16.1 How to mark samples/variables .......................................................................................... 266

7.16.2 How to create a new range of samples or variables from the marked items ...................... 268

7.16.3 Recalculate with modifications on marked samples or/and variables ................................. 269

7.17. Point details .......................................................................................................... 270

7.18. Formatting of plots ............................................................................................... 271

7.19. Formatting of 3D plots .......................................................................................... 274

7.20. Plot – Response Surface… ..................................................................................... 278

7.21. Saving and copying a plot ..................................................................................... 279

7.21.1 Saving a plot ......................................................................................................................... 279

7.21.2 Copying plots ....................................................................................................................... 280

7.22. Scope: Select plot range........................................................................................ 282

7.23. Edit – Select Evenly Distributed Samples .............................................................. 283

7.24. Zooming and Rescaling ......................................................................................... 284

7.24.1 General options ................................................................................................................... 284

7.24.2 Special options ..................................................................................................................... 285

7.24.3 Resize plots .......................................................................................................................... 285

8.2. Introduction to Design of Experiments (DoE) ....................................................... 287

8.2.1 DoE basics ............................................................................................................................ 288

8.2.2 Investigation stages and design objectives .......................................................................... 289

8.2.3 Available designs in The Unscrambler® ............................................................................... 291

8.2.4 Types of variables in experimental design ........................................................................... 293

8.2.5 Designs for unconstrained screening situations .................................................................. 295

8.2.6 Designs for unconstrained optimization situations ............................................................. 299

8.2.7 Designs for constrained situations ....................................................................................... 302

8.2.8 Types of samples in experimental design ............................................................................ 315

8.2.9 Sample order in a design...................................................................................................... 319

8.2.10 Blocking................................................................................................................................ 319

8.2.11 Extending a design ............................................................................................................... 321

8.2.12 Building an efficient experimental strategy ......................................................................... 322

8.2.13 Analyze results from designed experiments ........................................................................ 323

8.2.14 Advanced topics for unconstrained situations ..................................................................... 330

8.2.15 Advanced topics for constrained situations ......................................................................... 331

8.3. Insert – Create design… ........................................................................................ 334

8.3.1 General buttons ................................................................................................................... 334

8.3.2 Start ..................................................................................................................................... 334

8.3.3 Define Variables ................................................................................................................... 336

8.3.5 Design Details ...................................................................................................................... 341

8.3.6 Additional Experiments ........................................................................................................ 352

8.3.7 Randomization ..................................................................................................................... 355

8.3.8 Summary .............................................................................................................................. 357

8.3.9 Design Table ......................................................................................................................... 357

8.4. Tools – Modify/Extend Design… ........................................................................... 358

8.4.1 To remember ....................................................................................................................... 359

8.5. Tasks – Analyze – Analyze Design Matrix… ........................................................... 360

8.5.1 Order of the runs ................................................................................................................. 361

8.5.2 Level values .......................................................................................................................... 361

8.6. DoE analysis .......................................................................................................... 361

8.7. Analysis results...................................................................................................... 365

8.8. Interpreting design analysis plots ......................................................................... 366

8.8.1 Accessing plots ..................................................................................................................... 367

8.8.2 Available plots for Classical DoE Analysis (Scheffe and MLR) .............................................. 367

8.8.3 Available plots for Partial Least Squares Regression (DoE PLS) ........................................... 387

8.9. DOE method reference ......................................................................................... 394

8.10. Bibliography .......................................................................................................... 394

9.2. Introduction to validation ..................................................................................... 397

9.2.1 Principles of model validation .............................................................................................. 397

9.2.2 What is validation? .............................................................................................................. 398

9.2.3 Validation results ................................................................................................................. 400

9.2.4 When to use which validation method ................................................................................ 400

9.2.5 Uncertainty testing with cross validation ............................................................................ 401

9.2.6 More details about the uncertainty test .............................................................................. 402

9.2.7 Model validation check list .................................................................................................. 404

9.3. Validation tab ........................................................................................................ 405

9.3.1 Analysis and validation procedures ..................................................................................... 405

9.3.2 Validation methods .............................................................................................................. 406

9.3.3 How to display validation results ......................................................................................... 408

9.3.4 How to display uncertainty test results ............................................................................... 409

9.4. Validation tab – Cross validation setup….............................................................. 410

10.2. Baseline Correction ............................................................................................... 413

10.2.1 Baseline correction .............................................................................................................. 413

10.2.3 Tasks – Transform – Baseline ............................................................................................... 414

10.3. Center and Scale ................................................................................................... 416

10.3.1 Center_and_scale ................................................................................................................ 416

10.3.2 About centering ................................................................................................................... 416

10.3.3 Tasks – Transform – Center and Scale ................................................................................. 417

10.4. Compute General .................................................................................................. 419

10.4.1 Compute general ................................................................................................................. 419

10.4.2 About compute general ....................................................................................................... 420

10.4.3 Tasks – Transform – Compute_General… ............................................................................ 420

10.5. COW ...................................................................................................................... 423

10.5.1 Correlation Optimized Warping (COW) ............................................................................... 423

10.5.2 About correlation optimized warping .................................................................................. 424

10.5.3 Tasks – Transform – Correlation Optimized Warping… ....................................................... 425

10.6. Deresolv ................................................................................................................ 427

10.6.1 Deresolve ............................................................................................................................. 427

10.6.2 About deresolve ................................................................................................................... 428

10.6.3 Tasks – Transform – Deresolve ............................................................................................ 428

10.7. Derivatives ............................................................................................................ 429

10.7.1 Derivatives ........................................................................................................................... 429

10.7.2 About derivative methods and applications ........................................................................ 430

10.7.3 Gap Derivatives .................................................................................................................... 434

10.7.4 Gap Segment........................................................................................................................ 436

10.7.5 Savitzky Golay ...................................................................................................................... 438

10.8. Detrend ................................................................................................................. 440

10.8.1 Detrending ........................................................................................................................... 440

10.8.2 About detrending ................................................................................................................. 440

10.8.3 Tasks – Transform – Detrending .......................................................................................... 442

10.9. EMSC ..................................................................................................................... 443

10.9.1 MSC/EMSC ........................................................................................................................... 443

10.9.2 About multiplicative scatter correction ............................................................................... 444

10.9.3 Tasks – Transform – MSC/EMSC .......................................................................................... 445

10.10. Interaction and Square Effects .................................................................... 451

10.10.1 Interaction_and_Square_Effects ......................................................................................... 451

10.10.2 About interactions and square effects ................................................................................. 451

10.10.3 Tasks – Transform – Interactions and Square Effects .......................................................... 452

10.11. Interpolate ................................................................................................... 453

10.11.1 Interpolation ........................................................................................................................ 453

10.11.2 About interpolation ............................................................................................................. 453

10.11.3 Tasks – Transform – Interpolate .......................................................................................... 454

10.12. Missing Value Imputation ............................................................................ 455

10.12.1 Fill missing values................................................................................................................. 455

10.12.3 Tasks – Transform – Fill Missing… ........................................................................................ 456

10.13. Noise ............................................................................................................ 457

10.13.1 Noise .................................................................................................................................... 457

10.13.2 About adding noise .............................................................................................................. 457

10.13.3 Tasks – Transform – Noise ................................................................................................... 457

10.14. Normalize ..................................................................................................... 459

10.14.1 Normalization ...................................................................................................................... 459

10.14.2 About normalization ............................................................................................................ 460

10.14.3 Tasks – Transform – Normalize ............................................................................................ 462

10.15. OSC ............................................................................................................... 466

10.15.1 Orthogonal Signal Correction (OSC) ..................................................................................... 466

10.15.2 About Orthogonal Signal Correction (OSC) .......................................................................... 466

10.15.3 Tasks – Transform – OSC… ................................................................................................... 467

10.16. Quantile Normalize ...................................................................................... 470

10.16.1 Quantile Normalization ........................................................................................................ 470

10.16.2 About quantile normalization .............................................................................................. 470

10.16.3 Tasks – Transform – Quantile_Normalize ............................................................................ 471

10.17. Reduce Average ........................................................................................... 472

10.17.1 Reduce (Average) ................................................................................................................. 472

10.17.2 About averaging ................................................................................................................... 473

10.17.3 Tasks – Transform – Reduce (Average)… ............................................................................. 473

10.18. Smoothing .................................................................................................... 474

10.18.1 Smoothing methods ............................................................................................................. 474

10.18.2 Comparison of moving average and Gaussian filters ........................................................... 474

10.18.3 Gaussian Filter ..................................................................................................................... 475

10.18.4 Median Filter........................................................................................................................ 476

10.18.5 Moving Average ................................................................................................................... 478

10.18.6 Robust LOWESS.................................................................................................................... 479

10.18.7 Savitzky Golay ...................................................................................................................... 481

10.19. Spectroscopic Transformations ................................................................... 483

10.19.1 Spectroscopic transformations ............................................................................................ 483

10.19.2 About spectroscopic transformations.................................................................................. 484

10.19.3 Tasks – Transform – Spectroscopic… ................................................................................... 484

10.20. Standard Normal Variate ............................................................................. 486

10.20.1 Standard_Normal_Variate (SNV) ......................................................................................... 486

10.20.2 About Standard_Normal_Variate (SNV) .............................................................................. 487

10.20.3 Tasks – Transform – SNV ...................................................................................................... 487

10.21. Transpose..................................................................................................... 488

10.21.1 Transposition ....................................................................................................................... 488

10.21.2 Tasks – Transform – Transpose ............................................................................................ 488

10.22. Weighted Direct Standardization ................................................................ 489

10.22.2 About Weighted_Direct_Standardization ............................................................................ 489

10.22.3 Tasks – Transform – Weighted_Direct_Standardization ...................................................... 489

10.23. Weights ........................................................................................................ 489

10.23.1 Weights ................................................................................................................................ 489

10.23.2 About weighting and scaling ................................................................................................ 490

10.23.3 Tasks – Transform – Weights… ............................................................................................ 492

11.2. Introduction to descriptive statistics .................................................................... 497

11.2.1 Purposes .............................................................................................................................. 497

11.2.2 The normal distribution ....................................................................................................... 498

11.2.3 Measures of central tendency ............................................................................................. 499

11.2.4 Measures of dispersion ........................................................................................................ 499

11.3. Tasks – Analyze – Descriptive Statistics… ............................................................. 501

11.3.1 Data input ............................................................................................................................ 501

11.3.2 Some important tips regarding the data input dialog .......................................................... 501

11.4. Interpreting descriptive statistics plots ................................................................ 502

11.4.1 Predefined descriptive statistics plots ................................................................................. 502

11.4.2 Plots accessible from the Statistics plot menu ..................................................................... 504

11.5. Descriptive statistics method reference ............................................................... 508

11.6. Bibliography .......................................................................................................... 508

12.2. Introduction to statistical tests ............................................................................. 509

12.2.1 What are inferential statistics? ............................................................................................ 510

12.2.2 Hypothesis testing ............................................................................................................... 510

12.2.3 Tests for normality of data................................................................................................... 512

12.2.4 Tests for the equivalence of variances ................................................................................. 513

12.2.5 Tests for the comparison of means ..................................................................................... 515

12.2.6 Comparison of categorical data ........................................................................................... 517

12.3. Tasks – Analyze – Statistical Tests… ...................................................................... 518

12.4. Interpreting plots for statistical tests ................................................................... 523

12.4.1 Predefined plots for statistical tests .................................................................................... 524

12.5. Statistical tests method reference ........................................................................ 526

12.6. Bibliography .......................................................................................................... 526

13.2. Introduction to Principal Component Analysis (PCA) ........................................... 527

13.2.1 Exploratory data analysis ..................................................................................................... 528

13.2.2 What is PCA? ........................................................................................................................ 528

13.2.3 Purposes of PCA ................................................................................................................... 528

13.2.4 How PCA works in short ....................................................................................................... 529

13.2.5 Main result outputs of PCA .................................................................................................. 533

13.2.6 How to interpret PCA results ............................................................................................... 536

13.2.7 PCA rotation ......................................................................................................................... 539

13.2.8 PCA algorithm options ......................................................................................................... 542

13.3. Tasks – Analyze – Principal Component Analysis… ............................................... 542

13.3.1 Model Inputs tab ................................................................................................................. 543

13.3.2 Weights tab .......................................................................................................................... 544

13.3.3 Validation tab....................................................................................................................... 546

13.3.4 Rotation tab ......................................................................................................................... 547

13.3.5 Algorithm tab ....................................................................................................................... 548

13.3.6 Autopretreatment tab ......................................................................................................... 550

13.3.7 Set Alarms tab ...................................................................................................................... 551

13.3.8 Warning Limits tab ............................................................................................................... 551

13.4. Interpreting PCA plots........................................................................................... 553

13.4.1 Predefined PCA plots ........................................................................................................... 554

13.4.2 Plots accessible from the PCA plot menu ............................................................................ 571

13.5. PCA method reference .......................................................................................... 582

13.6. Bibliography .......................................................................................................... 582

14.2. Introduction to Multiple Linear Regression (MLR) ............................................... 583

14.2.1 Basics ................................................................................................................................... 583

14.2.2 Principles behind Multiple Linear Regression (MLR)............................................................ 585

14.2.3 Interpreting the results of MLR ............................................................................................ 586

14.2.4 More details about regression methods .............................................................................. 589

14.3. Tasks – Analyze – Multiple Linear Regression ...................................................... 589

14.3.1 Model Inputs tab ................................................................................................................. 589

14.3.2 Validation tab....................................................................................................................... 591

14.3.3 Autopretreatments tab ........................................................................................................ 594

14.3.4 Set Alarms tab ...................................................................................................................... 594

14.3.5 Warning Limits tab ............................................................................................................... 595

14.3.6 Variable weighting in MLR ................................................................................................... 596

14.4. Interpreting MLR plots .......................................................................................... 597

14.4.1 Predefined MLR plots........................................................................................................... 598

14.4.2 Plots accessible from the MLR Plot menu ............................................................................ 610

14.6. Bibliography .......................................................................................................... 616

15.2. Introduction to Principal Component Regression (PCR) ....................................... 617

15.2.1 Basics ................................................................................................................................... 617

15.2.2 Interpreting the results of a Principal Component Regression (PCR) .................................. 618

15.2.3 Some more theory of PCR .................................................................................................... 620

15.2.4 PCR algorithm options ......................................................................................................... 620

15.3. Tasks – Analyze – Principal Component Regression ............................................. 621

15.3.1 Model Inputs tab ................................................................................................................. 621

15.3.2 Weights tabs ........................................................................................................................ 623

15.3.3 Validation tab....................................................................................................................... 625

15.3.4 Algorithm tab ....................................................................................................................... 626

15.3.5 Autopretreatment tab ......................................................................................................... 628

15.3.6 Set Alarms tab ...................................................................................................................... 629

15.3.7 Warning Limits tab ............................................................................................................... 629

15.4. Interpreting PCR plots ........................................................................................... 631

15.4.1 Predefined PCR plots ........................................................................................................... 634

15.4.2 Plots accessible from the PCR plot menu ............................................................................. 658

15.5. PCR method reference .......................................................................................... 673

15.6. Bibliography .......................................................................................................... 673

16.2. Introduction to Partial Least Squares Regression (PLSR) ...................................... 675

16.2.1 Basics ................................................................................................................................... 675

16.2.2 Interpreting the results of a PLS regression ......................................................................... 676

16.2.3 Scores and loadings (in general) .......................................................................................... 677

16.2.4 More details about regression methods .............................................................................. 680

16.2.5 PLSR algorithm options ........................................................................................................ 681

16.3. Tasks – Analyze – Partial Least Squares Regression ............................................. 682

16.3.1 Model Inputs tab ................................................................................................................. 682

16.3.2 Weights tabs ........................................................................................................................ 684

16.3.3 Validation tab....................................................................................................................... 686

16.3.4 Algorithm tab ....................................................................................................................... 687

16.3.5 Autopretreatments tab ........................................................................................................ 689

16.3.6 Set Alarms tab ...................................................................................................................... 690

16.3.7 Warning Limits tab ............................................................................................................... 690

16.4. Interpreting PLS plots............................................................................................ 692

16.4.2 Plots accessible from the PLS plot menu ............................................................................. 726

16.5. PLS method reference........................................................................................... 742

16.6. Bibliography .......................................................................................................... 742

17.2. Introduction to L-PLS ............................................................................................ 743

17.2.1 Basics ................................................................................................................................... 743

17.2.2 The L-PLS model ................................................................................................................... 744

17.2.3 L-PLS by example ................................................................................................................. 745

17.3. Tasks – Analyze – L-PLS Regression ...................................................................... 746

17.3.1 Model inputs ........................................................................................................................ 746

17.3.2 X weights .............................................................................................................................. 748

17.3.3 Y weights .............................................................................................................................. 750

17.3.4 Z weights .............................................................................................................................. 750

17.4. Interpreting L-PLS plots......................................................................................... 751

17.4.1 Predefined L-PLS plots ......................................................................................................... 751

17.4.2 Plots accessible from the L-PLS menu .................................................................................. 758

17.5. L-PLS method reference ........................................................................................ 758

17.6. Bibliography .......................................................................................................... 758

18.2. Introduction to Support Vector Machine (SVM) Regression (SVMR) ................... 759

18.2.1 Principles of Support Vector Machine (SVM) regression ..................................................... 759

18.2.2 What is SVM regression? ..................................................................................................... 760

18.2.3 Data suitable for SVM Regression ........................................................................................ 761

18.2.4 Main results of SVM regression ........................................................................................... 762

18.2.5 More details about SVM Regression .................................................................................... 763

18.3. Tasks – Analyze – Support Vector Machine Regression… ..................................... 763

18.3.1 Model input ......................................................................................................................... 763

18.3.2 Options ................................................................................................................................ 765

18.3.3 Grid Search........................................................................................................................... 768

18.3.4 Weights ................................................................................................................................ 768

18.3.5 Validation ............................................................................................................................. 770

18.4. Tasks – Predict – SVR Prediction… ........................................................................ 772

18.5. Interpreting SVM Regression results .................................................................... 773

18.5.1 Support vectors.................................................................................................................... 774

18.5.2 Parameters........................................................................................................................... 774

18.5.3 Probabilities ......................................................................................................................... 774

18.5.5 Prediction ............................................................................................................................. 775

18.5.6 Prediction plot ..................................................................................................................... 775

18.5.7 Predicted values after applying the SVM model on new samples ..................................... 776

18.6. SVM method reference ......................................................................................... 776

18.7. Bibliography .......................................................................................................... 777

19.2. Introduction to Multivariate Curve Resolution (MCR).......................................... 779

19.2.1 MCR basics ........................................................................................................................... 780

19.2.2 Ambiguities and constraints in MCR .................................................................................... 782

19.2.3 MCR and 3-D data ................................................................................................................ 785

19.2.4 Algorithm implemented in The Unscrambler®: Alternating Least Squares (MCR-ALS) ........ 786

19.2.5 Main results of MCR ............................................................................................................ 788

19.2.6 Quality check in MCR ........................................................................................................... 789

19.2.7 MCR application examples ................................................................................................... 790

19.3. Tasks – Analyze – Multivariate Curve Resolution… .............................................. 791

19.3.1 Model Inputs ........................................................................................................................ 791

19.3.2 Options ................................................................................................................................ 792

19.4. Interpreting MCR plots ......................................................................................... 793

19.4.1 Predefined MCR plots .......................................................................................................... 794

19.5. MCR method reference ........................................................................................ 797

19.6. Bibliography .......................................................................................................... 797

20.2. Introduction to Hierarchical Modeling ................................................................. 799

20.2.1 Overall workflow.................................................................................................................. 799

20.2.2 Setup .................................................................................................................................... 800

20.2.3 Expected Scenarios .............................................................................................................. 800

20.3. Tasks – Analyze – Hierarchical Modeling .............................................................. 804

20.3.1 Defining actions ................................................................................................................... 805

20.3.2 Setting up a hierarchical model ........................................................................................... 811

20.3.3 Modifying an existing hierarchical model ............................................................................ 819

20.4. Prediction with Hierarchical Model ...................................................................... 819

20.5. Interpretation of results........................................................................................ 820

21.3. Tasks – Analyze – Segmented Correlation Outlier Analysis… ............................... 826

21.4. Tasks – Predict – Conformity… ............................................................................... 829

21.5. SCA Conformity Prediction Plots........................................................................... 830

21.5.1 Predefined prediction plots ................................................................................................. 830

21.6. Save model for SCA Conformity Prediction .......................................................... 832

21.7. Interpreting SCA plots ........................................................................................... 833

21.7.1 Predefined SCA plots ........................................................................................................... 834

21.8. SCA method reference .......................................................................................... 843

22.2. Introduction to Instrument Diagnostics................................................................ 845

22.2.1 RMS Noise ............................................................................................................................ 845

22.2.2 Peak Height/Peak Area (Peak Model) .................................................................................. 846

22.2.3 Peak Position........................................................................................................................ 846

22.2.4 Loss of Intensity ................................................................................................................... 847

22.2.5 PCA Projection ..................................................................................................................... 847

22.3. Tasks – Analyze – Instrument Diagnostics ............................................................ 847

22.3.1 Main Dialog .......................................................................................................................... 847

22.3.2 Add Model ........................................................................................................................... 848

22.3.3 RMS Noise ............................................................................................................................ 849

22.3.4 Peak Model .......................................................................................................................... 851

22.3.5 Peak Position........................................................................................................................ 854

22.3.6 Single Loss of Intensity Model ............................................................................................. 857

22.3.7 Principal Component Analysis Models ................................................................................. 858

22.4. Prediction with Instrument Diagnostics Model .................................................... 861

23.2. Introduction to Spectral Diagnostics .................................................................... 865

23.2.1 RMS Noise ............................................................................................................................ 865

23.2.2 Peak Height/Peak Area (Peak Model) .................................................................................. 866

23.2.3 Peak Position........................................................................................................................ 866

23.2.4 Loss of Intensity ................................................................................................................... 867

23.2.5 PCA Projection ..................................................................................................................... 867

23.3. Tasks – Analyze – Spectral Diagnostics ................................................................. 867

23.3.1 Main Dialog .......................................................................................................................... 867

23.3.2 Add Model ........................................................................................................................... 868

23.3.3 RMS Noise ............................................................................................................................ 869

23.3.4 Peak Model .......................................................................................................................... 871

23.3.6 Single Loss of Intensity Model ............................................................................................. 876

23.3.7 Principal Component Analysis Models ................................................................................. 878

23.4. Prediction with Spectral Diagnostics Model ......................................................... 880

24.2. Introduction to cluster analysis ............................................................................ 883

24.2.1 Basics ................................................................................................................................... 883

24.2.2 Principles of cluster analysis ................................................................................................ 884

24.2.3 Nonhierarchical clustering ................................................................................................... 884

24.2.4 Hierarchical clustering ......................................................................................................... 884

24.2.5 Quality of the clustering ...................................................................................................... 887

24.2.6 Main results of cluster analysis ............................................................................................ 888

24.3. Tasks – Analyze – Cluster Analysis… ..................................................................... 888

24.3.1 Inputs ................................................................................................................................... 889

24.3.2 Options for K-means/K-median clustering ........................................................................... 889

24.3.3 Results.................................................................................................................................. 891

24.4. Interpreting cluster analysis plots......................................................................... 892

24.4.1 Dendrogram ......................................................................................................................... 892

24.5. Cluster analysis method reference ....................................................................... 893

25.2. Introduction to projection of samples .................................................................. 895

25.2.1 Basics of projection .............................................................................................................. 895

25.2.2 How to interpret projected samples .................................................................................... 896

25.3. Tasks – Predict – Projection… ............................................................................... 898

25.3.1 Access the Projection functionality ...................................................................................... 898

25.4. Interpreting projection plots ................................................................................ 900

25.4.1 Predefined projection plots ................................................................................................. 901

25.4.2 Plots accessible from the Projection menu .......................................................................... 906

25.5. Projection method reference................................................................................ 913

26.2. Introduction to SIMCA classification..................................................................... 915

26.2.1 Making a SIMCA model ........................................................................................................ 915

26.2.2 Classifying new samples....................................................................................................... 916

26.2.3 Main results of classification................................................................................................ 916

26.3. Tasks – Predict – Classification – SIMCA… ............................................................ 918

26.4. Interpreting SIMCA plots ...................................................................................... 921

26.4.1 Predefined SIMCA plots ....................................................................................................... 921

26.5. SIMCA method reference ..................................................................................... 926

27.2. Introduction to Linear Discriminant Analysis (LDA) classification ........................ 927

27.2.1 Basics ................................................................................................................................... 927

27.2.2 Data suitable for LDA ........................................................................................................... 928

27.2.3 Purposes of LDA ................................................................................................................... 928

27.2.4 Main results of LDA .............................................................................................................. 929

27.2.5 LDA application examples .................................................................................................... 929

27.2.6 How to interpret LDA results ............................................................................................... 929

27.2.7 Using an LDA model for classification of unknowns ............................................................ 930

27.3. Tasks – Analyze – Linear Discriminant Analysis .................................................... 930

27.3.1 Inputs ................................................................................................................................... 930

27.3.2 Weights ................................................................................................................................ 931

27.3.3 Options ................................................................................................................................ 932

27.3.4 Autopretreatment ............................................................................................................... 933

27.4. Tasks – Predict – Classification – LDA… ................................................................ 934

27.5. Interpreting LDA results ........................................................................................ 934

27.5.1 Prediction ............................................................................................................................. 935

27.5.2 Confusion matrix .................................................................................................................. 935

27.5.3 Loadings matrix .................................................................................................................... 936

27.5.4 Grand mean matrix .............................................................................................................. 936

27.5.5 Discrimination Plot............................................................................................................... 936

27.6. LDA method reference .......................................................................................... 936

27.7. Bibliography .......................................................................................................... 937

28.2. Introduction to Support Vector Machine (SVM) classification ............................. 939

28.2.1 Principles of Support Vector Machine (SVM) classification ................................................. 939

28.2.2 What is SVM classification? ................................................................................................. 939

28.2.3 Data suitable for SVM classification ..................................................................................... 941

28.2.4 Main results of SVM classification ....................................................................................... 941

28.2.5 More details about SVM Classification ................................................................................ 942

28.2.6 SVM classification application examples ............................................................................. 942

28.3. Tasks – Analyze – Support Vector Machine classification .................................... 942

28.3.2 Options ................................................................................................................................ 943

28.3.3 Grid Search........................................................................................................................... 946

28.3.4 Weights ................................................................................................................................ 947

28.3.5 Validation ............................................................................................................................. 948

28.4. Tasks – Predict – Classification – SVM… ............................................................... 950

28.5. Interpreting SVM Classification results ................................................................. 951

28.5.1 Support vectors.................................................................................................................... 951

28.5.2 Confusion matrix .................................................................................................................. 951

28.5.3 Parameters........................................................................................................................... 952

28.5.4 Probabilities ......................................................................................................................... 952

28.5.5 Prediction ............................................................................................................................. 953

28.5.6 Accuracy ............................................................................................................................... 953

28.5.7 Plot of classification results ................................................................................................. 954

28.5.8 Classified range .................................................................................................................... 954

28.6. SVM method reference ......................................................................................... 955

28.7. Bibliography .......................................................................................................... 955

29.2. Introduction to Batch Modeling (BM)................................................................... 957

29.2.1 What is Batch Modeling ....................................................................................................... 957

29.3. Tasks – Analyze – Batch Modeling… ..................................................................... 957

29.3.1 Model Inputs tab ................................................................................................................. 957

29.3.2 Weights tab .......................................................................................................................... 959

29.3.3 Validation tab....................................................................................................................... 961

29.3.4 Warning Limits tab ............................................................................................................... 962

29.4. Interpreting BM plots............................................................................................ 964

29.4.1 Predefined BM plots ............................................................................................................ 965

29.5. BM method reference........................................................................................... 965

30.2. Introduction to Moving Block. .............................................................................. 967

30.2.1 Block Definitions .................................................................................................................. 967

30.2.2 Individual Block Mean (IBM) ................................................................................................ 968

30.2.3 Individual Block Standard Deviation (IBSD).......................................................................... 969

30.2.4 Moving Block Mean (MBM) ................................................................................................. 969

30.2.5 Moving Block Standard Deviation (MBSD) ........................................................................... 969

30.2.6 Percent Relative Standard Deviation (%RSD) ....................................................................... 970

30.3. Tasks – Analyze – Moving Block Methods ............................................................ 971

30.3.2 Region .................................................................................................................................. 971

30.4. Interpreting moving block plots............................................................................ 972

30.4.1 Predefined moving block plots ............................................................................................ 973

30.5. Tasks – Predict – Moving Block Statistics.............................................................. 975

30.6. Set Moving Block Limits ........................................................................................ 976

31.2. Introduction to Orthogonal Projection to Latent Structures (OPLS) .................... 977

31.2.1 Predictive scores and predictive loading weights ................................................................ 978

31.2.2 Y-loadings............................................................................................................................. 978

31.2.3 Orthogonal scores and orthogonal loading weights and loadings ....................................... 978

31.3. Tasks – Analyze – Orthogonal Projection to Latent Structures ............................ 979

31.3.1 Model Inputs tab ................................................................................................................. 979

31.3.2 Weights tabs ........................................................................................................................ 980

31.3.3 Validation tab....................................................................................................................... 983

31.3.4 Autopretreatments .............................................................................................................. 984

31.4. Interpreting OPLS plots ......................................................................................... 985

31.4.1 Predefined OPLS plots.......................................................................................................... 985

31.5. OPLS method reference ........................................................................................ 994

31.6. Bibliography .......................................................................................................... 994

32.2. Introduction to prediction from regression models ............................................. 995

32.2.1 When can prediction be used? ............................................................................................ 995

32.2.2 How does prediction work? ................................................................................................. 996

32.2.3 Short prediction modes for MLR, PLSR and PCR .................................................................. 996

32.2.4 Full prediction by projection onto a PCR or PLSR model ..................................................... 996

32.2.5 Main results of prediction .................................................................................................... 997

32.3. Tasks – Predict – Regression… .............................................................................. 999

32.3.1 Access the Prediction functionality ...................................................................................... 999

32.4. Interpreting prediction plots............................................................................... 1003

32.4.1 Predefined prediction plots ............................................................................................... 1003

32.4.2 Plots accessible from the Prediction menu ........................................................................ 1004

32.5. Prediction method reference.............................................................................. 1008

33.2.1 Inputs and outputs ............................................................................................................. 1009

33.2.2 Display................................................................................................................................ 1010

33.2.3 Options .............................................................................................................................. 1010

33.2.4 Outputs .............................................................................................................................. 1011

34.2. Multiple comparison of y-residuals .................................................................... 1013

34.3. Tasks – Predict – Multiple Model Comparison ................................................... 1013

34.4. Interpreting prediction plots............................................................................... 1015

34.4.1 Predefined prediction plots ............................................................................................... 1015

34.5. Method reference ............................................................................................... 1015

35.1.1 Content of the tutorials ..................................................................................................... 1017

35.1.2 How to use the tutorials .................................................................................................... 1017

35.1.3 Where to find the tutorial data files .................................................................................. 1017

35.2. Complete ............................................................................................................. 1018

35.2.1 Complete cases .................................................................................................................. 1018

35.2.2 Tutorial A: A simple example of calibration ....................................................................... 1019

35.2.3 Tutorial B: Quality analysis with PCA and PLS .................................................................... 1036

35.2.4 Tutorial C: Spectroscopy and interference problems ........................................................ 1069

35.2.5 Tutorial D1: Screening design ............................................................................................ 1092

35.2.6 Tutorial D2: Optimization design ....................................................................................... 1107

35.2.7 Tutorial E: SIMCA classification .......................................................................................... 1120

35.2.8 Tutorial F: Interacting with other programs ...................................................................... 1133

35.2.9 Tutorial G: Mixture design ................................................................................................. 1148

35.2.10 Tutorial H: PLS Discriminant Analysis (PLS-DA) .................................................................. 1164

35.2.11 Tutorial I: Multivariate curve resolution (MCR) of dye mixtures ....................................... 1177

35.2.12 Tutorial J: MCR constraint settings .................................................................................... 1189

35.2.13 Tutorial K: Clustering.......................................................................................................... 1202

35.2.14 Tutorial L: L-PLS Regression ............................................................................................... 1215

35.2.15 Tutorial M: Variable selection and model stability ............................................................ 1231

35.3. Quick ................................................................................................................... 1240

35.3.1 Quick start tutorials ........................................................................................................... 1240

35.3.2 Projection quick start ......................................................................................................... 1241

35.3.3 SIMCA quick start ............................................................................................................... 1243

35.3.4 MLR quick start .................................................................................................................. 1244

35.3.5 PCR quick start ................................................................................................................... 1247

35.3.6 PLS quick start .................................................................................................................... 1254

35.3.8 Cluster quick start .............................................................................................................. 1263

35.3.9 MCR quick start .................................................................................................................. 1265

35.3.10 LDA quick start ................................................................................................................... 1268

35.3.11 LDA classification quick start.............................................................................................. 1272

35.3.12 SVM quick start .................................................................................................................. 1273

35.3.13 SVM classification quick start ............................................................................................ 1277

35.3.14 PCA quick start ................................................................................................................... 1278

36.2. Statement of Compliance ................................................................................... 1283

36.2.1 Introduction ....................................................................................................................... 1283

36.2.2 Overview ............................................................................................................................ 1283

36.2.3 Other software applications .............................................................................................. 1283

36.2.4 Statement of 21 CFR Part 11 Compliance .......................................................................... 1283

36.3. Compliance mode in The Unscrambler® X .......................................................... 1284

36.3.1 Main features of the compliance mode ............................................................................. 1284

36.3.2 A comprehensive approach to security and data integrity ................................................ 1285

36.4. Digital Signatures ................................................................................................ 1285

36.4.1 Digital Signature implementation in The Unscrambler® X ............................................... 1285

36.4.2 How to assign a digital signature to a project .................................................................... 1286

36.4.3 How to tell if a project has been signed ............................................................................. 1287

36.4.4 Digital signatures and 21 CFR Part 11 ................................................................................ 1288

36.5. References .......................................................................................................... 1288

37.2. Glossary of terms ................................................................................................ 1289

37.3. Method reference ............................................................................................... 1320

37.4. Keyboard shortcuts ............................................................................................. 1320

37.5. Smarter, simpler multivariate data analysis: The Unscrambler® X..................... 1321

37.5.1 Workflow oriented main screen ........................................................................................ 1322

37.5.2 A new look for a new generation ....................................................................................... 1322

37.5.3 New analysis methods ....................................................................................................... 1325

37.5.4 General improvements and inclusions summary ............................................................... 1327

37.6. What’s new in The Unscrambler® X version 10.3 ............................................... 1328

37.7. What’s new in The Unscrambler® X ver 10.2 ...................................................... 1329

37.8. Applicability......................................................................................................... 1329

37.9. Design of Experiments ........................................................................................ 1330

37.11. Known Limitations in The Unscrambler® X ver 10.2 .................................. 1332

37.12. What’s new in The Unscrambler® X ver 10.1............................................. 1332

37.13. Data Import ................................................................................................ 1332

37.14. Data Export ................................................................................................ 1332

37.15. Applicability ............................................................................................... 1333

37.16. Design of Experiments ............................................................................... 1333

37.17. Overall Enhancements ............................................................................... 1333

37.18. Known Limitations in The Unscrambler® X ver 10.1 .................................. 1334

37.19. What’s new in The Unscrambler® X ver 10.0.1.......................................... 1334

37.20. Data Import ................................................................................................ 1334

37.21. Tutorials ..................................................................................................... 1334

37.22. Applicability ............................................................................................... 1335

37.23. Design of Experiments ............................................................................... 1335

37.24. Known Limitations in The Unscrambler® X ver 10.0.1 ............................... 1335

37.25. What’s new in The Unscrambler® X........................................................... 1336

37.26. System Requirements ................................................................................ 1337

37.27. Installation ................................................................................................. 1337

38.1.1 Statistics and multivariate data analysis ............................................................................ 1339

38.1.2 Basic statistical tests .......................................................................................................... 1341

38.1.3 Design of experiments ....................................................................................................... 1341

38.1.4 Multivariate curve resolution ............................................................................................ 1342

38.1.5 Classification methods ....................................................................................................... 1342

38.1.6 Data transformations and pretreatments .......................................................................... 1343

38.1.7 L-shaped PLS ...................................................................................................................... 1344

38.1.8 Martens’ uncertainty test .................................................................................................. 1344

38.1.9 Data formats ...................................................................................................................... 1344

1. Welcome to The Unscrambler® X

The Unscrambler® is a complete multivariate data analysis and experimental design software

solution, equipped with powerful methods including PCA, PLS, clustering and classification.

Video demonstration of the new user interface

Migrating from earlier versions

Tutorials

Keyboard shortcuts

How to use the help documentation

See the release notes for a list of fixes, new features and known limitations.

2. Support Resources

2.1. Support resources on our website

Our web site is filled with resources, case studies, recorded webinars as well as information

about our products and commercial offerings, including courses and professional services.

Support

Webinars

Training courses

Consulting

3. Overview

3.1. What is The Unscrambler® X?

A brief review of the tasks that can be carried out using The Unscrambler® X.

Make well-designed experimental plans

Reformat, transform and plot data

Study variations among one group of variables

Study relations between two groups of variables

Validate multivariate models with uncertainty testing

Estimate new, unknown response values

Classify unknown samples

Reveal groups of samples

The main strength of The Unscrambler® X is to provide simple-to-use tools for analysis of any

sort of multivariate data. This involves finding variations, co-variations and other internal

relationships in data matrices (tables). One can also use The Unscrambler® X to set up an

experimental design to achieve the maximum information as efficiently as possible.

The following are the basic types of problems that can be solved using The Unscrambler® X:

Set up experiments, analyze effects and find optima using the Design of Experiments

(DoE) module;

Reformat and preprocess data to enhance future analyses;

Find relevant variation in one data matrix (X);

Find relationships between two data matrices (X and Y);

Validate multivariate models with Uncertainty Testing;

Resolve unknown mixtures by finding the number of pure components and

estimating their concentration profiles and spectra;

Predict the unknown values of a response variable;

Classify unknown samples into various possible categories.

One should always remember, however, that there is no point in trying to analyze data if

they do not contain any meaningful information. Experimental design is a valuable tool for

building data tables which give such meaningful information. The Unscrambler® can help to

do this in an elegant way.

The Unscrambler® satisfies the US FDA’s requirements for 21 CFR Part 11 compliance.

Choosing samples carefully increases the chance of extracting useful information from data.

Furthermore, being able to actively experiment with the variables also increases the chance

of extracting relationships. The critical part is deciding which variables to change, which

intervals to use for this variation, and the pattern of the experimental points.

The Unscrambler X Main

The purpose of experimental design is to generate experimental data that enable one to

determine which design variables (X) have an influence on the response variables (Y), in

order to understand the interactions between the design variables and thus determine the

optimum conditions. Of course, it is equally important to do this with a minimum number of

experiments to reduce costs. An experimental design program should offer appropriate

design methods and encourage good experimental practice, i.e. allow one to perform few

but useful experiments which span the important variations.

Screening designs (e.g. fractional factorial, full factorial and Plackett-Burman) are used to find out

which design variables have an effect on the responses and are suitable for collection of data

spanning all important variations.
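
As a minimal sketch of what a two-level full factorial screening design contains, the following Python snippet enumerates every combination of low (-1) and high (+1) levels. The function name `full_factorial` and the three design variables are hypothetical and chosen only for illustration; this is not how The Unscrambler® builds its designs internally.

```python
from itertools import product


def full_factorial(levels):
    """Return every combination of the supplied factor levels, one dict per run."""
    names = list(levels)
    return [
        dict(zip(names, combination))
        for combination in product(*(levels[name] for name in names))
    ]


# Hypothetical design variables, coded -1 (low) and +1 (high).
design = full_factorial({
    "temperature": [-1, 1],
    "pH": [-1, 1],
    "stir_rate": [-1, 1],
})

for run_number, run in enumerate(design, start=1):
    print(run_number, run)
```

With three variables at two levels each, the design has 2^3 = 8 runs, and every variable is tested equally often at each level. A fractional factorial design would keep only a balanced subset of these runs.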

Optimization designs (e.g. central composite, Box-Behnken) aim to find the optimum

conditions for a process and generate nonlinear (quadratic) models. They generate data

tables that describe relationships in more detail, and are usually used to refine a model, i.e.

after the initial screening has been performed.

Whether the purpose of designed experiments is screening or optimization, there may be

multilinear constraints among some of the design variables. In such a case a D-optimal

design may be required.

Another special case is that of mixture designs, where the main design variables are the

components of a mixture. The Unscrambler® provides the classical types of mixture designs,

with or without additional constraints.

There are several methods for analysis of experimental designs. The Unscrambler® uses

Multiple Linear Regression (MLR) as its default method for orthogonal designs. For

non-orthogonal designs, or when the levels of a design cannot be reached, The Unscrambler®

allows the use of other methods, such as PCR or PLS, for this purpose.

Raw data may have a distribution that is not optimal for analysis. Background effects,

measurements in different units, different variances in variables etc. may make it difficult for

the methods to extract meaningful information. Preprocessing or transformations help in

reducing the “noise” introduced by such effects.
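
To illustrate how a pretreatment can remove the effect of different units and variances, the sketch below applies autoscaling (mean centering followed by scaling to unit standard deviation) to each column of a small table. Autoscaling is used here as one common example of such a transformation; the function name `autoscale` and the data are assumptions made for illustration.

```python
from statistics import mean, stdev


def autoscale(table):
    """Mean-center each column and scale it to unit standard deviation."""
    columns = list(zip(*table))
    centers = [mean(column) for column in columns]
    spreads = [stdev(column) for column in columns]
    return [
        [(value - center) / spread
         for value, center, spread in zip(row, centers, spreads)]
        for row in table
    ]


# Hypothetical data: column 1 in grams, column 2 in milligrams.
raw = [
    [1.0, 2000.0],
    [2.0, 1500.0],
    [3.0, 2500.0],
    [4.0, 3000.0],
]
scaled = autoscale(raw)

for row in scaled:
    print([round(value, 3) for value in row])
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates an analysis simply because of the unit it was measured in.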

Before applying transforms, it is important to look at the data from a slightly different point

of view. Sorting samples or variables and transposing the data table are examples of such

reformatting operations.

Whether the data have been reformatted and transformed or not, a quick plot may reveal

more about the data than is to be seen with the naked eye on a mere collection of numbers.

Various types of plots are available in The Unscrambler®. They facilitate visual checks of

individual variable distributions, allow one to study the correlation between two variables or

examine samples as, for example, a 3-D swarm of points or a 3-D landscape.

A common problem is to determine which variables actually contribute to the variation seen

in a given data matrix; i.e. to find answers to questions such as

“Which samples are similar to each other?”

“Are there groups of samples in a particular data set?”

“What is the meaning of these sample patterns?”

The Unscrambler® finds this information by decomposing the data matrix into a structured

part and a noise part, using a technique called Principal Component Analysis (PCA).
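
The decomposition into a structured part and a noise part can be sketched as follows for a single principal component. The snippet centers the data, finds the first loading vector by power iteration, and treats whatever the component does not explain as residuals. Power iteration is used here only because it needs no external library; it is an assumption of this sketch and not necessarily the algorithm used by The Unscrambler®. The function name and the data are hypothetical.

```python
import math


def first_principal_component(table, iterations=200):
    """Return (scores, loadings, residuals) for one component of the centered data."""
    n_vars = len(table[0])
    means = [sum(row[j] for row in table) / len(table) for j in range(n_vars)]
    centered = [[row[j] - means[j] for j in range(n_vars)] for row in table]

    loadings = [1.0 / math.sqrt(n_vars)] * n_vars  # arbitrary unit-length start
    for _ in range(iterations):
        scores = [sum(x * p for x, p in zip(row, loadings)) for row in centered]
        # Project the data onto the current scores, then renormalize to unit length.
        updated = [
            sum(t * row[j] for t, row in zip(scores, centered))
            for j in range(n_vars)
        ]
        norm = math.sqrt(sum(v * v for v in updated))
        loadings = [v / norm for v in updated]

    scores = [sum(x * p for x, p in zip(row, loadings)) for row in centered]
    # Structured part = scores * loadings; the remainder is the noise part.
    residuals = [
        [x - t * p for x, p in zip(row, loadings)]
        for row, t in zip(centered, scores)
    ]
    return scores, loadings, residuals


# Hypothetical samples lying close to the line y = 2x.
data = [[2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8], [6.0, 12.0]]
scores, loadings, residuals = first_principal_component(data)
print("loadings:", [round(p, 3) for p in loadings])
print("residual sum of squares:",
      round(sum(v * v for row in residuals for v in row), 4))
```

Because the two variables vary together almost perfectly, one component captures nearly all of the variation and the residuals are small.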

Classical descriptive statistics are also available in The Unscrambler®. Mean, standard

deviation, minimum, maximum, median and quartiles provide an overview of the univariate

distributions of variables, allowing for their comparison. In addition, the correlation matrix

provides a summary of the covariations among variables.
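
As an illustrative sketch, the descriptive statistics and the correlation matrix listed above can be computed with the Python standard library. The helper names `describe`, `correlation` and `correlation_matrix` and the data are hypothetical. Note that several conventions exist for computing quartiles, so the values returned by `statistics.quantiles` may differ slightly from those reported by other software.

```python
import math
from statistics import mean, median, quantiles, stdev


def describe(values):
    """Univariate summary of one variable."""
    q1, _, q3 = quantiles(values, n=4)  # quartile convention: library default
    return {
        "mean": mean(values),
        "std": stdev(values),
        "min": min(values),
        "max": max(values),
        "median": median(values),
        "q1": q1,
        "q3": q3,
    }


def correlation(x, y):
    """Pearson correlation coefficient between two variables."""
    mx, my = mean(x), mean(y)
    covariation = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return covariation / spread


def correlation_matrix(table):
    """Correlation between every pair of variables in a name -> values mapping."""
    names = list(table)
    return {a: {b: correlation(table[a], table[b]) for b in names} for a in names}


# Hypothetical variables measured on five samples.
variables = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
}
matrix = correlation_matrix(variables)
print(describe(variables["a"]))
print(matrix["a"])
```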

In the case of instrumental measurements (such as spectra or voltammograms) performed

on samples representing mixtures of a few pure components at varying concentrations or at

different stages of a process (such as chromatography), The Unscrambler® offers a method

for recovering the unknown concentrations, called Multivariate Curve Resolution (MCR).

Another common problem is establishing a regression model between two data matrices.

For example, one may have a set of many inexpensive measurements (X) of properties of a

set of different solutions, and want to relate these measurements to the

concentration of a particular compound (Y) in the solution. The concentrations of the

particular compound are usually found using a reliable reference method.

In order to do this, it is necessary to find the relationship between the two data matrices.

This task varies somewhat depending on whether the data have been generated using

statistical experimental design or have simply been collected, more or less at random, from

a given population (i.e. non-designed data).

The variables in designed data tables (excluding mixture or D-optimal designs) are

orthogonal. Traditional statistical methods such as ANOVA and MLR are well suited to make

a regression model from orthogonal data tables.

The variables in non-designed data matrices are seldom orthogonal, but rather more or less

collinear with each other. MLR will most likely fail in such circumstances, so the use of

projection techniques such as PCR or PLS is recommended.
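
One way to see why collinearity is a problem for MLR is to look at the variance inflation factor (VIF), which for two predictors equals 1 / (1 - r²), where r is their correlation. The VIF is not mentioned in this manual; it is brought in here as a standard diagnostic, and the function names and data below are hypothetical. A VIF close to 1 means the predictors are nearly orthogonal, whereas a very large VIF means the MLR coefficients are estimated with very large uncertainty.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient between two predictors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    covariation = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return covariation / spread


def variance_inflation(x1, x2):
    """VIF for a model with two predictors (undefined if they are exactly collinear)."""
    r = correlation(x1, x2)
    return 1.0 / (1.0 - r * r)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2_orthogonal = [3.0, 1.0, 2.0, 2.0, 1.0, 3.0]        # uncorrelated with x1
x2_collinear = [1.01, 1.99, 3.02, 3.98, 5.01, 5.99]   # almost a copy of x1

vif_independent = variance_inflation(x1, x2_orthogonal)
vif_collinear = variance_inflation(x1, x2_collinear)
print("VIF, orthogonal predictors:", round(vif_independent, 2))
print("VIF, collinear predictors: ", round(vif_collinear))
```

Projection methods such as PCR and PLS avoid this problem by regressing on a few mutually orthogonal components instead of on the original collinear variables.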

Whatever the purpose in multivariate modeling – explore, describe precisely, build a

predictive model – validation is an important issue. Only a proper validation can ensure that

the model results are not too highly dependent on some extreme samples, and that the

predictive power of the regression model meets the experimental objectives.
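
The idea of validation can be sketched with leave-one-out cross validation on a simple straight-line model: each sample is left out in turn, the model is fitted on the remaining samples, and the left-out sample is predicted. The function names and the data are hypothetical, and a univariate least-squares line stands in for the multivariate models discussed in this manual.

```python
import math


def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


def rmse_leave_one_out(x, y):
    """Root mean square error of leave-one-out cross validation."""
    errors = []
    for i in range(len(x)):
        # Fit the model without sample i, then predict sample i.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        errors.append(y[i] - (slope * x[i] + intercept))
    return math.sqrt(sum(e * e for e in errors) / len(errors))


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_clean = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
y_outlier = [2.1, 3.9, 6.2, 7.8, 10.1, 20.0]  # last sample is extreme

rmsecv_clean = rmse_leave_one_out(x, y_clean)
rmsecv_outlier = rmse_leave_one_out(x, y_outlier)
print("RMSECV, clean data:  ", round(rmsecv_clean, 3))
print("RMSECV, with outlier:", round(rmsecv_outlier, 3))
```

A single extreme sample inflates the cross-validated error considerably, which is how validation reveals that a model depends too heavily on such samples.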

With the help of Martens’ Uncertainty Test, the power of cross validation is further

increased and allows one to:

Study the influence of individual samples in a model with powerful, simple to

interpret graphical representations;

Test the significance of the predictor variables and remove unimportant predictors

from a PLS or PCR model.

A regression model can be used to predict new, i.e. unknown, Y-values. Prediction is a useful

technique as it can replace costly and time-consuming measurements. A typical example is

the prediction of concentrations from absorbance spectra instead of direct measurements of

them by, for example, titration.

Classification simply means to find out whether new samples are similar to classes of

samples that have been used to make models in the past. If a new sample fits a particular

model well, it is said to be a member of that class. Classification can be done using several

different techniques including SIMCA, LDA, SVM classification and PLS-DA.

Many analytical tasks fall into this category. For example, raw materials may be sorted into

“good” and “bad” quality, finished products classified into grades “A”, “B”, “C”, etc.

Clustering attempts to group samples into ‘k’ clusters based on specific distance

measurements.

In The Unscrambler®, clustering can be applied to a data set using the K-Means algorithm, as

well as using hierarchical clustering (HCA). Seven different types of distance measurements

are provided (including Chebyshev and Bray-Curtis) along with popular algorithms, including

Ward’s method.
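
The following sketch shows two of the distance measures named above, Chebyshev and Bray-Curtis, together with a bare-bones K-Means loop that accepts any distance function. It is a simplified illustration under stated assumptions: cluster centers are updated as plain means whatever distance is chosen, the initial centers are supplied by hand, and the function names and data are hypothetical. It does not reproduce the implementation in The Unscrambler®.

```python
def chebyshev(a, b):
    """Largest absolute difference over all variables."""
    return max(abs(x - y) for x, y in zip(a, b))


def bray_curtis(a, b):
    """Sum of absolute differences divided by the sum of absolute sums."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(abs(x + y) for x, y in zip(a, b))


def k_means(samples, initial_centers, distance, max_iterations=50):
    """Assign each sample to its nearest center, update centers, repeat until stable."""
    centers = [list(center) for center in initial_centers]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(len(centers)), key=lambda k: distance(sample, centers[k]))
            for sample in samples
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for k in range(len(centers)):
            members = [s for s, label in zip(samples, labels) if label == k]
            if members:
                centers[k] = [sum(column) / len(members) for column in zip(*members)]
    return labels, centers


# Hypothetical samples forming two well-separated groups.
samples = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]
labels, centers = k_means(samples, [samples[0], samples[3]], chebyshev)
print(labels)
print(centers)
```

Passing `bray_curtis` instead of `chebyshev` as the `distance` argument changes how similarity is measured without changing the rest of the loop.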

Overall, The Unscrambler® is a complete, All-In-One Multivariate Data Analysis and Design of

Experiment package, which can be used to investigate simple, through to extremely large

and complex data tables, for most applications. It provides the analytical tools most

commonly used and requested by most data analysts. The plug-in architecture allows for the

inclusion of new transforms and methods as they become available, and software validation has

been greatly simplified as a result of this. The Unscrambler® meets the data security

requirements for regulated industries.

Related topics:

User interface basics

Principles of regression

Principles of classification

Multivariate classification is split into two equally important areas: cluster analysis and

discriminant analysis.

Cluster analysis methods can be used to find groups in the data without any predefined class

structure and are referred to as unsupervised learning. Cluster analysis is highly exploratory,

but can sometimes, especially at an early stage of an investigation, be very useful.

Discriminant analysis is a supervised classification method, as it is used to build classification

rules for a number of prespecified classes. These rules (model) are later used for allocating

new and unknown samples to the most probable class. Another important application of

discriminant analysis is to help in interpreting differences between groups of samples.

Purposes of classification

Classification methods

SIMCA classification

Linear Discriminant Analysis

Support Vector Machines classification

PLS Discriminant Analysis

Steps in SIMCA classification

Classifying new samples

Outcomes of a classification

Classification based on a regression model

The main goal of classification is to reliably assign new samples to existing classes (in a given

population). Note that classification is not the same as clustering.

One can also use classification results as a diagnostic tool:

to distinguish among the most important variables to keep in a model (variables that

“characterize” the population);

or to find outliers (samples that are not typical of the population).

It follows that, contrary to regression, which predicts the values of one or several

quantitative variables, classification is useful when the response is a category variable that

can be interpreted in terms of several classes to which a sample may belong.

Examples of such situations are:

Predicting whether a product meets quality requirements, where the result is simply

“Yes” or “No” (i.e. binary response).

Modeling various close species of plants or animals according to their easily

observable characteristics, so as to be able to decide whether new individuals

belong to one of the modeled species.

Modeling various diseases according to a set of easily observable symptoms, clinical

signs or biological parameters, so as to help future diagnostic of those diseases.

This chapter presents the purpose of sample classification, and provides a brief overview of

the classification methods available in The Unscrambler®:

Linear Discriminant Analysis (LDA)

Support Vector Machine (SVM) Classification

Cluster analysis

Projection

for is a category group variable, and not a continuous measurement as would be the case for

a quantitative calibration (regression).

The Unscrambler X Main

visualization tool in data mining. One can perform clustering using either nonhierarchical

methods (K-means or K-medians clustering) or hierarchical clustering with different linkage

measures (single-linkage, complete-linkage, average-linkage, median-linkage, etc.).

Agglomerative hierarchical methods begin by treating each sample as a single cluster and

then merge the most similar clusters step by step until one large cluster is formed.

The main categories of cluster analysis in The Unscrambler® are nonhierarchical clustering

(K-means, K-medians) and hierarchical cluster analysis (HCA).
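As an illustration of nonhierarchical clustering, the following is a minimal K-means sketch in Python with NumPy. It is not the implementation used in The Unscrambler®; the function name `k_means`, the Euclidean distance, the random initialization and the toy data are all assumptions made for the example.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain K-means with Euclidean distance (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    # Use k distinct samples, chosen at random, as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well-separated groups of three samples each.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])
labels, centroids = k_means(X, k=2)
```

A K-medians variant would replace the mean by the median when updating the centroids; a different distance measure would change only the assignment step.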

SIMCA classification

Soft Independent Modeling of Class Analogy (SIMCA) is based on making a PCA model for

each class in the training set. Unknown samples are then compared to the class models and

assigned to classes according to their analogy to the training samples.

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is the simplest of all possible classification methods that

are based on Bayes’ formula. The objective of LDA is to determine the best fit parameters

for classification of samples by a developed model. The model can then be used to classify

unknown samples. It is based on the normal distribution assumption and the assumption

that the covariance matrices of the two (or more) groups are identical.
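The two assumptions named above (normally distributed classes and a common covariance matrix) lead to a linear scoring rule. The sketch below shows one textbook way to write it with NumPy; the function names `fit_lda` and `predict_lda`, the use of class frequencies as priors, and the data are assumptions for illustration, not the program's own algorithm.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate class means, priors and the pooled covariance matrix."""
    classes = np.unique(y)
    n, p = X.shape
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    # Pooled within-class covariance: this encodes the assumption that
    # all classes share the same covariance matrix.
    pooled = np.zeros((p, p))
    for c, m in zip(classes, means):
        d = X[y == c] - m
        pooled += d.T @ d
    pooled /= n - len(classes)
    return classes, means, priors, pooled

def predict_lda(model, X_new):
    """Assign each row of X_new to the class with the highest linear score."""
    classes, means, priors, pooled = model
    inv = np.linalg.inv(pooled)
    # Linear discriminant score per class (log posterior up to a constant).
    scores = (X_new @ inv @ means.T
              - 0.5 * np.sum(means @ inv * means, axis=1)
              + np.log(priors))
    return classes[scores.argmax(axis=1)]

# Hypothetical training data: two classes, two variables.
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5],
              [6.0, 7.0], [6.5, 6.8], [7.0, 7.2], [6.2, 7.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
model = fit_lda(X, y)
pred = predict_lda(model, np.array([[1.3, 2.0], [6.4, 7.1]]))
```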

Support Vector Machines classification

Support Vector Machines (SVM) is a classification method based on statistical learning.

Sometimes, a linear function is not able to model complex separations, so SVM employs

kernel functions to map from the original space to the feature space. The function can be of

many forms, thus providing the ability to handle nonlinear classification cases. The kernels

can be viewed as a mapping of nonlinear data to a higher-dimensional feature space, while

providing a computational shortcut by allowing linear algorithms to work in that

higher-dimensional feature space.
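The computational shortcut can be made concrete with a small example. The sketch below uses a degree-2 polynomial kernel (chosen here only for illustration; it is not necessarily one of the kernels offered in the software) and checks that evaluating the kernel in the original two-dimensional space gives the same number as an ordinary dot product in the explicit three-dimensional feature space.

```python
import math

def phi(x):
    """Explicit quadratic feature map for a 2-D input (3-D feature space)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(a, b):
    """Ordinary dot product of two equally long sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2, computed in the original space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # map first, then take the dot product
implicit = poly_kernel(x, z)     # never leaves the original space
```

Both routes give the same value, but the kernel route never constructs the feature vectors, which is what makes very high-dimensional feature spaces affordable.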

PLS Discriminant Analysis

The discriminant analysis approach differs from the SIMCA approach in that it assumes that

a sample has to be a member of one of the classes included in the analysis. The most

common case is that of a binary discriminant variable: a question with a Yes / No answer.

Binary discriminant analysis is performed using regression, with the discriminant variable

coded 0 / 1 (Yes = 1, No = 0) as the Y-variable in the model.

With PLS, this can easily be extended to the case of more than two classes. Each class is

represented by an indicator variable, i.e. a binary variable with value 1 for members of that

class, 0 for non-members. By building a PLS model with all indicator variables as Y, one can

directly predict class membership from the X-variables describing the samples. The model is

interpreted by viewing the Predicted vs. Reference plot for each class indicator Y-variable:

Ypred > 0.5 means “roughly 1”, that is to say member;

Ypred < 0.5 means “roughly 0”, that is to say non-member.

Once the PLS model has been checked and validated (see the chapter about multivariate

regression for more details on diagnosing and validating a model), one can run a Prediction

in order to classify new samples. The prediction results are interpreted by viewing the plot

Predicted with Deviations for each class indicator Y-variable:

Samples with Ypred > 0.5 and a deviation that does not cross the 0.5 line are

predicted members;

Samples with Ypred < 0.5 and a deviation that does not cross the 0.5 line are

predicted nonmembers;

Samples with a deviation that crosses the 0.5 line cannot be safely classified.
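The three rules above can be written as a small decision function. This is only a sketch: the function name `classify_membership` is hypothetical, and the deviation is assumed to be a symmetric band around the predicted value.

```python
def classify_membership(y_pred, deviation, threshold=0.5):
    """Apply the 0.5-line rule to one predicted class-indicator value.

    Assumes the uncertainty band is y_pred +/- deviation.
    """
    lower, upper = y_pred - deviation, y_pred + deviation
    if lower > threshold:
        return "member"        # whole band lies above the 0.5 line
    if upper < threshold:
        return "nonmember"     # whole band lies below the 0.5 line
    return "uncertain"         # band crosses (or touches) the 0.5 line

results = [
    classify_membership(0.9, 0.2),
    classify_membership(0.1, 0.3),
    classify_membership(0.6, 0.2),
]
```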

See Chapter Prediction for more details on how to run a prediction and interpret results. A

tutorial explaining PLS-DA in practice is also available: PLS Discriminant Analysis.

Solving a classification problem requires two steps:

Modeling: Build one separate model for each class;

Classifying new samples: Fit each sample to each model and decide whether the

sample belongs to the corresponding class.

The modeling stage implies that enough samples have been identified as members of each

class to be able to build a reliable model. It also requires enough variables to describe the

samples accurately.

The actual classification stage uses significance tests, where the decisions are based on

statistical tests performed on the object-to-model distances.

Once each class has been modeled, and provided that the classes do not overlap too much,

new samples can be fitted to (projected onto) each model. This means that for each sample,

new values for all variables are computed using the scores and loadings of the model, and

compared to the actual values.

The residuals are then combined into a measure of the object-to-model distance.

The scores are also used to build up a measure of the distance of the sample to the model

center, called leverage.

Finally, both object-to-model distance and leverage are taken into account to decide which

class(es) the sample belongs to.

The classification decision rule is based on a classical statistical approach. If a sample belongs

to a class, it should have a small distance to the class model (the ideal situation being

“distance=0”). Given a new sample, one needs to compare its distance to the model to a

class membership limit reflecting the probability distribution of object-to-model distances

around zero.
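The projection step described above can be sketched with NumPy as follows. The class model here is a mean-centered PCA model; the function names `fit_class_model` and `project`, the simplified residual scale `s0` (no degrees-of-freedom correction) and the data are assumptions for illustration. The statistical membership limit used by the software is not reproduced.

```python
import numpy as np

def fit_class_model(X, n_components):
    """Build a PCA model for one class from its training samples."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T                 # variables x components
    scores = Xc @ loadings                         # samples x components
    residuals = Xc - scores @ loadings.T
    # Typical object-to-model distance of the training samples (simplified).
    s0 = np.sqrt(np.mean(np.sum(residuals ** 2, axis=1)))
    return {"mean": mean, "loadings": loadings, "n": len(X),
            "score_ss": np.sum(scores ** 2, axis=0), "s0": s0}

def project(model, x):
    """Return (object-to-model distance, leverage) for a new sample x."""
    xc = x - model["mean"]
    t = xc @ model["loadings"]                     # scores of the new sample
    e = xc - t @ model["loadings"].T               # part not explained by the model
    distance = np.sqrt(np.sum(e ** 2))
    leverage = 1.0 / model["n"] + np.sum(t ** 2 / model["score_ss"])
    return distance, leverage

# Hypothetical class: samples lying close to a straight line in 3-D.
X = np.array([[-2.0, -2.0, 0.01], [-1.0, -1.0, -0.01], [0.0, 0.0, 0.02],
              [1.0, 1.0, -0.02], [2.0, 2.0, 0.0]])
model = fit_class_model(X, n_components=1)
d_in, h_in = project(model, np.array([0.5, 0.5, 0.0]))     # resembles the class
d_out, h_out = project(model, np.array([0.5, -0.5, 1.0]))  # does not
```

The first sample has a distance well below the typical training distance `s0`, while the second lies far from the model and would be rejected by any reasonable membership limit.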

There are three possible outcomes of a classification:

Unknown sample belongs to one class;

Unknown sample belongs to several classes;

Unknown sample belongs to none of the classes.

If the classes have been modeled with enough precision, the second case should not occur

(no overlap). If it does occur, this means that the class models might need improvement, i.e.

more calibration samples and/or additional variables should be included.

The last case is not necessarily a problem. It may be a quite interpretable outcome,

especially in a one-class problem. A typical example is product quality prediction, which can

be done by modeling the single class of acceptable products. If a new sample belongs to the

modeled class, it is accepted; otherwise, it is rejected.

Throughout this chapter, SIMCA classification is described as a method involving disjoint PCA

modeling. Instead of PCA models, one can also use PCR or PLS models. In those cases, only

the X-part of the model will be used. The results will be interpreted in exactly the same way.

SIMCA classification based on the X-part of a regression model is a nice way to detect

whether new samples are suitable for prediction. If the samples are recognized as members

of the class formed by the calibration sample set, the predictions for those samples should

be reliable. Conversely, one should avoid using any model for extrapolation, i.e. making

predictions on samples which are rejected by the classification.

Besides, classification may be achieved with a regression technique called Linear

Discriminant Analysis (LDA), which is an alternative to SIMCA.

The help system has been implemented to provide help and advice to those working with

The Unscrambler®. Help covers use of the dialogs and methods, and interpretation of plots.

For best viewing of the contents users are recommended to have Internet Explorer 7.0 or

higher.

Browsing the contents

Searching the contents

Typographic cues

Press the F1 key or click on the ? help button near the top right corner of the active dialog

window to read help for the appropriate topic.

The help documentation can also be opened for browsing by selecting Help - Contents from

the menu, or pressing the Help button in the toolbar.

Several levels of help are available. Click on underlined words to follow built-in hypertext

links to related topics.

The Help documentation can be read as a book by clicking through the chapters and

sections, accessing chapters from the table of contents displayed to the left.

The left-most window consists of two tabs for switching between a Contents hierarchical

view, and the Search utility.

The search engine allows one to search for occurrences of one or several words. Select a

page from the result list to read it.

Use Find in page to search for a phrase within the current page.

The help documentation text itself provides typographic cues to the reader:

Emphasized text (italic) indicates important concepts, or variables.

Strong emphasis (bold) indicates actions, e.g. a menu entry or button.

Dotted underline indicates abbreviations. Hover the mouse pointer over such text for

a tooltip explanation for the acronym.

Computer code text indicates file name selectors like *.unsb, and command input

such as X=sqrt(X).

A globe icon indicates that the hypertext link will open external content in the

system default web browser, such as http://www.camo.com/

A table grid icon indicates that the hypertext link will open, import or download a

data set, like this: Import the tutorial A data

Hovering the mouse pointer over figures will display the caption as a tooltip.

Useful tips are put in text boxes like this.

Caution notes are put in text boxes like this.

Regression is used to find out how well some predictor variables (X) explain the

variations in some response variables (Y) using methods such as MLR, PCR, PLSR and L-PLSR.

What is regression?

General notation and definitions

The whys and hows of regression modeling

What is a good regression model?

Regression methods in The Unscrambler®

Multiple Linear Regression (MLR)

Principal Component Regression (PCR)

Partial Least Squares Regression (PLSR)

L-PLS Regression

Support Vector Machine Regression (SVMR)

Calibration, validation and related samples

Main results of regression

Making the right choice with regression methods

How to interpret regression results

How to detect nonlinearities (lack of fit)

What are outliers and how are they detected?

Guidelines for calibration of spectroscopic data

Regression is a generic term used for all methods that attempt to model and analyze several

variables with the purpose of building a relationship between two groups of variables,

namely the independent and dependent variables. The fitted model may then be used to

either just describe the relationship between the two groups of variables, or to predict new

values.

The two data matrices involved in regression are usually denoted X (independent,

predictors) and Y (dependent, responses), and the purpose of regression is to build a model

$Y = f(X)$. Such a model is used to explain, or predict, the variations in the Y-variable(s)

from the variations in the X-variable(s). The link between X and Y is achieved through a

common set of samples for which both X- and Y-values have been collected.

Names for X and Y

The X- and Y-variables can be denoted with a variety of terms, according to the particular

context (or culture). The most common ones are listed in the table below:

Usual names for X- and Y-variables

Context X Y

Univariate regression uses a single predictor to define a relationship with a response. The

classical example in chemistry is the Beer-Lambert law for spectroscopy, where a straight

line model is established to relate concentration to absorbance. In this case, physical sample

preparation is required to “clean the signal” to ensure that the relationship between

absorbance and concentration holds. However, in most practical applications a single

predictor is not sufficient to model a property precisely. The form of the model is described

by

$$y = b_0 + b_1 x + e$$

where $b_0$ is an intercept term, $b_1$ is a regression coefficient (in this case, the slope of the

straight line) and $e$ is the residual.
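A straight-line model of this kind can be fitted by least squares in a few lines. The sketch below uses the closed-form estimates for the slope and intercept; the function name `fit_line` and the absorbance and concentration numbers are hypothetical.

```python
def fit_line(x, y):
    """Least-squares estimates of the intercept b0 and slope b1 of y = b0 + b1*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Hypothetical calibration standards: absorbance readings (x) and
# known concentrations (y) that follow an exact straight line.
absorbance = [0.10, 0.20, 0.30, 0.40, 0.50]
concentration = [1.5, 2.5, 3.5, 4.5, 5.5]
b0, b1 = fit_line(absorbance, concentration)

# Predict the concentration of a new sample from its absorbance.
predicted = b0 + b1 * 0.35
```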

Multivariate regression takes into account several predictor variables, thus modeling the

property of interest with more accuracy. The form of the model is

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + e$$

where the terms in the equation are defined as usual. This chapter focuses on the general

principles of multivariate regression.

The whys and hows of regression modeling

Building a regression model involves collecting the predictors and the corresponding

response values for a set of samples, and then finding the optimal parameters in a

predefined mathematical relationship to the collected data. A commonly used measure of

optimality is the minimization of the sum of squares of the deviations between the

measured and predicted responses.

For example, in analytical chemistry, spectroscopic measurements are made on solutions

with known concentrations of a component of interest. Regression is then used to relate the

concentration of the component of interest to the spectrum.

Once a regression model has been built, it can be used to predict the unknown

concentration for new samples, using the spectroscopic measurements as predictors. The

advantage is obvious if the concentration is difficult or expensive to measure directly.

Replacement with the spectroscopic method is less expensive and in some cases, requires

minimal to no sample preparation. It also allows for development of spectroscopic

measurements for real-time process monitoring.

The most common motivations for developing regression models as predictive tools may

include:

Replacement of expensive or time-consuming analysis methods, with cheap, rapid,

easy-to-perform measurements (e.g. NIR spectroscopy, mass spectrometry for gas

analysis).

When one wants to build a response surface model from the results of some

experimental design, i.e. describe precisely the response levels according to the

values of a few controlled factors.

What is a good regression model?

The purpose of a regression model is to extract all the information relevant for the

prediction of the response from the available data.

Unfortunately, observed data usually contains some amount of noise and in some cases,

irrelevant information.

Noise can be random variation in the response due to experimental error, or it can be

random variation in the data values due to measurement error. It may also be some amount

of response variation due to factors which are not included in the model.

Irrelevant information is carried by predictors which have little or nothing to do with the

modeled phenomenon. For instance, NIR absorbance spectra may carry some information

relative to the solvent and not only to the compound of interest in developing a model to

predict the concentration of the compound in solution.

A good regression model should be able to:

Model only relevant information, by highly weighting these sources of information

and downweighting any irrelevant variation.

Avoid overfitting, i.e. distinguish between variation in the response (that can be

explained by variation in the predictors), and variation caused by mere noise.

Regression methods in The Unscrambler®

The Unscrambler® provides five regression method choices:

Multiple Linear Regression (MLR)

Principal Component Regression (PCR)

Partial Least Squares Regression (PLSR)

L-PLS Regression

Support Vector Machine Regression

MLR is a well-known statistical method based on ordinary least squares regression. It

estimates the model coefficients by the equation:

$$b = (X^T X)^{-1} X^T y$$

This operation involves a matrix inversion, which can be numerically unstable when there is

collinearity, that is when the variables are not linearly independent. Incidentally, this is the

reason why the predictors are called independent variables in MLR; the ability to vary

independently of each other is a crucial requirement to variables used as predictors with this

method. MLR requires more samples than predictors, since a system with more variables

than samples would not have a unique solution.

The Unscrambler® uses the QR decomposition to find the MLR solution. No missing values

are accepted.
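To illustrate how a QR decomposition avoids forming and inverting the cross-product matrix explicitly, here is a minimal NumPy sketch. It shows the general technique only; the function name `fit_mlr_qr` and the data are assumptions, and the details of the program's own routine are not known from this text.

```python
import numpy as np

def fit_mlr_qr(X, y):
    """Ordinary least squares with an intercept, solved via QR decomposition."""
    A = np.column_stack([np.ones(len(X)), X])  # add a column of ones for b0
    Q, R = np.linalg.qr(A)                     # A = Q R, with R upper triangular
    # Least-squares solution: solve R b = Q^T y instead of inverting A^T A.
    return np.linalg.solve(R, Q.T @ y)

# Hypothetical data generated exactly from y = 1 + 2*x1 - 3*x2.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 3.0]])
y = np.array([3.0, -2.0, 0.0, 2.0, -6.0])
b = fit_mlr_qr(X, y)   # b[0] is the intercept, b[1:] are the coefficients
```

If two predictor columns were exactly collinear, `R` would be singular and the solve step would fail, which mirrors the collinearity problem described above.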

More details about MLR regression can be found in the section Multiple Linear Regression

(MLR)

PCR is a two-step procedure which first decomposes the X-matrix by PCA, then fits an MLR

model, using the PCs instead of the original X-variables as predictors.
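The two steps can be sketched with NumPy as follows. This is an illustrative outline, not the program's algorithm: the function name `fit_pcr` and the data are assumptions, and PCA is computed here through a singular value decomposition of the centered X-matrix. The example data contain an exactly collinear third variable, a situation in which the normal equations of MLR cannot be solved but PCR with two PCs still works.

```python
import numpy as np

def fit_pcr(X, y, n_components):
    """Principal component regression: PCA on X, then MLR on the scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Step 1: PCA of the centered X-matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                  # loadings
    T = Xc @ P                               # scores
    # Step 2: MLR of y on the scores. The score vectors are orthogonal,
    # so each coefficient can be computed separately.
    q = (T.T @ yc) / np.sum(T ** 2, axis=0)
    b = P @ q                                # coefficients for the original variables
    b0 = y_mean - x_mean @ b
    return b0, b

# Hypothetical data: the third variable is the sum of the first two.
X = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0], [1.0, 3.0, 4.0]])
y = X[:, 0] + 2.0 * X[:, 1]
b0, b = fit_pcr(X, y, n_components=2)
y_fit = b0 + X @ b
```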

PCR procedure

More about PCR can be found in the help section Principal Component Regression (PCR)

More information about the PCR algorithm can be found in Method References.

Partial Least Squares regression (PLSR, sometimes referred to as Projection to Latent

Structures or simply PLS) models both the X- and Y-matrices simultaneously to find the

latent variables in X that will best predict the latent variables in Y. These PLSR components

are similar to principal components; however, they are referred to as factors.
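For a single Y-variable, the extraction of factors can be sketched with a NIPALS-style loop in NumPy. This is a textbook formulation given only as an illustration; the function name `fit_pls1` and the data are assumptions, and the exact algorithm variants used by the software are described in the Method References, not here.

```python
import numpy as np

def fit_pls1(X, y, n_factors):
    """PLS regression for one Y-variable (NIPALS-style sketch)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean            # centered X and y residuals
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = E.T @ f                          # weights: direction in X most related to y
        w /= np.linalg.norm(w)
        t = E @ w                            # scores for this factor
        tt = t @ t
        p = E.T @ t / tt                     # X-loadings
        c = (f @ t) / tt                     # y-loading
        E = E - np.outer(t, p)               # remove the modeled part from X
        f = f - c * t                        # remove the modeled part from y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)      # coefficients for the original variables
    b0 = y_mean - x_mean @ b
    return b0, b

# Hypothetical data with an exactly collinear third variable.
X = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0], [1.0, 3.0, 4.0]])
y = X[:, 0] + 2.0 * X[:, 1]
b0, b = fit_pls1(X, y, n_factors=2)
y_fit = b0 + X @ b
```

Note that, unlike PCA, the weight vector of each factor is computed from both X and y, which is why the factors are oriented toward predicting the response.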

PLSR procedure

More about PLS regression can be found in the help section Partial Least Squares Regression

(PLSR)

More details regarding the PLSR algorithm are given in the Method References.

Traditionally, science demanded that a one-to-one relationship between a cause and effect

existed; however, this tradition can hinder the study of more complex systems. Such systems

may be characterized by many-to-many relationships, which are often hidden in large tables

of data.

In some cases, the Y data may have descriptors of its columns, organized in a third table Z

(containing the same number of columns as in Y).

The three matrices X, Y and Z can together be visualized in the form of an L-shaped

arrangement. Such data analysis has potential widespread use in areas such as consumer

preference studies, medical diagnosis and spectroscopic applications.

More about L-PLS regression can be found in the help section L-PLS Regression

More details regarding the L-PLSR algorithm are given in the Method References.

Unlike the bilinear methods PCR and PLSR, Support Vector Machine Regression (SVMR) uses kernels to

transform non-linear systems into linear systems before the application of regression. This is

done by selecting an appropriate kernel and fine-tuning its parameters to achieve an

acceptable result (if such a result exists).

A simple diagrammatic representation of SVMR is provided below,

How SVMR Works

More about SVMR can be found in the help section Support Vector Machine Regression

(SVMR)

More details regarding the SVMR algorithm are given in the Method References.

All regression modeling must include some form of validation (i.e. testing) to make sure that

the results obtained can be applied to new data. This requires two separate steps in the

computation of a model, whether it be PCA, MLR, PCR, PLSR, etc.

Calibration

Modeling the relevant information in a set of data used as a training set.

Validation

Checking whether the model is capable of performing its task on a separate test set

of data.

Calibration is the fitting stage in the regression modeling process. The main data set,

containing only the calibration sample set, is used to compute the model parameters (PCs,

regression coefficients).

It is essential to validate models to get an idea of how well a regression model will perform

when it is used to predict new, unknown samples. A test set consisting of samples with

known response values is used. Only the X-values are fed into the model, from which

response values are predicted and compared to the known, actual response values. The

model is validated if the prediction residuals are low and there is no evidence of lack of fit in

the model.
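The comparison of predicted and known response values is commonly summarized as a root mean square error of prediction (RMSEP). The sketch below assumes a straight-line model whose coefficients were obtained earlier from a calibration set; the coefficient values, the function name `rmse` and the test-set numbers are hypothetical.

```python
import math

def rmse(y_ref, y_pred):
    """Root mean square error between reference and predicted values."""
    squared = [(r - p) ** 2 for r, p in zip(y_ref, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical model from a previous calibration step: y = b0 + b1 * x.
b0, b1 = 0.5, 10.0

# Hypothetical test set: X-values and known reference responses.
x_test = [0.15, 0.25, 0.45]
y_test = [2.1, 2.9, 5.0]

# Only the X-values are fed into the model.
y_pred = [b0 + b1 * x for x in x_test]
residuals = [r - p for r, p in zip(y_test, y_pred)]
rmsep = rmse(y_test, y_pred)
```

Applying the same function to the calibration samples gives the corresponding calibration error, which can be compared with the RMSEP.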

Each of the two steps described above requires its own set of samples; thus, the following

terms are used interchangeably: calibration samples = training samples, and validation

samples = test samples.

A more detailed description of validation techniques and their interpretation is to be found

in the chapter Validate a Model.

The main results of a regression analysis vary depending on the method used. They may be

roughly divided into two categories:

Diagnosis

results that are used to check the validity and quality of the model;

Interpretation

results that provide mechanistic insights into the relationship between X and Y, as

well as (for projection methods only) sample properties.

Note: Some results, e.g. scores, may be considered as belonging to both categories

(scores can help in the detection of outliers, and they also give information about

differences or similarities among samples).

The table below lists the various types of regression results computed in The Unscrambler®,

their application area (diagnosis or interpretation) and the regression method(s) for which

they are available.

Regression results available for each method (Appl.: I = interpretation, D = diagnosis)

Result           Appl.   MLR   PCR   PLSR
B-coefficients   I       X     X     X
Residuals        D       X     X     X
Error Measures   D       X     X     X
ANOVA            D       X

In short, all three regression methods give a model with an equation expressed by the

regression coefficients (b-coefficients), from which predicted Y-values are computed. For all

methods, residuals can be computed as the difference between predicted (fitted) values and

actual (observed) values; these residuals can then be combined into error measures that tell

how well a model performs.

PCR and PLSR, in addition to those standard results, provide powerful interpretation and

diagnostic tools linked to projection: more elaborate error measures, as well as scores and

loadings.

The simplicity of MLR, on the other hand, allows for simple significance testing of the model

with ANOVA and of the b-coefficients with a Student’s t-test (ANOVA will not be presented

hereafter; read more about it in the ANOVA section from Chapter “Analyze Results from

Designed Experiments”.) However, significance testing is also possible in PCR and PLSR, using

Martens’ Uncertainty Test.

B-coefficients

The regression model can be written

$$y = b_0 + b_1 x_1 + \dots + b_k x_k + e$$

meaning that the observed response values (Y) are approximated by a linear combination of

the values of the predictors (X). The coefficients of that combination are called regression

coefficients or B-coefficients.

Several diagnostic statistics are associated with the regression coefficients (available only for

MLR):

Standard error is a measure of the precision of the estimation of a coefficient;

From that, a Student’s t-value can be computed;

Comparing the t-value to a reference t-distribution will then yield a significance level or

p-value. This indicates whether the regression coefficient is significantly different

from 0. If the t-value is found to be nonsignificant this means that the regression coefficient

cannot be distinguished from 0.
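The chain from standard error to t-value can be sketched with NumPy using the classical ordinary least squares formulas. The function name `coefficient_t_values` and the data are assumptions; the conversion of a t-value into a p-value is left out because it requires the t-distribution.

```python
import numpy as np

def coefficient_t_values(X, y):
    """Return coefficients, their standard errors and t-values for an MLR model."""
    A = np.column_stack([np.ones(len(X)), X])      # intercept plus predictors
    n, k = A.shape
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ b
    sigma2 = residuals @ residuals / (n - k)       # residual variance estimate
    cov_b = sigma2 * np.linalg.inv(A.T @ A)        # covariance of the estimates
    se = np.sqrt(np.diag(cov_b))                   # standard errors
    return b, se, b / se                           # t-value = estimate / standard error

# Hypothetical data: y is roughly 2*x with a small alternating disturbance.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([2.1, 3.9, 6.1, 7.9, 10.1, 11.9])
b, se, t = coefficient_t_values(X, y)
```

In this example the slope has a very large t-value, while the t-value of the intercept is small, so the intercept cannot be distinguished from 0.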

Predicted Y-values

Predicted Y-values are computed for each sample by applying the model equation (i.e. the B-

coefficients) to new (or existing) observed X-values.

For PCR or PLSR models, the predicted Y-values can also be computed using projection along

the successive components of the model. This has the advantage of diagnosing samples

which are badly represented by the model, and therefore have high prediction uncertainty.

This is discussed more fully in the chapter Predictions.

Residuals

For each sample, the residual is the difference between the observed Y-value and the

predicted Y-value. It appears as the term e in the model equation.

More generally, residuals may also be computed for each fitting operation in a projection

model: thus the samples have X- and Y-residuals along each PC (factor) in PCR and PLSR

models. Read more about how sample and variable residuals are computed in the chapter

More Details About the Theory of PCA.

In PCR and PLSR models, scores and loadings express how the samples and variables are

projected along the model components.

PCR uses the same scores and loadings as PCA, since PCA is used in the decomposition of X. Y

is then projected onto the “plane” defined by the MLR equation, and no extra scores or

loadings are required to express this operation.

Read more about PCA scores and loadings in Chapters PCA and How to Interpret PCA Scores

and Loadings. PCR and PLSR scores and loadings are presented in the relevant sections for

these topics.

L-PLSR is further described in the method section on this topic: L-PLSR.

It may be somewhat confusing to have a choice between three different methods that

apparently solve the same problem, i.e. fit a model in order to approximate Y as a linear

function of X.

The sections that follow provide a comparison of the three methods and may aid in selecting

the one which is best suited to specific analysis objectives.

MLR has the following properties and behavior:

The number of X-variables must be smaller than the number of samples;

In case of collinearity among X-variables, the b-coefficients are not reliable and the

model may be unstable;

MLR tends to overfit when noisy data are used.

PCR and PLSR are projection methods, like PCA.

Model components are extracted in such a way that the first PC/factor explains the largest

amount of variation, followed by the second PC/factor, etc. At a certain point, the variation

modeled by any new PC/factor is mostly noise. The optimal number of PCs/factors -

modeling useful information, but avoiding overfitting - is determined with the help of the

residual variances.

PCR uses MLR in the regression step; a PCR model using all PCs gives the same solution as

MLR (as does a PLSR model using all factors).
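The equivalence between MLR and a PCR model that uses all PCs can be checked numerically. The NumPy sketch below uses hypothetical data with more samples than variables and no collinearity, so that both solutions exist; the variable names are assumptions made for the example.

```python
import numpy as np

# Hypothetical data: five samples, two non-collinear X-variables.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 3.0]])
y = np.array([3.1, -2.2, 0.1, 2.0, -5.9])

# MLR: ordinary least squares with an intercept.
A = np.column_stack([np.ones(len(X)), X])
b_mlr, *_ = np.linalg.lstsq(A, y, rcond=None)

# PCR using all PCs: PCA of centered X, then regression of y on all scores.
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt.T                                   # all loadings
T = Xc @ P                                 # all scores
q = (T.T @ yc) / np.sum(T ** 2, axis=0)    # regression on orthogonal scores
b_pcr = P @ q
b0_pcr = y_mean - x_mean @ b_pcr

same_slopes = np.allclose(b_mlr[1:], b_pcr)
same_intercept = np.isclose(b_mlr[0], b0_pcr)
```

Dropping the last PCs is what makes PCR differ from MLR: the directions with the smallest variance, which are often dominated by noise, are then excluded from the regression step.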

If one were to run MLR, PCR and PLSR on the same data, their performance could be

compared by checking validation errors (Predicted vs. Measured Y-values for validation

samples, RMSEP).

It should also be noted that both MLR and PCR can only model one Y-variable at a time.

The difference between PCR and PLSR lies in the algorithm. PLSR uses the information lying

in both X and Y to fit the model, switching between X and Y iteratively to find the relevant

factors. So PLSR often needs fewer factors to reach the optimal solution because the focus is

on the prediction of the Y-variables (not on achieving the best projection of X as in PCA).

SVMR is a special class of regression that is very distinct from all of the methods described

above. SVMR uses kernels to map variable space to feature space in order to minimise

particular errors associated with the calibration development. This is done by

Selecting a specific kernel function that is capable of mapping the variable space.

Fine tuning the parameters of the chosen function such that the best calibration and

prediction statistics are achieved.

SVMR provides the least graphical output and fewest diagnostic statistics of all the regression

methods implemented in The Unscrambler®, so developing robust models can often be difficult

for the user. However, when they work, SVMR models are much better able to

handle non-linearities than MLR/PCR/PLSR models and can provide an alternative method to

Artificial Neural Networks (ANN).

If there is more than one Y-variable, PLSR is usually the best method if the objective is to

interpret all variables simultaneously. It is often argued that modeling one Y-variable at a time (with PLSR or PCR) gives better

prediction ability. This is usually true if there are strong nonlinearities in the data, in which

case modeling each Y-variable separately according to its own nonlinear features might

perform better than trying to build a common model for all Ys. On the other hand, if the Y-

variables are somewhat noisy, but strongly correlated, PLSR is the best way to model the

whole information and minimize the influence of noise.

The difference between PLSR and PCR in prediction error is usually quite small, but PLSR will

usually give results comparable to PCR results using fewer components.

MLR should only be used if the number of X-variables is low (around 20 or less) and there

are only small correlations among them.

Formal tests of significance for the regression coefficients are well-known and accepted for

MLR. If using PCR or PLSR, one can check the stability of the results and the significance of

the regression coefficients with Martens’ Uncertainty Test.

SVMR should be considered when it is known a priori that non-linearity will affect the

system and attempts should be made to find a kernel function that best handles this.

Once a regression model is built, one needs to diagnose it, i.e. assess its quality, before

interpreting the relationship between X and Y. Finally, the model will be ready for use for

prediction once it has been thoroughly checked and refined.

The various types of results from MLR, PCR and PLS regression models and more information

about the interpretation of projection results (scores and loadings) and variance curves for

PCR and PLSR can be found in the corresponding chapters covering each method.

How to detect nonlinearities (lack of fit)

Different types of residual plots can be used to detect nonlinearities or lack of fit. If the

model is good, the residuals should be randomly distributed, and these plots should be free

from systematic trends.

The most useful residual plots are the Y-residuals vs. predicted Y and Y-residuals vs. scores

plots. Variable residuals and Normal Probability Plots can also be useful.

The PLSR X-Y Relation Outliers plot is also a powerful tool to detect nonlinearities, since it

shows the shape of the relationship between X and Y along one specific model factor.

What are outliers and how are they detected?

An outlier is an object which deviates from the other objects in a model and may not belong

to the same population as the majority and therefore can disturb the model.

The cause of outliers could be one or more of the following:

Measurement error

Wrong labeling

Noise

Extreme / interesting sample

For projection methods like PCA, PCR and PLSR, outliers can be detected using scores plots,

residuals, leverages and influence plots.

Outliers in regression

In regression, there are many ways for a sample to be classified as an outlier. It may be

outlying according to the X-variables only, or to the Y-variables only, or to both. It may also

not be an outlier for either separate set of variables, but become an outlier when one

considers the (X,Y) relationship. In the latter case, the X-Y Relation Outliers plot (only

available for PLSR) is a very powerful tool showing the (X,Y) relationship and how well the

data points fit into it.

Use of residuals to detect outliers

One can use the residuals in several ways. For instance, first use a plot of residual variance

per sample, then use a variable residual plot for the samples that show a large squared residual

in the first plot. The first of the two plots indicates samples with outlying

variables, while the second is used for a detailed study of each of these samples. In both

cases, points located far from the zero line indicate outlying samples or variables.

Use of leverages to detect outliers

The leverages are usually plotted vs. sample number. Samples showing a much larger

leverage than the rest of the samples may be outliers and may have had a strong influence

on the model, which should be avoided.

For calibration samples, it is also natural to use an influence plot. This is a plot of squared

residuals (either X or Y) vs. leverages. Samples with both large residuals and large leverage

can then be detected. These are the samples with the strongest influence on the model, and

may disturb (influence) the model towards themselves.
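
The two quantities of an influence plot can be sketched in Python with NumPy. This is a minimal illustration that assumes a PCA model computed by SVD and a commonly used leverage definition for centered projection models; it is not taken from The Unscrambler® itself, and the data are randomly generated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 samples, 5 variables, with sample 0 made extreme.
X = rng.normal(size=(20, 5))
X[0] += 8.0

# Mean-center and compute PCA scores for A components via SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 2
T = U[:, :A] * s[:A]            # scores, shape (20, A)
P = Vt[:A].T                    # loadings, shape (5, A)

# Leverage: 1/n plus each sample's squared scores, scaled per component
# by the sum of squared scores of that component.
n = X.shape[0]
leverage = 1.0 / n + np.sum(T ** 2 / np.sum(T ** 2, axis=0), axis=1)

# Squared X-residual per sample: what the A components leave unexplained.
E = Xc - T @ P.T
sq_residual = np.sum(E ** 2, axis=1)

# An influence plot would show sq_residual against leverage.
most_influential = int(np.argmax(leverage))
print(most_influential, leverage[most_influential], sq_residual[most_influential])
```

In this sketch the extreme sample dominates the first component, so it has high leverage. A sample that also had a large residual would appear in the upper right corner of an influence plot.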

The features of two plots can be utilized by plotting influence and Y-residuals vs. predicted Y

together. Some example plots are shown below:

Scores plot showing a gross outlier

The information described in this chapter so far has presented the basics of calibration. The

following steps and useful functions may be used as a guideline for the development of

spectroscopic calibration models.

Read data

File - Open or File - Import Data. Data can be imported from many vendor

instrument formats — directly or via e.g. JCAMP-DX, GRAMS SPC or ASCII.

See full details on compatible formats in the chapter on Importing data

View and prepare data

View data as a spreadsheet in the Editor, define sets using the Define Range option.

Select some samples and Plot - Line or Matrix to get an overview of the spectra

(data plot). Histograms of Y-variables are also useful to assess the spread of the data

for calibration. 3-D scatter plots can be used as an initial assessment of any

covariance between numerous constituents, if there are several present in the

analysis. All of these plots can be helpful in detecting outliers, or possible errors in

the data.

Note: It is advisable to aim for a boxcar (uniform) distribution of Y-values, as this provides the

most even coverage of the region of interest.

Preprocess (transform the data)

Tasks - Transform… allows for spectroscopic transformations, derivation,

smoothing, etc. Tasks - Transform - Reduce (Average) may also be useful when

replicates have been measured, or variable reduction is required. The Preview Result

option in the transform dialog provides a graphical preview of the spectral data, which is

updated in real time as transform parameters are changed.

Statistics

Tasks - Analyze - Descriptive Statistics… may be used to reveal scatter effects and

for visually detecting large changes in specific wavelength regions. Use the Scatter

option to reveal potential scatter effects before the application of transforms such

as Multiplicative Scatter Correction (MSC).
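
For readers unfamiliar with MSC, the following Python sketch shows the textbook form of the correction: each spectrum is regressed on a reference spectrum (here the mean spectrum) and the fitted offset and slope are removed. The function name msc and the example numbers are hypothetical, and the implementation in The Unscrambler® may differ in detail.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative Scatter Correction of the rows of `spectra`."""
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Fit row = offset + slope * reference by least squares.
        slope, offset = np.polyfit(reference, row, deg=1)
        corrected[i] = (row - offset) / slope
    return corrected, reference

# Hypothetical example: one underlying spectrum seen with three scatter levels.
base = np.array([0.10, 0.40, 0.90, 0.50, 0.20])
raw = np.vstack([1.0 * base + 0.00,
                 1.3 * base + 0.05,
                 0.8 * base - 0.02])
corrected, reference = msc(raw)
print(np.round(corrected, 3))
```

Because the three rows differ only by an offset and a scale factor, the correction maps all of them onto the reference spectrum. Returning the reference makes it possible to pass the same reference in when correcting new spectra.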

Select samples

The Edit - Mark option is useful for selecting a more balanced subset of a large

data set using PCA, PCR or PLSR scores. This can be applied to either the spectra or

the constituents (if more than one component is being analyzed). Mark samples that

span all the important components (samples far away from the origin, including the

extremes when selecting calibration samples). Use the Create Range option to

extract marked samples as a new row set in the project navigator.

Reduce spectra

Use the Tasks - Transform - Reduce (Average)… option to reduce spectra of high

data point spacings (being careful not to lose resolution) to fewer data points, or

average out replicate spectra in a data set.
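
What averaging along rows or columns amounts to can be sketched in Python with NumPy. The function reduce_average and its parameters are hypothetical names for this illustration. Dropping an incomplete trailing group is a simplification chosen for the sketch, not documented behavior of the software.

```python
import numpy as np

def reduce_average(data, factor, axis):
    """Average every `factor` adjacent rows (axis=0) or columns (axis=1).

    Trailing rows/columns that do not fill a complete group are dropped;
    this is a simplification chosen for the sketch.
    """
    data = np.asarray(data, dtype=float)
    if axis == 1:
        return reduce_average(data.T, factor, axis=0).T
    n_groups = data.shape[0] // factor
    trimmed = data[: n_groups * factor]
    return trimmed.reshape(n_groups, factor, -1).mean(axis=1)

spectra = np.arange(24, dtype=float).reshape(4, 6)   # 4 spectra, 6 data points
fewer_points = reduce_average(spectra, factor=2, axis=1)   # shape (4, 3)
fewer_samples = reduce_average(spectra, factor=2, axis=0)  # shape (2, 6)
print(fewer_points.shape, fewer_samples.shape)
```

Averaging along columns reduces the number of data points per spectrum, while averaging along rows merges adjacent replicate spectra.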

Make a first calibration model and look for outliers

Tasks - Analyze - Partial Least Squares Regression… with more than one response

variable (Y) gives a simple overview for several constituents. Otherwise run PLSR

with a single response, or PCR or MLR, which use only a single response. View the

results, especially Variance plots, Scores and Predicted vs. Reference plots. Use Edit -

Mark (also available as right mouse button option) to mark suspicious samples in the

scores plots. Use Plot - Sample outliers and XY Relation outliers to investigate

potential outliers.

Refine the Model

After marking samples one can go to the analysis (i.e. PLSR) node in the project

navigator and right click to select Recalculate - Without Marked, which allows the

calculation of a new model with the marked samples removed. Compare results, and

look for additional outliers. Repeat this process if necessary.

Study the model in detail

Plot the results, including Variances and RMSEP - RMSE, Important variables,

Predicted vs. reference and loadings, as these are useful tools for assessing model

quality. View the regression lines and statistics in the predicted vs. reference plot, as

these are helpful for assessing the model fit. Highlight samples in scores plots by

groups using the Sample grouping available as a right mouse button option, for

investigating interesting patterns in the data. View the loadings as line plots and see

if the variables of importance coincide with the spectral regions related to the

property being measured.

Delete variables (wavelengths).

From the Important variables plot the Edit - Mark option can be used to define

ranges in the spectra that are not important (potentially due to noise). Use the

Recalculate - Without Marked option to generate a new model based on fewer

wavelengths. Apply the Uncertainty test during PLS regression to aid in the

identification of important variables for modeling.

Validation

It is essential to ensure that a developed model is properly validated using a suitable

validation method (cross validation or test set validation). Cross validation can be set

up to look at the effect of removing an entire set of replicates from an analysis or

single replicates can be removed to test the predictive ability of the model for single

replicates.
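
The difference between leaving out single replicates and leaving out a whole set of replicates comes down to how rows are grouped into validation segments. The Python sketch below builds segments that hold out all replicates of one sample at a time. The function name and the sample labels are hypothetical, and this is not the cross validation setup dialog of The Unscrambler®.

```python
from collections import defaultdict

def replicate_segments(sample_ids):
    """Yield (calibration_rows, validation_rows) with one whole sample left out.

    `sample_ids` gives, for each row, the physical sample it was measured on,
    so all replicates of a sample share an id.
    """
    rows_by_sample = defaultdict(list)
    for row, sample in enumerate(sample_ids):
        rows_by_sample[sample].append(row)
    for sample, held_out in rows_by_sample.items():
        kept = [r for r in range(len(sample_ids)) if r not in held_out]
        yield kept, held_out

# Hypothetical data set: three samples measured in duplicate.
ids = ["A", "A", "B", "B", "C", "C"]
segments = list(replicate_segments(ids))
for kept, held_out in segments:
    print("calibrate on", kept, "validate on", held_out)
```

Keeping replicates together prevents a replicate in the calibration set from making the prediction of its held-out twin look better than it would be for a genuinely new sample.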

Access to results

All of the models that have been created in a project are stored as analysis nodes in

that project and can be accessed from the project navigator. The Save Model option

can be accessed by right clicking on an analysis node, allowing one to save the model

as a file independent of the project. This allows only the models, and not the entire project, to be

shared with others. The models can be used in real time via The

Unscrambler® Process Pulse, and with The Unscrambler® Predictor/Classifier

(OLUP/OLUC). It is also the way The Unscrambler® Online Predictor/Classifier will

use models for online and 3rd party applications. More on this is discussed in the

Instrument Compatibility section below.

Detailed information about the model is stored in the results and validation folders under a

particular analysis node. A summary is available in the Info box in the lower left part of the

display, when the model name is highlighted.

Predict new samples

Tasks - Predict - Regression… is used to predict Y-values for new unknown samples

from spectra. If new samples have known reference values available, these can be used

in the Predict option to assess the quality of new predictions during the validation

stage of model development. The prediction also provides the uncertainty of the

measurements and additional statistics to show the similarity of the prediction

samples to the calibration samples. Reproducibility can also be assessed in terms of

samples measured on different instruments, or from different manufacturing sites,

etc., by applying a model developed on one spectrometer to spectra scanned on

another instrument. Remember to preprocess new samples in the same way as the

original calibration samples used to develop the model (which can readily be done

using Autopretreatments).

Check the robustness of calibration models

By using Tasks - Transform - Noise various amounts of additive or multiplicative

noise can be added to new samples to see how sensitive the model is to small

changes. In the project navigator, under the Validation folder, the Prediction

Diagnostics matrix is available for regression methods. Assess the numerical values

of all results, checking that bias is close to 0 and slope is close to 1. Otherwise there

may be a need to apply a slope and bias adjustment to the predicted Y-values (e.g. when there are

systematic differences in the reference values from another laboratory). SEPcorr

provides a bias corrected SEP value, i.e. the expected prediction error in the absence

of systematic bias.
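
The relation between bias, slope, RMSEP and a bias-corrected SEP can be sketched in Python as follows. The formulas are the conventional textbook definitions (bias as the mean prediction error, SEP as the standard deviation of the errors around the bias) and are assumed here; the exact contents of the Prediction Diagnostics matrix are not reproduced. All names and numbers are hypothetical.

```python
import math

def prediction_diagnostics(reference, predicted):
    """Return slope, offset, bias, RMSEP and bias-corrected SEP."""
    n = len(reference)
    errors = [p - r for p, r in zip(predicted, reference)]
    bias = sum(errors) / n
    rmsep = math.sqrt(sum(e ** 2 for e in errors) / n)
    sep = math.sqrt(sum((e - bias) ** 2 for e in errors) / (n - 1))
    # Regression line of predicted on reference values.
    mr, mp = sum(reference) / n, sum(predicted) / n
    slope = (sum((r - mr) * (p - mp) for r, p in zip(reference, predicted))
             / sum((r - mr) ** 2 for r in reference))
    offset = mp - slope * mr
    return {"slope": slope, "offset": offset, "bias": bias,
            "rmsep": rmsep, "sep": sep}

# Hypothetical predictions that are all 0.5 units too high.
reference = [10.0, 12.0, 14.0, 16.0, 18.0]
predicted = [r + 0.5 for r in reference]
stats = prediction_diagnostics(reference, predicted)
print(stats)
```

In this example the whole prediction error is systematic: the slope is 1, the bias and RMSEP are both 0.5, and the bias-corrected SEP is 0.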

Audit Trails

The Tools-Audit Trail… option provides a non-editable record of all imports, analyses

and manipulations made to a project. It is especially useful in regulated

environments requiring compliance with 21 CFR Part 11. All saves and project entries

are also recorded in the audit trail.

When predictive models have been optimized to meet certain desirable criteria, i.e. the

predictive ability on new samples is satisfactory, these models may be used in third party or

The Unscrambler® based applications, such as The Unscrambler® Online Predictor/ Classifier

and The Unscrambler® X Process Pulse.

Instrument compatibility

Some instrument vendors (for example Perten, Brimrose, Guided Wave, Foss NIRSystems,

Thermo, etc.) make use of The Unscrambler® Online Predictor/ Classifier software available

for integration of The Unscrambler® models into third party systems. These packages are

DLL-based programs that are incorporated into the instrument software, allowing the use of

The Unscrambler® predictive or classification models on the data, providing the model

results to the instrument interface for either graphical or numerical display when a new

(spectral) measurement is made. Visit http://www.camo.com/ for more information on

these applications.

The Unscrambler® X uses the Save Model option to save predictive or classification models

as separate files from a project. The Unscrambler® Generation X family of online software

uses these model files directly for applications. The Unscrambler® X is backward compatible

for use in previous versions of The Unscrambler® Online Predictor and Classifier (back to

version 9.2). Use the File-Export-Unscrambler option to export model files for use in these

previous versions. This option will allow users to save data or models for backward

compatibility. Contact CAMO for this plug-in option.

Some instrument software can read the B vector (regression coefficients). Use File - Export -

ASCII… or JCAMP-DX for this. Alternatively, use File - Export - ASCII MOD…, a simple file format

containing all information necessary to make predictions, either using full PLSR or PCR

models, or just the B vector. It can be used with user-defined conversion routines.
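
A user-defined routine that predicts from a B vector can be as small as the Python sketch below. It assumes the regression coefficients and the intercept (here called b0) have already been read from an exported file. The layout of that file is not shown, and all names and numbers are hypothetical.

```python
def predict_from_b(spectrum, b, b0):
    """Predict y = b0 + sum_k b[k] * spectrum[k] for one sample."""
    if len(spectrum) != len(b):
        raise ValueError("spectrum and B vector must have the same length")
    return b0 + sum(bk * xk for bk, xk in zip(b, spectrum))

# Hypothetical three-wavelength model.
b0 = 1.5
b = [0.2, -0.1, 0.4]
new_spectrum = [1.0, 2.0, 3.0]
y_hat = predict_from_b(new_spectrum, b, b0)
print(y_hat)
```

The new spectrum must be preprocessed in the same way as the calibration spectra before the coefficients are applied.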

Use The Unscrambler® to develop models for instruments that do not support The

Unscrambler® Online Predictor/Classifier

If an instrument vendor's software does not support models developed in The Unscrambler®,

import the instrument data in a common format, e.g. ASCII, Excel, JCAMP, etc.,

and develop a model using the powerful diagnostic and algorithmic capabilities. Use

this model to select appropriate calibration and validation samples, determine the

optimal PCs/factors to use and match the preprocessing to the options available in

the vendor software. Redevelop the model in the vendor’s software and compare

the two results. This will provide added assurance that the developed model is

robust and performs as required.

The various residuals and error measures are available for each PC in PCR and PLSR,

while for MLR there is only one of each type

There are two types of scores and loadings in PLSR, only one in PCR

Watch this video to become familiar with the new user interface in The Unscrambler® X.

The video provides a guided tour of some of the basic operations in the software

application. This will show the project-based structure of The Unscrambler®, how to import,

view and analyze data. The video gives an overview of using the project navigator, which

incorporates raw data, transformed data, and all the results of analysis within a given project.

Note: This video was created using The Unscrambler® X version 10.0. The current

version of the software has a slightly different look and feel and even more

functionality.

An Internet connection and Adobe Flash Player are required to play the video.

4. Application Framework

4.1. User interface basics

The purpose of this chapter is to give the user an overall introduction to the principles used

in The Unscrambler®. A short overview of The Unscrambler® user interface and workplace is

provided in this section, covering the various menu options, and the data organization

environment:

Matrix editor

Project navigator

Menu walk-through:

File

Edit

View

Insert

Plot

Tasks

Tools

Help

File

Export >

Print…

Edit

Go to…

Change data type – Category…

Define range…

Group rows…

Sample grouping…

Insert

Data matrix…

Duplicate matrix…

Custom layout…

Tools

Matrix calculator…

Report generator…

Audit trail…

Options…

Help

Modify license…

User setup…

This section introduces terminology related to the user interface in The Unscrambler®. It is

assumed that the user is already familiar with the operating system of their computer.

Application window

Workspace

Editor

Viewer

Project navigator

Project information

Page tab bar

The menu bar

The toolbar

The status bar

Dialogs

Setting up the user environment

Getting help

The application window layout is composed to give an overview of the work currently being

done.

The below screenshot shows the application with its menu bar, toolbar, the project

navigator and project information panes on the left, the workspace in editor mode

(there is also a viewer mode), and the page tab bar below it. The status bar at the bottom

shows a summary of the selected content and status while The Unscrambler® is calculating.

The Unscrambler® main window

4.2.2 Workspace

The Workspace occupies the largest area of the application window, containing either a

table view of a data set, called the Editor, or a Viewer which displays results either

graphically as plots or numerically as tables.

Editor

The Editor presents a data table that may or may not be modified depending on its

protection status:

If a table can be edited, it is possible to:

Type in values.

Change the column and row headers.

Create ranges.

Viewer

In the Viewer, data and results are visualized graphically in an interactive manner.

Whenever data are plotted, the plot appears in a Viewer. Every time the Viewer is

mentioned throughout this manual and help system, it refers to a window where a plot is

displayed.

The information in the viewer can come from:

Plotting raw data from the editor: either for a data matrix or a matrix from a result.

Displaying predefined plots.

Custom layout.

To learn more about working in this mode, please refer to the chapter on plotting data.

The project navigator is a tree-like structure consisting of data matrices and analysis results

along with plots.

All raw and modified data sets along with different analysis results and plots can be stored

as a single project. One can toggle between different data sets and analysis results just by

selection.

The Project information pane, found in the lower left corner of the display has two tabs:

Info and Notes.

Info

Includes details about the currently selected item in the project navigator, such as

the matrix or model name, matrix shape, creation time and type of input,

parameters used for output matrices, plots and results.

Notes

Annotations are saved in notes.

More information about a project is found in the audit trail.

At the bottom of the Workspace there is a list of recent views. It acts as a “breadcrumb trail”

of what has been viewed recently.

When reopening a file, only the most recently active view will be available.

By right clicking on a tab and selecting Pop out, the item becomes a separate window, that

can be moved around and placed as a side-by-side view.

It is also possible to close the current tab, all other tabs or all tabs via this menu.

All operations in The Unscrambler® are performed with the help of the menus and options

available in the menus.

Available menu actions will change depending on context: Editor or Viewer mode, or the

currently selected plots. Some submenus and options may be invalid in a given context;

these are grayed out.

Context-sensitive menus

The Unscrambler® also features so-called context-sensitive menus. These can be accessed by

clicking the right mouse button while the cursor rests on the area on which an operation is

performed. The context-sensitive menus are a kind of shortcut, as they contain only the

options which are valid for the selected area, which will save a user the work of having to

click through all the menus on the Menu bar.

The Toolbar buttons provide shortcuts to the most frequently used commands. When the

mouse cursor is rested on a toolbar button, a short tooltip explanation of its function

appears.

The Status bar at the bottom of the screen displays concise information including:

Short explanation of the current menu option.

The size of the data table.

4.2.9 Dialogs

The Unscrambler® aims to aid the user through dialogs that provide detailed instructions to

the application.

When working in The Unscrambler® the user will often have to enter information or make

choices in order to be able to complete an analysis. This includes activities such as specifying

the names of data matrices/files to work with, the data sets to analyze, how many PCs to

compute, or the type of validation methods to choose. This is done in dialogs, which will

normally look something like the one pictured below.

The Unscrambler® dialog

This particular dialog is the one associated with running a Principal Component Analysis on

data. Items that are predefined, such as rows/samples, columns/variables, etc. are selected

from a drop-down list. Options which are mutually exclusive are selected via radio buttons.

The settings for many of the analysis dialogs will be remembered from the last time the

dialog was open.

Any dialog can also be canceled by pressing the Esc (escape) key on the keyboard. Ongoing

calculations can also be aborted by pressing Esc.

The Unscrambler® provides user authentication to offer traceability required by regulations.

See the documentation for the Login dialog for how to make use of this facility, and set up a

user.

The look and feel of the workspace can be customized. See the documentation for the Tools

– Options… dialog for more information.

Documentation for currently open dialogs can be accessed by pressing F1, or by using the ?

button near the top right corner of the active dialog window.

See How to use help and the Help menu for more details.

This is an introduction to the matrix editor.

What is a matrix?

Matrix structure

Samples and variables

Adding data matrices

Manually

Drag and drop from other applications

Altering data tables

Using ranges

Create ranges to organize subsets

Superimposed ranges

Storing data as separate matrices

Data types

Possible data types

Converting data types

Keeping versions of data

Saving data

A matrix is a rectangular table of numbers.

The horizontal lines in a matrix are called rows and the vertical lines are called columns. A

matrix with m rows and n columns is called an m-by-n matrix (or m×n matrix) and m and n

are called its dimensions.

The places in the matrix where the numbers are, are called entries. The entry of a matrix A

that lies in the row number i and column number j is called the i,j entry of A. This is written

as $A_{i,j}$ or $a_{ij}$.

Matrix structure

The matrix A with M rows and N columns is defined as A(M,N) and can be represented as

shown below.

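$$
A = \begin{pmatrix}
A_{11} & A_{12} & A_{13} & \cdots & A_{1N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
A_{M1} & A_{M2} & A_{M3} & \cdots & A_{MN}
\end{pmatrix}
$$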

Matrices consisting of only one column or row are called vectors, while higher-dimensional,

e.g. three-dimensional, arrays of numbers are called tensors. Matrices can be added and

subtracted entry wise, and multiplied according to a rule corresponding to composition of

linear transformations. For more details on the operations possible with matrices, see the

Matrix calculator.
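
These definitions map directly onto array operations in most numerical tools. The Python sketch below uses NumPy purely as an illustration; it is unrelated to the Matrix calculator of The Unscrambler®, and the matrices are hypothetical.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # M = 2 rows, N = 3 columns
B = np.ones((2, 3))

print(A.shape)                # the dimensions (M, N)

# The text counts rows and columns from 1; NumPy counts from 0,
# so the 2,3 entry of A in the text is A[1, 2] here.
entry_2_3 = A[1, 2]

row_vector = A[0, :]          # a single row is a vector
total = A + B                 # entry-wise addition
product = A @ A.T             # matrix product, shape (2, 2)
print(entry_2_3, product)
```

Addition requires matrices of the same dimensions, whereas the matrix product requires the number of columns of the first factor to equal the number of rows of the second.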

Samples and variables

A matrix represents the values associated to samples and variables. An entry corresponds to

the value of a specific sample for a specific variable. The general way of presenting data in a

matrix is to place the samples in rows and the variables in columns.

Variable 1 Variable 2 Variable 3 … Variable N

… … … … … …

To create a data table in The Unscrambler®, there are three options:

Create a design table

Import data

See insert matrix dialog box for more information on how to create a blank table, fill it with

data and rename it.

Manually

Enter data manually into a matrix by simply typing while an entry is focused, double clicking

on a specific entry, or pressing F2 and entering the value. This operation can be done for the

data table as well as the sample and variable name.

Category entries have a drop-down list, allowing the user to select one of the levels already

used. A value can also be typed in directly; typing a value that is not yet in the list adds a new level.

Date-time entries have a calendar pop-out, allowing the user to pick a date from it.

Drag and drop from other applications

Data can be copied from any application, e.g. Microsoft Excel, to The Unscrambler® by either

drag and drop, or by copy and paste.

Files can also be dragged from the file manager onto The Unscrambler® application window.

The window title bar is a good drop target.

It is possible to move focus between entries using the arrow keys. Hold shift to select a

range of entries.

Press Del to delete the contents of an entry.

Use Ctrl or Shift when clicking on row or column index numbers to select more than one row

or column: Ctrl+click will add the clicked index to the selection, while Shift+click will add all

rows and columns up to the clicked index.

Columns and rows can be moved by selecting them and grabbing the selection border. Drag

and release the mouse button on the target column or row where it will be moved.

Hold the Ctrl key while doing this to make a copy of the selected column or row.

When collecting data, one may gather information on a sample from different sources, for

example a spectrum and some chemical measurements, or some process data and some

quality measurements.

In the same way one may have several types of samples: the ones that will be used for

model calibration and the ones to be used to validate the model.

There are different options to store the data in The Unscrambler®: either collect the

information in the same data table or use different matrices within the same project.

Create ranges to organize subsets

It is often useful to create subsets of either samples or variables to make them easily

accessible from the different plotting and analysis dialogs. This is done by defining ranges. A

quick way to start is to select a part of a data table and right click to select the option Create

Range.

The created range will be displayed in the project navigator and can be renamed to allow for

easier identification later. The color box next to the range node connects the range visually

to the corresponding entries in the matrix editor.

Each subset of the matrix will be displayed separately in the matrix editor by selecting a

range in the project navigator.

More sophisticated options for working with ranges are available in the Define Range or

Scalar and Vector dialogs.

When ranges have been created in a matrix, they can be copied to another matrix of the

same dimensions. Right click on the matrix node in the project navigator and select Range -

Copy Range. The right-click option Range - Paste Range can be used to apply the same

ranges to a new matrix of the same dimensions (rows or columns).

Superimposed ranges

A region comprises a row range and a column range, thus selecting entries spanning multiple

rows and columns will result in two ranges, one for each axis.

These ranges are independent of each other and can be used in conjunction with any other

range.

The above case is typical when creating two sets of variables: X (predictors) and Y (responses),

and two sets of samples for calibration and validation.
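
The way independent row and column ranges combine can be sketched in Python with NumPy. The range names and index lists below are hypothetical and only mimic the idea; they do not reflect how The Unscrambler® stores ranges internally.

```python
import numpy as np

data = np.arange(30, dtype=float).reshape(6, 5)   # 6 samples, 5 variables

# Hypothetical ranges, stored as index lists per axis.
row_ranges = {"Calibration": [0, 1, 2, 3], "Validation": [4, 5]}
col_ranges = {"X": [0, 1, 2, 3], "Y": [4]}

def select(matrix, rows, cols):
    """Return the block where a row range and a column range intersect."""
    return matrix[np.ix_(row_ranges[rows], col_ranges[cols])]

X_cal = select(data, "Calibration", "X")   # shape (4, 4)
Y_val = select(data, "Validation", "Y")    # shape (2, 1)
print(X_cal.shape, Y_val.shape)
```

Because the row and column ranges are independent, any row range can be paired with any column range, which gives the four blocks needed for calibration and validation of an X-Y model.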

Storing data as separate matrices

In The Unscrambler® one can use different matrices in the analysis as long as they are

compatible in size and stored in the same project.

Hence one can store data in several matrices that will appear in the project navigator as

illustrated below:

Possible data types

Variables (columns) can have one of four available data types:

Numerical

A numerical variable is one that has numbers as values.

Category

A category variable is one that has two or more category levels. There is no intrinsic

ordering required and no distinction between nominal (e.g. male or female) and

ordinal (e.g. high or low) categories.

It is recommended to use words to label category levels to give each level meaning,

such as “High” or “Low”.

Categories are stored as text, and each level is assigned an index. Use View – Level Indices

to display the integer value assigned to each level.

Category variables are kept out of calculations.

Text

Each value is a text string.

International characters are supported. The encoding used internally is UTF-8.

Maximum text length is 256 characters.

Text columns are kept out of calculations.

Date-time

Each entry is a date-and/or-time.

The displayed date format can be customized, see Tools – Options… menu.

Date-time variables are kept out of calculations.

In the matrix editor these are given colors to make it easy to identify different types of

variables.

Visualization of data types in the matrix editor

Data type Background Color Alignment

Numerical Right

Date-time Left

The data type of one or several variables can be changed by selecting them and using the

option Change data type in the Edit menu. Select one of the available data types from the

menu.

When working with data, it is advisable to always keep the raw data unaltered; this is

required for traceability and verification. Keep in mind that when a transform is applied to a

data matrix, a new matrix is created in the project, maintaining the original data matrix. At

appropriate steps in a workflow, use the option Insert – Duplicate Matrix… to take a

snapshot by replicating the matrix.

For more information see the duplicate matrix documentation.

By default, all the project data, results, models and plots will be saved as a proprietary

binary format with the .unsb file name extension.

It is also possible to save just a matrix from a project, by selecting the matrix, right clicking,

and choosing Save Matrix. The given matrix is then saved as a file with extension .unsb and

can be opened as a separate project.

Other options are to use File – Export to export a selected data set in file formats that can

be opened with for instance Matlab or Microsoft Excel.

The default binary format will load and save faster, whereas the XML based format makes it

easy to create software for reading data saved by The Unscrambler®.

The Unscrambler® file formats supported:

Version    File name extensions (1)    Compatibility

(1) The file names are given in glob notation: “*” means any number of characters, “?”

any single character, and “[ABC]” any one of A, B or C.
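
The three wildcards can be tried out with Python's standard fnmatch module, which follows the same glob conventions. The patterns and file names below are hypothetical and chosen only to show how each wildcard behaves.

```python
from fnmatch import fnmatchcase

# Each tuple is (file name, pattern); all values are illustrative only.
examples = [
    ("model.unsb", "*.unsb"),          # "*" matches any run of characters
    ("data1.txt", "data?.txt"),        # "?" matches exactly one character
    ("fileB.dat", "file[ABC].dat"),    # "[ABC]" matches A, B or C
    ("fileD.dat", "file[ABC].dat"),    # D is not in the set, so no match
]

results = [fnmatchcase(name, pattern) for name, pattern in examples]
for (name, pattern), matched in zip(examples, results):
    print(name, pattern, matched)
```

fnmatchcase is used instead of fnmatch so that the comparison is case-sensitive on every platform.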

This is a guide to the project navigator.

Create a project

Items in a project

Browse a project

Managing items in a project

Actions common to all item types

Actions for data table nodes

Actions for results nodes

The top node in the project navigator represents the project node. Only a single project can

be opened at one time. The project contains all of the data for a particular analysis, any

transformed (preprocessed) data, any models developed, and predictions or classifications

performed.

Models such as PCA or PLSR, or predictions using these, have their own special node icons

for better recognition of the types of analysis that have been performed.

When a user adds column or row sets to an imported data matrix, a new subnode is

displayed. This provides the user greater visualization of the structure present in a data

matrix and allows better tracking of modifications. This data organization also creates

subsets of the data that can be chosen for analysis and/or plotting.

When a user transforms the data in an imported or generated matrix, The Unscrambler®

keeps the original data intact during transformation, and provides a new data matrix node in

the project navigator containing the transformed data.

When The Unscrambler® is launched, it will display an empty project, ready to add data.

The Unscrambler® cannot have more than one project open at a time, but each project can

contain many data sets and results.

To start a new project with another project opened, use the File – New menu. A prompt will

ask if the user would like to save the current project.

The first thing to do is to get data or a model into the project. Do that by:

Creating a design matrix.

Importing data.

Importing models.

Matrices

Plots

Results: Each analysis will create a new node containing model or prediction details

Generic icons used for the project navigator nodes

Node symbol Description

Data set

Plot

The project navigator is a useful way to navigate, browse and access data sets, result

matrices, plots and visual presentations of results.

Note: It is possible to collapse (-) and expand (+) the folders to hide or show their

content.

To select an item click on it. It will be displayed in the workspace.

There are different right-click menu options available for the different item types in the

project navigator. These are described in the following.

Plot node menu

Rename

Rename the node

Delete

Delete the node. This operation cannot be undone, so use with caution. This action

has to be confirmed in a pop-up dialog in order for the node to be deleted.

Actions for data table nodes

Data table node menu

Transform

Shortcut to all the pretreatments available in the Tasks – Transform menu.

Plot

Shortcut to all the plots available in the Plot menu.

Export

Export the data using one of the supported external data formats.

Range

The Range option allows the following actions to be performed:

Define Range: Define row and column ranges and special intervals in

a data set. For more information see the Define Range dialog.

Copy Range: Copy the selected ranges (rows or columns) to another matrix of the

same dimensions.

Paste Range: Paste copied ranges into the same or another matrix of the same

dimensions.

Duplicate Matrix

This will create a new copy of the data matrix in the project

navigator. It is a shortcut to the Insert – Duplicate Matrix… option.

Spectra

Define a selected columnset to hold spectral data, in order to change the default

view of certain model result plots (e.g. PLS regression coefficients plotted as line in

Regression Overview, or X-loadings plotted as line in PCA Overview).

Save Matrix

Scalar and Vector

Open the Scalar and Vector dialog in order to add scalar/vector tags to column-sets,

along with units and range information. This is useful for quality control in an online

process.

Actions for results nodes

Result node menu

Recalculate

Rebuild the model with the following changes

Without Marked… (samples or/and variables)

With Marked Downweighted… (variables only)

With UnMarked Downweighted… (variables only)

With New Data… (samples only)

Register Pretreatment

When a model has been built using transformed data, all the transformations will be

selected for automatic pretreatment in case the model will be used for prediction of

new samples. In some cases the new data may have been pre-processed manually

before prediction. Use this dialog to define which transformations are to be applied to

future prediction samples.

Hide/Show plots

Hide/Show the model folder containing the predefined result plots.

Save Model

Save the selected model in a new project file, as described here.

Set Components

Change the default number of components to use for prediction, as described here.

Set Alarms

Open the Set Alarms dialog to set warning and alarm limits for input or output data

of individual models. Can be applied in CAMO’s online engines for prediction,

projection and classification. This is useful for quality control in an online process.

Set Bias and Slope

Bias and slope correction is used as a post-processing step to achieve an offset (bias)

of 0 and slope of 1. This option will be available only for MLR, PCR and PLS

regression models.


Use this dialog to store a given set of transformations applied when building a model for

reuse in prediction, projection or classification.

Normally, the preference is to keep all transformations applied to the training data set

selected, so that prediction data are given the exact same treatment. If not, the model may

be invalid, as input data will not be in the shape expected by the model.

This option allows one to save the model (results) as a separate project (smaller file). There

are several options for the results file. Depending on what option is used, the file size can be

reduced so that they are best suited for usage in prediction and/or classification. These

models can be used also with the Unscrambler Prediction Engine, Classification Engine, and

Unscrambler Process Pulse. Select a model in the project navigator and right click to select

Save Result.


In the dialog, one has the option to save several different types of model files. These smaller

model files do not support the plots, and do not include the raw data and some of the

validation matrices that are present in the entire model. The prediction (or classification)

results that can be computed depend on the type of model that is saved.

Entire model

This saves all the results and supports all visualizations that are available when a

model is developed in The Unscrambler® X. This option also permits recalculation of

the model by keeping out any selected data. This option is available for MLR, PLS,

PCR and PCA models.

Prediction

The prediction result options save the model in smaller files, as the model result file

does not include many of the results matrices including the validation results and

other matrices used in the prediction visualizations.

Full with support for inlier detection: The model result file does not include the

following matrices: Y scores, Beta coefficients (weighted), Variable leverage, X

Correlation loading, Y correlation loading, Square sums, and Rotation. Three of the

validation matrices are saved in this model format: X total residuals, X value

validation residuals, and Y value validation residuals. This model can be used for

prediction, giving all the results that The Unscrambler® computes on prediction,

including the deviation.

Full: This model results file allows one to predict new values, and get the deviation

with that value, as well as to detect outliers (based on Hotelling’s T2 and Q

residuals). With this model, inliers cannot be computed during the prediction stage.

The Hotelling’s T2 and Q residual limits and X values are computed, but not plotted

during prediction with the Full model. Compared with the entire model, this version

saves 11 of the 20 validation matrices. It does not compute the Inlier limit and the

Sample inlier distance, nor the seven matrices that are saved with the Full (with

inlier detection) prediction result.

Short: In the short model, only the raw beta coefficients are saved, at the optimal

(or user-defined) number of components. No validation matrices are saved. With a

short prediction model, one can get the predicted results for new data, but no other


predicted values can be made when using a short prediction. A short prediction

model is not recommended if one would like to have model and/or sample

diagnostics during the prediction step.

Classification

PCA, PCR and PLS models can be saved for use only for classification. These models

cannot however then be used for regression. This result option saves the

information from the model needed to apply this model for classification. It is a

smaller file, and contains only the results and validation matrices needed to perform

classification on new samples. The results matrices saved for a PLS classification

model are: X means, X weights, X loadings, scores, and Loading weights. The PCA

classification model does not include plots. The results matrices with the PCA

classification model are: X means, X weights, X loadings, and scores. The validation

matrices saved in this model format are: X Variable Residuals, X Variable Validation

Residuals, X Sample Residuals and X Sample Validation Residuals. A model of type

classification can be used with OLUC X.

Number of components

A model will be saved with all the components that have been computed for it,

unless specified otherwise (and for a short model, which will be saved for the

optimal number of components by default). The user can specify the number of

components to save with a given model. This can be more, or less than the optimal

number of components for a given model.

The user can set alarms during model development that can be useful during prediction,

classification and projection for new samples. Two warning limits (high and low) and two

alarm limits (high and low) can be set for the available results and validation matrices

calculated from PCA, MLR, PCR and PLSR. The values entered here serve as warning and

alarm thresholds. The alarm values can be entered in standard or scientific notation.
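As an illustration of how two warning limits and two alarm limits could be evaluated for a single value, consider the following minimal Python sketch. The names AlarmLimits and alarm_state, and the state labels, are hypothetical and are not part of The Unscrambler®. Treating a value that lies exactly on a limit as still inside it is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class AlarmLimits:
    """Hypothetical container for the four limits (illustrative only)."""
    low_alarm: float
    low_warning: float
    high_warning: float
    high_alarm: float


def alarm_state(value: float, limits: AlarmLimits) -> str:
    """Return 'alarm', 'warning' or 'ok' for one value."""
    # Alarm limits are checked first because they lie outside the warning limits.
    if value < limits.low_alarm or value > limits.high_alarm:
        return "alarm"
    if value < limits.low_warning or value > limits.high_warning:
        return "warning"
    return "ok"


# Limits can be written in standard or scientific notation.
limits = AlarmLimits(low_alarm=-2.5e1, low_warning=-10.0,
                     high_warning=10.0, high_alarm=2.5e1)

print(alarm_state(3.2, limits))    # ok
print(alarm_state(12.0, limits))   # warning
print(alarm_state(-30.0, limits))  # alarm
```

For matrices where only a high (or only a low) limit can be set, the unused side could be represented with an infinite value such as float("-inf"); this is a design choice of the sketch, not a description of the software.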

4.7.1 Prediction:

This will be enabled only for Regression techniques (MLR, PCR and PLSR). Low and high limits

can be set for the Deviation and Scores matrices, for each of the Y responses. Only high

limits can be set for Hotelling’s T², Sample Leverage, X Sample Q-Residuals and Validation

Residuals. For Explained X Sample Validation Variance, low limits can be set.

Set Alarm States for output matrix of Prediction


4.7.2 Classification:

Only high limits can be set for X Residuals, Si/S0 and Leverage matrices that will be used for

classifying new samples for models developed from PCA, PCR and PLSR.

Set Alarm States for output matrix of Classification

4.7.3 Projection:

Scores matrix provides the option to set low and high limits. For Hotelling’s T², Sample

Leverage and X Sample Q-Residuals matrices only high limits can be set. For Explained X


Sample Validation Variance, low limits can be set. Projection for new samples is available

only for models developed from PCA, PCR and PLSR.

Set Alarm States for output matrix of Projection

4.7.4 Input:

This feature helps the user to understand whether the inputs are from one or different sources.

If the user has already defined the columnset matrices using the Scalar and Vector dialog, those will

be listed for selection. Alternatively, the Define button opens the Scalar and Vector

dialog for defining limits for columnset matrices.

Set Alarms for input matrix


Use this option to set the number of components for a model to a value other than the

optimal recommended number. This number of components will then be used when the

model is used for prediction and/or classification.

Bias and slope correction is sometimes used as a post-processing step to achieve an offset

(bias) of 0 and slope of 1. This may be useful e.g. if samples measured on a different

instrument give consistently different predictions than samples measured on the same

instrument as the calibration data. If successful, this means that the same model can be

used to predict properties of samples measured on different instruments. Caution is

required however, as any bias and slope estimation will be associated with a risk of

overfitting, and there is no guarantee that the prediction error for future samples will


improve. Despite the risks, bias and slope correction has been proven useful in some

industries such as the agricultural sector.

4.9.1 Algorithm

Bias and slope correction is performed on prediction data Yhat by subtracting the bias and

then dividing by the slope: Yhat_corrected = (Yhat – bias)/slope

The bias and slope estimates in the above equation can be taken directly from a test set

validated Predicted vs. Reference plot, or they can be input manually by the user. Default values

when not explicitly specified are bias=0 and slope=1.
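The correction formula can be sketched in a few lines of Python. The function name correct_predictions is hypothetical; only the formula Yhat_corrected = (Yhat – bias)/slope and the default values bias=0 and slope=1 are taken from the text.

```python
def correct_predictions(y_hat, bias=0.0, slope=1.0):
    """Apply Yhat_corrected = (Yhat - bias) / slope to each prediction."""
    if slope == 0:
        raise ValueError("slope must be non-zero")
    return [(y - bias) / slope for y in y_hat]


# With the defaults (bias=0, slope=1) the predictions are left unchanged.
print(correct_predictions([2.0, 4.0, 6.0]))                       # [2.0, 4.0, 6.0]
print(correct_predictions([2.0, 4.0, 6.0], bias=1.0, slope=2.0))  # [0.5, 1.5, 2.5]
```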

The user can set bias and slope during model development that can be useful during prediction

for new samples. Select a regression (PCR, PLS, MLR) model in the project navigator and right

click to select Set Bias and Slope.

4.9.3 Usage

In the dialog, the user has the option to check Apply Bias and Slope correction. When

checked, the model will perform bias and slope correction during prediction based on

whichever of the options below is selected.

Re-calculate from Prediction data: When selected, the bias and slope correction

factors will be the offset and slope, respectively, as taken from the ‘Predicted vs.

Reference’ plots for the new prediction data. The underlying assumption is that any

differences in bias and slope between the calibration and prediction data are due to

systematic and repeatable differences between the instruments used to collect the


two data sets. If used indiscriminately, this may decrease the actual prediction

performance and the option should therefore be used with caution. When selected,

reference Y data are mandatory in prediction.

Set or apply default correction factors: With this option default correction factors

based on the calibration model are suggested. For test-set validated models these

are the validation Offset and Slope values of the ‘Predicted vs. Reference’ plot,

under the assumption that the test set data are measured on a different instrument

that is representative also for future predictions. For leverage and cross-validated

models this assumption cannot be met and the default bias and slope is therefore 0

and 1, respectively. The user is free to manually change the default values, in which

case a message will be displayed that the values have been manually edited. A Reset

button will revert the bias and slope correction factors back to the default values.
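To make the origin of the correction factors more concrete, the sketch below estimates an offset and slope from paired reference and predicted values, and then applies the correction. It assumes that the Offset and Slope reported for a Predicted vs. Reference plot correspond to an ordinary least-squares line with the reference values on the x-axis and the predicted values on the y-axis. This is an assumption made for illustration, and the function name fit_offset_and_slope is hypothetical.

```python
def fit_offset_and_slope(reference, predicted):
    """Least-squares line: predicted = offset + slope * reference (assumed)."""
    n = len(reference)
    if n < 2 or n != len(predicted):
        raise ValueError("need at least two paired values")
    mean_x = sum(reference) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in reference)
    if sxx == 0:
        raise ValueError("reference values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reference, predicted))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return offset, slope


# Made-up example data: predictions are systematically shifted and scaled.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [2.5, 4.5, 6.5, 8.5]

offset, slope = fit_offset_and_slope(reference, predicted)
corrected = [(y - offset) / slope for y in predicted]
print(offset, slope)
print(corrected)
```

Because the factors are estimated from the same samples they are applied to, such a fit will always look good on those samples. This is the overfitting risk that the text warns about.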

4.10. Login

Two modes of operation are available in The Unscrambler®:

Compliance Mode- Recommended for users and industries

that need to comply with the regulations of 21 CFR Part 11 (electronic signatures).

Non-Compliance Mode- Recommended for users and industries that do not require

electronic signature authentication and audit trailing.

The choice of installation procedure and internal program setup determines what level of

login is required by a user. This is described further in the following sections.

When The Unscrambler® is installed in Non-Compliance mode, the first time the program is

started, the Guest login screen is displayed,

Guest Login, Non-Compliance Mode

The Guest login requires no password or definition of a user group domain, so by clicking on

Login a user is entered into the program.

In Non-Compliance mode, a user name and login password can be set up from the Help -

User Setup menu.

If a user name and password have been set up, when a user attempts to login to the

program, a dialog similar to the one shown below is provided,

Login with defined User Name and Password, Non-Compliance mode


In this case a user called User 1 was set up. This time, a password is required to enter the

software. If a user forgets their password, the Forgot? option should be selected. This is

described further in the next section.

Password reminders

It is possible to click Forgot? next to the password entry for a password reminder question

that is configured during user setup.

Password recovery dialog

In this dialog, a user is required to enter the correct answer to the security question and is

then required to enter a new password (with confirmation).

If the wrong answer to the question is entered, the following warning will be provided,

If the new password has not been entered the same way in the confirmation box, the

following warning will be provided,

Incorrect password confirmation warning


When The Unscrambler® is installed in Compliance mode, it uses the Windows

Authentication details of the user logged into the computer that is being used for the

analysis. There are two options available during the installation and setup of the program,

Set up compliance mode with Login dialog shown each time the program is started

Set up compliance mode with a hidden Login dialog

When the installation is performed such that a user is required to login to The

Unscrambler®, a dialog similar to the one shown below is provided.

Windows Authentication login

The user's Windows name is shown in the login screen. To enter the program, the user must

enter their Windows password.

Automatic entry

When the program is installed in compliance mode, but the Hide login screen option is

chosen, when a user starts The Unscrambler® they are automatically logged into the

program and the Windows authentication details are used in the Audit Trail.

This authentication method takes advantage of centralized user management features used

in regulated network configurations, instead of redefining the user names.


For more information on how The Unscrambler® security features help a company to comply

with the requirements of 21 CFR Part 11, please have a look at the Statement of compliance.

4.11. File

4.11.1 File menu

File – New

or Ctrl+N

This option is used to create a new project.

A new, blank workspace is created with a single node entry in the project navigator named

“New Project”.

See organizing data to get started adding data to a project.

File – Open…

or Ctrl+O

This option opens an existing project, using a regular file selector dialog.

File – Close

or Ctrl+W

This option closes the current project file. If changes to the project have not been saved, The

Unscrambler® prompts the user to save the project before closing it.

This option allows the import of data from an external data file. This may be data from

another project file, an earlier version of The Unscrambler® or one with a different format,

e.g. Excel, ASCII, or data files from instrument formats.

For more information see the importing data documentation.

File – Save

or Ctrl+S

Saves the currently open project file.

Save the current project in a new location or with a different file name.

The Unscrambler® will save projects using a proprietary binary format with the .unsb file

name extension.

Depending on whether a user is in the Editor or Viewer mode, an option to save the matrix

or the model to a location separate from the project is available.


File – Export

This is a menu option which allows one to export all or selected parts of a data matrix to an

external file, in one of the available export formats.

For more information see the exporting data documentation.

File – Print…

or Ctrl+P

This will open the Print dialog, where the user selects settings to print the current document

to a printer or file.

For more information see the print dialog documentation.

File – Security

The Security function contains two options, Protect and Sign.

Protect

This command enables a user to protect a project with a password. Whenever this project is

accessed, the user will need to provide the password to open it. A project file can also be

Unprotected by using the command File-Unprotect, and entering the correct password.

Note: The password must be remembered! If it is lost, the project cannot be opened again

Sign

For a more detailed description on how The Unscrambler® implements Digital Signatures,

click here

The Security feature is part of the overall data integrity and compliance capabilities of the

software, which also includes Windows Authentication and Audit Trails.

For more details on how The Unscrambler® meets the requirements of digital and electronic

signatures, please refer to the section on Data Integrity and Compliance

File – Recent

The list of recently opened projects is displayed. One can toggle different projects upon

selection.

File – Exit

This allows one to quit The Unscrambler®. If any project files have been changed since the

project was last saved, there is a prompt asking if changes are to be saved.

This will send the currently viewed plot or data table to a printer.


Plots are scaled to fit within the margins set for the designated paper size and will retain the

same aspect ratio as is seen on the screen.

Data tables will normally print with 50 rows and 6 columns per page, depending on the

numeric format and font settings. Row and variable names and numbers will be included on

each page.

Print options from The Unscrambler® work as in any Windows application, where the user

selects printer, paper size, orientation, margins, etc.:

One may print either the current plot, or all plots. Select Current Plot to print out only the

currently active plot on screen; select All Plots to print out all plots currently shown on

screen.

In the field Print range designate what to print by selecting the appropriate radio button.

The print range applies to the current window in the Workspace. Use Selection if a range in

the current window has been selected to print.

Note: There must be a file open (in the Editor or the Viewer) to have access to this

option.

The Print dialog for plots offers the possibility to print either the Current plot, or All Plots.

Select Current Plot to print out only the currently active plot on screen; select All Plots to

print out all plots currently shown on screen.

Select the printer to use from the Printer drop-down list.

The properties of the printer can be viewed by pressing Properties. See the operating

system documentation or printer manual for information on setting up the printer.

Information can be printed to a file by clicking on the Print to file box.


Print preview

It is a good idea to preview a document before sending it to the printer. Print preview

provides a look at how the pages will look when they have been printed. The option is only

available if a file is currently open.

4.12. Edit

4.12.1 Edit menu

The Edit menu has three different modes, and the displayed options depend on which part

of the application window is active at any given time. There are separate modes for the

workspace editor and viewer as well as for the project navigator. Some menu items are

common for two or three modes.

Common actions

Edit – Undo

Edit – Redo

Edit – Cut

Edit – Copy

Edit – Paste

Edit – Delete

Navigator mode

Edit – Rename

Edit – Spectra

Editor mode

Edit – Copy with Headers

Edit - Insert Copied Cells

Edit - Append Copied Cells

Edit - Reverse

Edit - Convert

Edit - Fill

Edit – Find and Replace

Edit – Go To…

Edit – Select

Edit – Sort

Edit – Append

Row(s)/Column(s)…

Category Variable…

Edit – Insert

Row(s)/Column(s)…

Category Variable…

Edit – Split Text/Category Variable

Edit – Change Data Type

Edit – Scalar and Vector

Edit – Define Range…

Edit – Group rows…

Edit – Make header

Edit – Add Header

Edit - Category Property


Viewer mode

Edit - Add Data

Edit - Create Range

Edit - Sample Grouping

Edit - Copy all

Edit – Draw

Edit – Mark

The workspace editor Edit menu mode is activated by clicking anywhere in a data table.

The workspace editor Edit menu

The workspace viewer Edit menu mode is activated by clicking in a plot. The same menu will

be shown irrespective of whether it is a raw data plot or a model results plot, however some

menu items will be grayed out when not applicable to specific plots.

The workspace viewer Edit menu


The project navigator Edit menu

Common actions

Edit – Undo

or Ctrl+Z

This option reverses the last operation(s) performed on the data in the editor. This can be

used to Undo up to the last 10 operations. The size of the undo stack can be increased, see

Tools – Options… menu.

The following operations can be reversed with the undo operation:

Cut, paste action with column, row, headers

Change data type for column and headers

Delete data action for entry (including headers)

Delete row/column/headers action

Drag and drop of entry/column/row/headers

Move row, or column


Move column to row headers

Edit – Redo

or Ctrl+Y

It is possible to recover the results of an editing operation(s) that has just been undone with

the help of the Redo command.
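The interplay between a size-limited undo stack and Redo can be sketched as follows. This is a generic illustration, not the software's implementation: the class name UndoHistory is hypothetical, and the behavior of discarding the redo list when a new operation is recorded is a common convention that is assumed here.

```python
from collections import deque


class UndoHistory:
    """Bounded undo/redo history (illustrative sketch)."""

    def __init__(self, max_undo=10):
        # A deque with maxlen silently drops the oldest entry when full.
        self._undo = deque(maxlen=max_undo)
        self._redo = []

    def record(self, operation):
        self._undo.append(operation)
        self._redo.clear()  # assumed: a new edit invalidates pending redos

    def undo(self):
        if not self._undo:
            return None
        operation = self._undo.pop()
        self._redo.append(operation)
        return operation

    def redo(self):
        if not self._redo:
            return None
        operation = self._redo.pop()
        self._undo.append(operation)
        return operation


history = UndoHistory(max_undo=10)
for i in range(12):
    history.record(f"op{i}")

undone = []
while (op := history.undo()) is not None:
    undone.append(op)

print(len(undone))     # 10: only the last ten operations can be undone
print(history.redo())  # op2: the most recently undone operation is redone first
```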

A selection can be recovered from the clipboard using the Paste command or Ctrl+V.

Edit – Cut

or Ctrl+X

This option removes the selected range, either data in the Editor or a plot in the Viewer, and

places it on the clipboard. Anything placed on the clipboard remains there until it is replaced

with a new item. Use the Paste command to copy the selection to a new location.

Edit – Copy

or Ctrl+C

With this option one can copy the selected range to the clipboard, overwriting its previous

contents. The selected range is not removed from its original place. Use the Paste command

to copy the selection to a new location.

Edit – Paste

or Ctrl+V

This command allows one to insert a copy of the clipboard contents at the insertion point. The

command is not available if the clipboard is empty or the selected range cannot be replaced.

Edit – Delete

or Ctrl+D or Del

This option enables one to delete columns or rows. Select one or more

columns/variables or rows/samples, and delete the selected section(s).

Any previously-defined sets are adjusted for the deleted range.

Navigator mode

Edit – Rename

Rename the currently selected matrix.

Edit – Spectra

Ranges can be defined as being spectra, and once this setting is ticked off for a given range,

loadings plots for these data ranges will display as line plots rather than 2D scatter plots.


Editor mode

Edit – Copy with Headers

or Ctrl+Shift+C

With this option one can copy the selected range to the clipboard, overwriting its previous

contents. The selected range is not removed from its original place. Use the Paste command

to copy the selection to a new location.

Edit - Insert Copied Cells

Inserts copied rows or columns at the selected position in the matrix.

Edit - Append Copied Cells

Appends copied rows or columns to the end of a data matrix.

Edit - Reverse

With this option one can reverse the sample order and/or variable order in a selected

matrix. For more information see the reverse documentation.

Edit - Convert

This command allows one to convert the units of the column headers for spectral data from

wavelength in nanometers (nm) to wavenumber (cm-1) and vice versa. This function is

active when the column header of a matrix is selected.
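The conversion between the two units follows from the definition of the wavenumber as the reciprocal of the wavelength: wavenumber (cm-1) = 10,000,000 / wavelength (nm). A minimal Python sketch of this relationship is shown below. The function names are hypothetical and say nothing about how the software performs the conversion internally.

```python
NM_PER_CM = 1.0e7  # 1 cm = 10,000,000 nm


def nm_to_wavenumber(wavelength_nm):
    """Convert a wavelength in nm to a wavenumber in cm-1."""
    if wavelength_nm <= 0:
        raise ValueError("wavelength must be positive")
    return NM_PER_CM / wavelength_nm


def wavenumber_to_nm(wavenumber_cm1):
    """Convert a wavenumber in cm-1 to a wavelength in nm."""
    if wavenumber_cm1 <= 0:
        raise ValueError("wavenumber must be positive")
    return NM_PER_CM / wavenumber_cm1


print(nm_to_wavenumber(2500.0))  # 4000.0
print(wavenumber_to_nm(4000.0))  # 2500.0
```

Note that equally spaced wavelengths do not give equally spaced wavenumbers, and the order of the columns is reversed by the conversion (increasing nm corresponds to decreasing cm-1).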

Edit - Fill

This command allows a user to fill a highlighted row or column range with either numeric or

categorical data.

For more details see the Fill section.

Edit – Find and Replace

or Ctrl+H

This command allows one to find entries containing a given value or sequence of characters,

and replace the selected value with a new one. The Find search mode can be

selected as text, number or Date Time from the drop-down list. For more information see

the find and replace dialog documentation.

Edit – Go To…

Allows user to move focus to a specific entry in the data table.

For more information see the go to dialog documentation.

Edit – Select

Edit – Select has the following options

Select Rows

To select respective sample.


Select Columns

To select respective variable.

Select Range

To select a range of samples and variables.

Select All (Ctrl+A)

To select the entire matrix.

In the first three cases, the user is asked to enter a range to select. It uses the same syntax as

the Define range dialog, e.g. 1,3-5,8-20.
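The range syntax in the example (single numbers and start-end intervals separated by commas) can be expanded into individual row or column numbers as in the following Python sketch. The function name parse_range is hypothetical, and the handling of spaces and of reversed intervals is an assumption, since the text only gives the example 1,3-5,8-20.

```python
def parse_range(spec):
    """Expand a string such as '1,3-5,8-20' into a list of integers."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas (assumed behavior)
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid interval: {part}")
            indices.extend(range(start, end + 1))  # intervals are inclusive
        else:
            indices.append(int(part))
    return indices


print(parse_range("1,3-5,8-10"))       # [1, 3, 4, 5, 8, 9, 10]
print(len(parse_range("1,3-5,8-20")))  # 17
```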

Note: The Unscrambler® always works with either rows or columns. This also

applies when the whole matrix is selected. Look at the cursor shape or the

rows/columns numbers to see whether the selection is for a row or column mode.

Sample names will also be selected when operating on rows, and column headers

when operating on columns.

Edit – Sort

Sort samples according to their numerical values for the selected variable.

Sort has two options: Ascending and Descending.

Select one or more columns to sort. Headers can also be selected and used as sort keys.

This method uses the quick sort algorithm, which performs an unstable sort; that is,

if two elements are equal, their order might not be preserved. In contrast, a stable

sort preserves the order of elements that are equal.
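If a reproducible order of tied samples matters, one generic way to obtain it, regardless of whether the underlying sort is stable, is to use the original row position as a secondary sort key. The Python sketch below illustrates the idea on made-up data. The function name sort_rows is hypothetical, and the sketch does not reproduce the software's own sorting routine.

```python
def sort_rows(samples, descending=False):
    """Sort (name, value) pairs by value; ties keep their original order."""
    sign = -1.0 if descending else 1.0
    # The original index is the tie-breaker, so the result does not depend
    # on the stability of the sorting algorithm.
    indexed = sorted(enumerate(samples),
                     key=lambda pair: (sign * pair[1][1], pair[0]))
    return [sample for _, sample in indexed]


samples = [("S1", 2.0), ("S2", 1.0), ("S3", 2.0), ("S4", 1.0)]

print(sort_rows(samples))
# [('S2', 1.0), ('S4', 1.0), ('S1', 2.0), ('S3', 2.0)]
print(sort_rows(samples, descending=True))
# [('S1', 2.0), ('S3', 2.0), ('S2', 1.0), ('S4', 1.0)]
```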

Edit – Append

Row(s)/Column(s)…

This option can be used to append rows or columns, depending which entries are selected in

the data table.

A dialog is displayed allowing the user to enter the number of rows(columns) that are to be

appended at the end of the existing data matrix.

See Edit – Insert – Row(s)/Column(s)… below for details.

Category Variable…

Append a new category variable (column).

Details on how to specify a category variable can be found here.

Edit – Insert

Row(s)/Column(s)…

Insert new rows or columns.

Select a row or a column to insert either one or more rows or columns, respectively.

A dialog will pop-up to ask how many rows or columns to insert:


Category Variable…

Insert a new category variable (column).

Details on how to specify a category variable can be found here.

Edit – Split Text/Category Variable

Text: Converts a text variable into multiple new text or category variables as needed.

Category: Create one new column for each level, with binary values (true/false). These will

be inserted to the left of the selected column.
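Splitting a category variable into one binary column per level is often called indicator (or one-hot) coding. The Python sketch below shows the idea on made-up data. The function name split_category is hypothetical, and ordering the new columns by first appearance of each level is an assumption.

```python
def split_category(values):
    """Return one true/false column per level of a category variable."""
    levels = []
    for value in values:
        if value not in levels:
            levels.append(value)  # keep levels in order of first appearance
    return {level: [value == level for value in values] for level in levels}


columns = split_category(["low", "high", "low", "mid"])
for level, column in columns.items():
    print(level, column)
# low [True, False, True, False]
# high [False, True, False, False]
# mid [False, False, False, True]
```

Each sample is true in exactly one of the new columns, so the columns together carry the same information as the original category variable.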

Edit – Change Data Type

One can change the data type of one or several variables by selecting them and using the

option Change Data Type in the Edit menu. The available data types are:

Text

Numeric

Date-time

Category

Edit – Scalar and Vector

This item opens a dialog where units can be assigned to previously defined or new column

ranges. Each column range can also be defined as a scalar (e.g. single process variable) or

vector (e.g. spectrum).

For more information see the Scalar and Vector documentation.

Edit – Define Range…

or Ctrl+E

Create and edit ranges for easy access to often-used selections.

For more information see the define range dialog documentation.

Edit – Group rows…

Create row ranges based on a category variable or a variable split linearly into value ranges.

For more information see the add row range from column dialog documentation.

Edit – Make header

Convert the selected column or row to a header.

This action can also be invoked by right clicking on a row or column number.

The existing row or column will be removed as a result of making it a header, and a header

can not be converted to data.


Edit – Add Header

Insert an extra header.

A row or column header must be selected to add either a new row or column header,

respectively. Choose to insert the row header above or below, or the column header to the

left or right.

There can be up to five column and row headers.

Edit - Category Property

This option allows one to change the properties of category variables, more details on which

can be found at Property dialog.

Viewer mode

Edit - Add Data

To add data to an existing plot, select Edit - Add Data….

The following dialog box opens.

Add Data… dialog box

Matrix

Use the drop-down list if the data are in a data matrix and use the select result

matrix button if the data are in an analysis result.

Rows and Cols

Use the drop-down list if the subset is already defined and use the Define button if it

has to be defined.

Edit - Create Range

Once some samples / variables are selected in a plot it is possible to create a new range

including them. This can be done using the Edit - Create Range option or by right clicking on

the plot with the selected items and selecting the option Create Range.

The new range appears under the matrix that was plotted as a new row or column set.


Edit - Sample Grouping

For more information see the Sample grouping dialog documentation.

Edit - Copy all

This action will copy all plots in the current viewer to the clipboard and make them available for

pasting into documents, etc.

Edit – Draw

This option allows a user to add a drawing object to the plot. It is possible to draw with five

different types of objects: line, arrow, rectangle, ellipse or text. This option can also be

accessed by right clicking while in a plot and selecting Insert Draw Item

For more information see the plot annotation documentation.

Edit – Mark

Mark objects (samples or variables) to bring focus to them in plots and interpretation. There

are options for automatic sample or variable selection based on modeled data, or for

manual marking using the one by one, rectangle or lasso tools.

The submenu for marking objects

A typical use of this command is to mark extreme samples in a score plot in order to

investigate the behavior of those samples on other plots. Another is to mark ranges of the

spectra in the Important variables plot, to make a new model based on only important

wavelengths.

Note: If the Viewer contains more than one plot, marking is only possible from the

currently active subframe. For instance, if the currently active subframe contains a

scores plot, only samples can be selected. In order to mark variables, one must click

on the subframe containing a variable plot in order to mark any variables.

Once objects have been marked, they appear marked in all current and future plots, until

they are unmarked or when the Viewer is closed.

Access the category converter

The Category converter is accessible from two menus:


Select a variable. Go to the menu Edit and select the option Change Data Type and

from the four choices select Category….

Menu Edit – Change Data Type – Category…

Right click

Select a variable. Right click. Select the menu Change Data Type – Category….

Right click access to the Category Converter


There are two ways of creating levels for category variables:

Use ranges of values


If there were already some values in the selected variable each of them will be defined as a

level. Click on OK if this corresponds to what is needed.

The variable background changes color to differentiate it from the numerical variables.

It is possible to add new values for new samples or to select one of the available ones by

using the drop-down list.

Choices of levels in the drop-down list

If the variable to be converted into a category variable is a continuous variable, it is

recommended to use ranges of values.

To do so select the second option available in the Category Converter: New levels based

upon ranges of values.

The preselected variable is in the field Select Variable. If the variable to be used is a

different one, select it using the drop-down list.

The field Value based on selected Variable gives information on the selected variables such

as:

The minimal and maximal values.

This information is displayed to guide one to select the number of levels to choose and to

define the intermediate ranges.

Select the number of levels using the associated box.

Decide the method to be used to define the range among the two following options:

Divide total range of variation into interval of equal width

If this is the selected option the ranges will be automatically defined when changing

the number of levels.

Specify each range manually

Double-click on the entry to define the ranges.

appear if the entered value is not correct

When done, click on OK.
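
The equal-width option can be illustrated with a short Python sketch. This is only an illustration of the idea, not the software's implementation: the function name equal_width_levels and the level labels are assumptions, and so is the choice to place the maximum value in the last level.

```python
def equal_width_levels(values, n_levels):
    """Assign each value to one of n_levels intervals of equal width.

    Hypothetical helper for illustration; returns (edges, labels).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_levels
    # n_levels + 1 boundaries spanning the total range of variation
    edges = [lo + i * width for i in range(n_levels + 1)]
    labels = []
    for v in values:
        if width == 0:
            index = 0  # constant variable: everything falls in one level
        else:
            # The maximum would otherwise land past the last level, so clamp it
            index = min(int((v - lo) / width), n_levels - 1)
        labels.append(f"Level {index + 1}")
    return edges, labels


edges, labels = equal_width_levels([1.0, 2.5, 4.0, 7.0, 10.0], 3)
print(edges)
print(labels)
```

The minimal and maximal values shown in the dialog play the role of lo and hi here; changing the number of levels simply recomputes the boundaries.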

This option allows one to change the properties of category variables that have already been

defined. The name of the category column, as well as the name for any given category can

be changed. The order of categories can be changed, categories can be added, and already

defined categories can be deleted.

This is also available as a right click option. Highlight a column and right click; the following

options will be displayed.

This option allows a user to select specified row or column ranges and fill them with either a

constant number for numerical columns, or text if the row or column is defined as text.

This option also allows selected rows to be filled with pre-defined categorical variables.

The dialog box for the Fill option is provided below.

To fill a column/row with a specified value, either highlight the entire row/column or select a

sub-section using the mouse and select Edit - Fill. Enter the specified value (or text) in the

Value box and click on OK. The selected region will be filled with this value.

Note: A block of rows and columns can also be selected using this option.

To fill rows/columns with a category variable, first define the categories using Edit - Change

Data Type - Category. Then select specified cells and use the Edit - Fill option, this time

selecting the desired category from the Level drop-down list. Click on OK and the cells will be

filled with this new category.

The Fill option is also available as a right click option from the Editor.

This command allows a user to find entries containing a given numerical value or word, and

replace the selected value with a new one.

There are three search modes: text, number and date-time.

Edit – Find and Replace (Ctrl+F, or Ctrl+H) launches the Replace pane, where one can specify

a value to search for, launch the search, and optionally define a replacement value and

perform the replacement. When a category value is replaced with a value that is not yet defined

as a level, a warning about creating a new category level will be displayed.

Find and Replace:

Find option

By selecting the Options button, one is presented with Find Option choices, which

enable one to match case, match entire entry contents, choose the search criteria and

search in indicated directions in the data matrix.

Select search type Numeric, Text or Date time from the Search mode drop-down

list.

Type a word, a number, or a date to search for in the Find what field.

Or tick Range to search within numeric or date limits. This option works only for

Numeric and Date time variables.

For replacing category values, select the variable and use the Find and Replace

option.

Text mode will match category variables. A category level labeled "200" is still

a text string. It is recommended to use words to label category levels both to avoid

confusion and to give each level meaning, such as "High" or "Low".

Click the Find Next button to locate a cell with the chosen value or sequence of characters.

If the search is successful, the entry is marked in the editor with a black frame (or a white

frame if the search is occurring in a selected area). If no match is found, the cursor does not

move from its original place.

In addition, one can make a more specific search by clicking Options which will expand the

dialog with additional search parameters:

Match case

Make search case sensitive.

Find only entries which have the requested sequence of digits or characters as exact

contents.

Search criteria

Specify how text is matched.

Choose Contains, Equal, Starts with, or Ends with from the drop-down list.

Search direction

Set search order to traverse horizontally first (by row), or vertically first (by column).

Restricted to selection

Base search on preselected data only.
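
How these options combine can be sketched in Python. The sketch below is an assumption-laden illustration, not the program's code: the functions matches and find_all are hypothetical, and the table is represented as a plain list of rows.

```python
def matches(entry, find_what, criteria="Contains", match_case=False):
    """Return True if a text entry satisfies the chosen search criteria."""
    text, target = str(entry), str(find_what)
    if not match_case:
        text, target = text.lower(), target.lower()
    if criteria == "Contains":
        return target in text
    if criteria == "Equal":
        return text == target
    if criteria == "Starts with":
        return text.startswith(target)
    if criteria == "Ends with":
        return text.endswith(target)
    raise ValueError(f"Unknown search criteria: {criteria}")


def find_all(table, find_what, by_row=True, **options):
    """List (row, column) positions of matching cells in search order."""
    n_rows, n_cols = len(table), len(table[0])
    if by_row:
        # Horizontally first: finish each row before moving to the next
        order = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    else:
        # Vertically first: finish each column before moving to the next
        order = [(r, c) for c in range(n_cols) for r in range(n_rows)]
    return [(r, c) for r, c in order if matches(table[r][c], find_what, **options)]


table = [["High", "low"], ["Low", "Highest"]]
print(find_all(table, "high"))
print(find_all(table, "Low", by_row=False, criteria="Equal", match_case=True))
```

The search direction only changes the order in which matches are visited, which matters when stepping through them one at a time with Find Next.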

Once a value has been specified for the Find what value, proceed with a replacement.

In the Replace with field, type in the new value or sequence of characters. Any combination

of digits and characters is allowed, e.g. A51-02.b.DSF24%. However, if the requested value

is not compatible with the current type of entry (e.g. “A51” in a numeric entry), an error

message will be displayed and no replacement will be made.

If the Find what value has already been located with the Find Next button, hit the Replace

button to replace the value in the current entry. In order to make the replacement in all

entries containing the Find what value, hit the Replace All button.

The Undo button is available once a replacement has been performed. Clicking it reverses

the last replacement made.

If the Find and Replace dialog has already closed, use the Edit – Undo command (Ctrl+Z) to

revert the change.

Use Edit – Go To… to move focus to a given data matrix location. This function is active

when the cursor is in an active matrix window.

Result after:

This function allows one to move quickly to specific entries in a data matrix.

This tool will insert a new column with a category variable, either by manually entering

levels, or deducing true/false levels based on one or more non-overlapping row sets.

Create category variable: Specify levels manually

or Ctrl+E

Ranges define specific parts of the data table on which to perform analyses. When a set of

columns is defined, this is called a Column range and usually defines a specific set of

variables. These variable sets may define a single independent (X-data) range for methods

like PCA or two sets such as the X-data and the dependent Y-data for methods such as PLSR.

When a set of rows (or samples) is defined, this is known as a Row range and these are

useful when defining training and validation sets for any analysis method in The

Unscrambler®.

Combinations of row and column sets together define specific data regions to be used for

analysis purposes and the preparation of data can be performed using the Define Range

option.

Get information on:

Define range dialog

Create range from data editor

Create range from scores plots

Automatic keep outs

The Define Range dialog can be accessed from:

Menu Edit – Define Range…

If the case arises that a new range has to be defined during an analysis setup, most of the

plotting and analysis dialogs in The Unscrambler® have the Define button available. An

example from the PCR dialog is shown below

Define buttons in the PCR dialog

By selecting this option from either the Edit menu or from an analysis dialog, the Define

range dialog box described in the next section will appear.

Define range dialog

Dialog

The Define Range dialog is a multi-task, interactive window for easily defining specific row

and column sets prior to analysis.

Define range dialog

Dialog Usage

Functions

The dialog box contains the following functions for easily defining sets within a selected data

table.

Row and Column Ranges

This section provides two lists of the row and column sets available in a

table. To add a new row/column set, either interactively select the sets using the

data viewer with a mouse, or manually enter specific ranges into the text dialog

boxes. For example, if a new row set is to be defined called training, and it is to

cover rows 1-10 of the current table, the dialog for Row ranges should be set up as

follows,

To add the new row set to the list, click on the Create button. Use a similar procedure for

defining new column sets.

If modifications have to be performed to an existing row or column set, simply

highlight the set from those available in the list, make the modifications using either

an interactive or manual change and click on the Update button. The set definition

will be updated accordingly in the list.

Inverting a selection

In some applications, the definition of training and test sets is an important step in

multivariate analysis. If a training set has been defined and the test set is to be

defined as the rest of the samples not defined by the training set, click on the Invert

Selection button, and the reverse of the current selection will be selected. To

add the inverted selection to the list, provide the row or column set with a unique

name and click on Create. This will define a training and test set which is particularly

useful when using Test Matrix Validation.

Range deletion

To remove existing row or column sets from a list, simply highlight the sets and

click on the Delete Range button.

Using all of the actions described above, when the OK button is selected to apply the

changes, all of the defined ranges (or deletions) will be shown in the data matrix node in the

project navigator.

Keep out

Use this option to define samples or variables to be kept out in the analysis from the

defined range(s).

Variables and samples satisfying given conditions are automatically added to these

lists. For more information on how this works see below.

Special intervals

The special intervals option can be selected for performing predefined

actions to a data table when defining row or column sets. To access this functionality, click

on the Special Intervals button.

Interval

Insert regularly spaced row or column indices using the drop-down list “Samples”

and “Variables” values. There are two parameters to enter:

The starting sample in the field Starting from spin box.

Use this option to define evenly spaced calibration (or validation) samples and use the Invert

function described above to easily define such sets.

Random

Insert random row or column indices using the drop-down list “Samples” and

“Variables” values and indicating a number to define in the manual entry box.

Category

Insert row indices based on a category variable. Select the category variable in the

drop-down list.

When the appropriate ranges have been selected click OK to apply the changes.
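
The Interval, Random and Invert Selection ideas can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, indices are taken to be 1-based as in the data table, and the step parameter stands in for the spacing setting, which is not described above.

```python
import random


def interval_indices(n_total, start, step):
    """Regularly spaced 1-based indices: start, start + step, ..."""
    return list(range(start, n_total + 1, step))


def random_indices(n_total, count, seed=None):
    """A sorted random selection of `count` distinct 1-based indices."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_total + 1), count))


def invert_selection(n_total, selected):
    """All 1-based indices that are not in the current selection."""
    chosen = set(selected)
    return [i for i in range(1, n_total + 1) if i not in chosen]


# Every third sample for calibration, the remaining samples for testing
training = interval_indices(10, start=1, step=3)
test = invert_selection(10, training)
print(training)
print(test)
```

Defining one set and inverting it guarantees that the two sets do not overlap and together cover every sample.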

Create range from data editor

Ranges can be created directly within the data set editor: Begin by selecting the part of the

table that will be included in the range and right click to select the option Create Range,

Create Row Range or Create Column Range as appropriate.

Create Row Range

Sample sets can be created directly from the PCA/PCR/PLSR scores plots as well. Select some

samples using any of the Edit - Mark options and then right-click Create Range. In the dialog

that opens there is an option to use either the marked or unmarked samples (or both). The

selected samples will be added to a new or existing matrix in the project navigator.

See extract samples documentation for details.

Automatic keep outs

Variables and samples not applicable in calculations are automatically added to the lists of

Keep outs. Entries are excluded based on the following (method dependent) criteria:

Columns with category, text or date-time variables.

Entire columns or rows with constant values.

Columns where all values are missing.
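
As a rough illustration of the column criteria, the following Python sketch flags columns that a numeric analysis could not use. It is an assumption-based example only: the function automatic_keep_outs is hypothetical, missing entries are represented by None, and constant rows and method-dependent rules are not covered.

```python
def automatic_keep_outs(columns):
    """Return names of columns that a numeric analysis could not use.

    `columns` maps a column name to its list of values; None marks a
    missing entry. Hypothetical helper for illustration only.
    """
    keep_out = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if not present:
            keep_out.append(name)  # all values are missing
        elif any(not isinstance(v, (int, float)) for v in present):
            keep_out.append(name)  # category, text or date-time entries
        elif len(set(present)) == 1:
            keep_out.append(name)  # constant column
    return keep_out


table = {
    "Temperature": [25.0, 27.5, 31.0],
    "Batch": ["A", "B", "A"],
    "Offset": [1.0, 1.0, 1.0],
    "Empty": [None, None, None],
}
print(automatic_keep_outs(table))
```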

When working with data selectors that have keep out samples/variables, a warning will be

displayed allowing the user to either accept and proceed with keep outs or to cancel the

action. The Details option will display the list of keep outs.

To keep track of row and column exclusions, the data selectors provide a warning to users

that exclusions have been defined. Click on the More details link to see what has been

excluded.

More details

Automatic keep outs can only be removed manually. This means that in cases where a

category variable has been converted to a numeric column, or missing entries have been

filled in, the keep out lists must be edited to include given entries in further analyses.

The order of samples and variables in the data matrix can be reversed by choosing the Edit -

Reverse option from the menu when the cursor is in a data matrix.

The Reverse option menu is shown below

Select a variable to be used for the definition of row ranges. This variable can be:

A category variable,

Or a numeric variable.

Then access the option Group Rows from the menu Edit. A dialog box will open.

Add row ranges on a category variable

When the variable selected is a category variable, all levels will be used to define new

ranges. Therefore the Number of group is disabled.

Add row ranges dialog from category variable

When clicking OK, new row ranges are defined and named after the levels.
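
The grouping by category level can be pictured with a short Python sketch. It is illustrative only; the function row_ranges_from_category is a hypothetical name, and rows are numbered from 1 as in the data table.

```python
def row_ranges_from_category(levels):
    """Map each category level to the 1-based rows that carry it."""
    ranges = {}
    for row, level in enumerate(levels, start=1):
        ranges.setdefault(level, []).append(row)
    return ranges


print(row_ranges_from_category(["High", "Low", "High", "Medium", "Low"]))
```

Each key of the result corresponds to one new row range, which is why the number of groups cannot be chosen freely for a category variable.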

When the variable selected is a numeric variable, the Number of group has to be specified.

The range of values is divided into intervals of equal width.

Add row ranges dialog from numeric variable

When clicking OK, new row ranges are defined and named range1, range2, etc.

The menu option Edit – Sample grouping… can be used to group samples in a plot. This can

also be accessed in any plot by a right mouse click.

This feature is available in the following general plots:

Line plots

Bar plots

When clicking on the menu Edit – Sample grouping…, the dialog box Sample grouping &

marking opens.

Select the matrix to use for sample grouping in the Data frame. All available row sets will

appear in the dialog. They can be selected and moved to Marker settings by using the

arrows. The sample grouping will be based on the groups added to this box. Clear the

available row sets using the Clear button.

Alternatively the user can select a single column from the matrix to use for sample grouping.

If the selected column is a category variable, click Create Row Sets in order to make each

category level available for grouping. If the selected column is of numeric data type, Create

Row Sets will split the samples into a number of equally spaced ranges defined by the

Number of groups box. When created in this dialog, the ranges are created temporarily for

marking the samples. These ranges are not added to the data table in the project navigator.

To delete a selected group from Marker settings, mark the group and use the Remove

button. Alternatively use the Clear All button to remove all defined groups.

The user has the option to separate samples based on colors, symbols or both, and the

group name can optionally be used as point labels. Use the Apply button to preview the plot

settings, or click OK to apply the settings and close the dialog.

The user also has the option to label the samples by pre-defined values that may be

available in a particular column of a data sheet. The appropriate matrix and the

corresponding column need to be selected using the Data for labeling matrix. This will be

enabled only when value is selected from the Label option.

The Scalar and Vector dialog box allows the user to define additional properties of data. Data

may be acquired from different sources and these properties help identify the data

during online processing.

Scalar and Vector Dialog

A single variable column range is defined as a Scalar and the Units, Min and Max

values can be specified. For example a scalar Temperature can be specified within an

allowed range of 25 to 35 degrees Celsius by setting Units=C, Min=25 and Max=35

A multi-variable column range is referred to as a Vector. This is usually a spectrum

where the Start and End wavelength can be defined. For instance an NIR absorbance

spectrum can have Units= and Start and End wavelengths of 1100 and 2500,

respectively.

The Min/Max values are disabled for Vectors and Start/End values are disabled for

Scalars.

Split Text Variable is a text parser function that takes any text variable or row header and splits it into

multiple text or category variables as desired. This function can be accessed from Edit – Split

Text Variable or from the right-click menu after selecting a row header or variable of type

‘text’.

The split text function works with two options: separator and character position.

Separator:

This feature is similar to ASCII import accommodating commonly used separator

types comma, space, semicolon and custom values. Double quotes and consecutive

separators can be handled efficiently.

Split by separator dialog

Character position:

This feature splits text variables into new variables based on the position of the

characters only. The first split value indicates the number of characters at which to split,

and likewise the second split value. The default value for the first split is 0 and for the second split is 6.

Split by character position

Output options:

The following output options are available.

To retain only one or a few of the new variables after the split, define the

range of column numbers in ‘Insert Columns’ using commas and

dashes. The selection can also be set using the mouse in the preview window.

The output variables can either be converted to category type using the option

‘Convert to category’ or be appended as text to the existing row

headers using the option ‘Add headers’.
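
The two splitting modes can be sketched in Python. This is a hedged illustration rather than the program's parser: split_by_separator and split_by_position are hypothetical names, merging of consecutive separators is modeled by dropping empty fields, and the split positions are interpreted as 0-based character offsets.

```python
import csv


def split_by_separator(text, separator=",", merge_consecutive=False):
    """Split one text entry on a separator, honoring double quotes."""
    fields = next(csv.reader([text], delimiter=separator, quotechar='"'))
    if merge_consecutive:
        # Treat a run of separators as one by dropping the empty fields
        fields = [f for f in fields if f != ""]
    return fields


def split_by_position(text, positions=(0, 6)):
    """Cut a text entry at the given character offsets, dropping empty pieces."""
    cuts = [0, *sorted(positions), len(text)]
    pieces = [text[a:b] for a, b in zip(cuts, cuts[1:])]
    return [p for p in pieces if p]


print(split_by_separator('S01;"wheat; milled";rep2', ";"))
print(split_by_separator("a,,b", ",", merge_consecutive=True))
print(split_by_position("Batch1-RunA"))
```

Quoting matters when a sample name itself contains the separator, as in the first call, where the semicolon inside the quotes is kept as part of the field.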

4.13. View

4.13.1 View menu

The View menu has two different modes, and the displayed options depend on which part

of the application window is active at any given time. There are separate modes for the

workspace editor and viewer.

Editor mode

View – Navigator

View – Info

View – Level Indices

Viewer mode

View – Graphical

View – Numerical

View – Auto Scale

View – Frame Scale

View – Zoom In

View – Zoom Out

View – Legend

View – Properties

View – Full Screen

Context dependent plot indicator lines

View – Trend Lines – Target Line

View – Trend Lines – Regression Line

View – Uncertainty Limit

The workspace editor View menu mode is activated by clicking anywhere in a data table.

The workspace editor View menu

The workspace viewer View menu mode is activated by clicking in a plot. The same menu

will be shown irrespective of whether it is a raw data plot or a model results plot, however

some menu items will be grayed out when not applicable to specific plots.

The workspace viewer View menu

Editor mode

View – Navigator

Toggle project navigator pane on/off.

View – Info

Toggle information pane on/off.

View – Level Indices

Available when a data set has category variables. Toggle category variable view as level

integers on/off.

Viewer mode

View – Graphical

This lets the user view the selected data of a Viewer in a graphical mode. This is the default

view for The Unscrambler®.

View – Numerical

Through this option a user may display results plotted in a Viewer as a numerical table. One

can copy that data table to the Clipboard and paste it into an Editor.

Restore the plot using View – Graphical.

View – Auto Scale

This option scales the plot so that all data points are shown within the Viewer window. This

command is useful after using Add Plot and Scaling.

View – Frame Scale

This option scales the plot in a selected frame. One can change the plot by scaling its axes to

fit the desired range. Select the desired area to zoom in a frame.

Use Auto Scale to display the plot as it was originally.

View – Zoom In

This option changes the plot scaling upwards in discrete steps, allowing one to view a

smaller part of the original plot at a larger scale. This can also be done by using the + key on

the graph.

View – Zoom Out

This option scales the plot down by zooming out on the middle of the plot, so that more of

the plot becomes evident, but at a smaller scale. This can also be done by using the - key on

the graph.

View – Legend

View – Properties

This opens a dialog where a user can customize a plot. Here one can change plot

appearance, such as grid, axes, titles, fonts and colors.

See the formatting of plots documentation.

View – Full Screen

Make the plot fill the whole screen. Press Esc on the keyboard or right click to leave the full

screen mode.

Trend lines are available to help interpret Predicted vs. reference plots.

View – Trend Lines – Target Line

The target line is the line with slope = 1.0 and offset = 0.0 (or equation Y=X). In many cases

this line will be the optimal solution, e.g. in predicted vs. reference plots.

View – Trend Lines – Regression Line

A regression line is drawn between the data points of a 2-D scatter plot, using the least

squares algorithm.

Available for Predicted vs. reference plots.
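
The relation between the regression line and the target line can be made concrete with a small Python sketch of an ordinary least squares fit. The function name regression_line and the example numbers are assumptions made for illustration; the closer the fitted slope is to 1.0 and the offset to 0.0, the closer the regression line lies to the target line.

```python
def regression_line(reference, predicted):
    """Least squares slope and offset of predicted versus reference."""
    n = len(reference)
    mean_x = sum(reference) / n
    mean_y = sum(predicted) / n
    # Sum of cross products and sum of squares about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reference, predicted))
    sxx = sum((x - mean_x) ** 2 for x in reference)
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return slope, offset


reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.2, 1.9, 3.2, 3.9]
slope, offset = regression_line(reference, predicted)
print(f"slope = {slope:.3f}, offset = {offset:.3f}")
```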

View – Uncertainty Limit

Uncertainty limits can be indicated using this option for regression coefficients line plots.

For more information, see Martens’ Uncertainty Test and how to plot regression

coefficients.

4.14. Insert

4.14.1 Insert menu

Use the Insert menu to add items to the project navigator.

Add a new data table, which may be empty, or filled with predefined values.

See the insert data matrix dialog documentation.

Create a designed experiment table to perform a DOE.

See the design experiment wizard documentation.

Create a replicate of an existing data table.

See the duplicate matrix dialog documentation.

Create custom layouts for plotting any data matrix or results in a two-plot or four-plot

viewer.

See the custom layout dialog documentation.

When working with data, it is advisable to always maintain a copy of the raw data.

In addition, to use matrices generated while running an analysis for other purposes, it is

necessary to duplicate them. Select the matrix to be duplicated and use the menu option

Insert – Duplicate Matrix… to obtain a replicate of the data table.

This will create a second data matrix, bearing the same name with a replication number in

parentheses, for example “(1)” for the first replication. It is now possible to work on this

replicated matrix.

Duplicate matrix dialog

A window will open, so as to enable a specific selection of the matrix and ranges to

duplicate.

Duplicate matrix dialog

When hitting the OK button, a second data set will be created, bearing the same name with

a replication number in parentheses, for example “(1)” for the first replication.

The structure of the table (row and column ranges) will be maintained.

Duplicated matrix

In this section, information is given on how to create a new data table. This can be done

from the Insert menu, selecting Data Matrix….

When clicking on this option the Add Data Matrix dialogue appears where one can define

the size of the data matrix in terms of rows for the samples, and columns for the variables.

By default, the values are 10 both for the number of rows and columns. This can be edited

by using the arrows or by directly typing in the desired number.

The initial values for the matrix can be chosen from the following options in the drop-down

list in the Add Data Matrix Dialog:

Blank

Unit matrix (diagonal 1 rest 0)

Random values (0-1)

Random values (Gaussian)

Constant

Serial numbered rows

Serial numbered columns

Serial rows with shift

If Constant is chosen, this value should then be entered in the Constant value field.

The Include Headers option will automatically display the default header names for Rows

and Columns in the data matrix.
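
For readers who want to picture what some of the initial value options produce, here is a minimal Python sketch. It is not the program's code: the function initial_matrix is hypothetical, the Gaussian option is assumed to use mean 0 and standard deviation 1, and the serial numbering options are left out because their exact layout is not specified here.

```python
import random


def initial_matrix(n_rows=10, n_cols=10, kind="Blank", constant=0.0, seed=None):
    """Build a rows-by-columns table filled according to `kind`."""
    rng = random.Random(seed)
    fill = {
        "Blank": lambda r, c: None,
        "Unit matrix": lambda r, c: 1.0 if r == c else 0.0,
        "Random values (0-1)": lambda r, c: rng.random(),
        "Random values (Gaussian)": lambda r, c: rng.gauss(0.0, 1.0),
        "Constant": lambda r, c: constant,
    }[kind]
    return [[fill(r, c) for c in range(n_cols)] for r in range(n_rows)]


for row in initial_matrix(3, 3, "Unit matrix"):
    print(row)
```

The defaults of 10 rows and 10 columns mirror the defaults of the dialog.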

After clicking on OK, a matrix will be created with the default name “Data Matrix”. It

contains no values if Initial values were set to Blank, otherwise the designated values are in

the entries. Data can be entered into the empty cells.

Fill a data table

Data may be entered into a blank data table in several ways.

Manually

Data can be entered manually by double clicking on the specific cell and entering the

value. This operation can be done for the data table as well as the sample and

variable name.

Copying data from a spreadsheet (Excel)

Data can be copied from Excel to The Unscrambler® by either drag and drop, or by

copying and pasting it. To drag and drop the data from Excel, it must be selected in

Excel and then dragged into the specific entry or to the beginning (top left corner) of

the area where the data are to be added. The same can be done for the sample and

variable names. Data can also be entered from Excel by using the copy and paste

functions.

Rename

The default name of the data table is “Data Matrix”, but this can be renamed with a more

descriptive name. Rename the data matrix by right clicking on the data matrix icon in the

project navigator and selecting the option Rename.

When this is done, the name will be updated in the project navigator as well as in the

visualization window and navigation bar.

Other functions are also available from this right click menu.

Other approaches to adding data matrices

There are two other options to generate a data table in The Unscrambler®:

Importing data

Create a design table

The Custom Layout tool is a way to display any two or four selected plots.

It can be very useful for example to display the results of two PCA analyses with two

different pretreatments as shown in the plot below for easier comparison.

Custom Layout of two PCA score and loadings plot with or without pretreatment

To access this option select the menu Insert – Custom Layout… and select the desired

layout:

Four viewers,

Two Horizontal…,

Two Vertical….

This menu gives access to a dialogue box divided in four parts corresponding to the four

frames of the visualization window, all containing the same options:

Custom Layout Dialog

Choose Matrix

This button is used to select the data set and variables to be plotted. By clicking on

Matrix it is possible to select a data matrix from the navigator. Adjust the Rows and

Cols to display only what is appropriate.

Choose Matrix dialogue box

To select a matrix that was generated during an analysis, hit the select result matrix

button. The following dialogue box will appear. From here it is possible to

select any matrix.

Choose Matrix - Analysis dialogue box

Type

This drop-down list presents the plot options:

Type drop-down list

Bar: Click to see information about Bar plots.

3D Scatter: Click to see information about 3-D Scatter plots.

Line: Click to see information about Line plots.

Matrix: Click to see information about Matrix plots.

Histogram: Click to see information about Histogram plots.

Normal Probability: Click to see information about Normal Probability plots.

Multiple Scatter: Click to see information about Multiple Scatter plots.

Title

Type in the title to be displayed on the specific plot.

Once all the necessary plots have been defined, hit the OK button. This action will display the

selected plots.

It is always possible to abort this action by clicking the Cancel button.

Once the plots are displayed they are editable using the Properties menu accessible from a

right click on the plot or from the menu shortcut.

Further information is available for the following options:

Format a plot,

Annotate a plot,

Zoom and re-scale a plot,

Save and copy a plot.

Data Compiler:

This section helps the user to process and filter bad and suspect spectra out of a large

dataset based on a combination of a unique sample identifier and a sample replicate

index. Sample identifiers or replicate scans are identified using a categorical/text

variable; to split it, use the ‘Split Text/Category Variable’ feature in the Edit menu.

When clicking on this option the Data Compiler dialog appears where one can

define the Input data, Filter settings and Output options.

Input data:

This tab provides the option to input numeric data (usually spectra) from any data

matrix in project navigator by defining the rows and columns. The sample index

allows the user to select a categorical variable; the number of samples should match

with the data selected. Non-category variable and multiple selection options will not

be allowed and all observations within one category level will be treated as

replicates of a single sample. The minimum number of replicates is used to specify

the minimum number of samples to include in average. The default value is 10 and

minimum value is 1.

Data Compiler - Input data

Filter settings:

The Filter settings tab provides option for primary and secondary filter settings.

Filtering can be done based on the models available in the project navigator and the

compatible models are PCA, PCR, PLSR and SCA. Models with auto-pretreatments

can also be defined by clicking the pretreatment button. Only full models are

acceptable.

Data Compiler - Filter Setting

Upon selection of the model, the available filter type can be selected. For PLS, PCR and PCA

the available filter matrices are

Influence (T2 vs. Q)

Leverage

Hotelling’s T2

Q-residuals

F-test residuals

SCA may have some or all of the above in addition to some or all of:

Conformity limit

Spectral match value

The component setting provides the option to select the number of

components from the selected model. The default number of components is the user-

defined ‘set components’. The user also has the option to select one of six levels of

significance, active for filter types Influence, Hotelling’s T2, Q-residuals and F-residuals.

The Limit settings are active for the following filter types:

Conformity limit: Positive floating point value. Default value 3

Spectral match: Floating point value in range 0-1. Default value 0.99

For additional filtering, ‘Include Secondary Filter’ has to be selected and this follows the

same feature as primary filter.

Output options:

The following output options are available.

Data Compiler - Output Options

Add Statistics: When selected, the statistics tested by the primary and secondary filter models

are added as new column(s) to the original data table.

Add status: The test results from the filter model for status, when selected will be added as

new category column(s) to the original data. Influence filter type will have four status levels

as Good, Extreme, Suspect and Outlier. For all other filter types, the status levels are Good

and Outlier. Additionally users have the option to add the Good and Rejected row ranges to

the existing matrix.

Add ranges for Good and Rejected: When checked (default), two row ranges ‘Good’ and

‘Rejected’ are added to the original (existing) data table. ‘Good’ and ‘Rejected’ status is defined

by the output from both filters as well as the minimum number of replicates. Any sample

that has status Good in either primary or secondary filter, and that exceeds the minimum

number of replicates, will be interpreted as Good. All others will be tagged as Rejected.

Add mean matrix: When checked, the average of all non-rejected observations is

calculated and returned for each sample. Users also have the additional option to add the

standard deviation for each sample. Average and standard deviation are calculated only if

the number of non-rejected replicates exceeds the minimum number entered in Input data

tab.

Add median matrix: When checked, the median of all non-rejected observations is calculated and returned for each sample. Users also have the additional option to add the range for each sample. The median and range are calculated only if the number of non-rejected replicates exceeds the minimum number entered in the Input data tab.

Include column with number of replicates: When checked, the first column in output

matrices will be the number of replicates used for calculating the summary statistics.
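As an illustration of the summary options above, the following minimal Python sketch computes the mean, standard deviation, median and range over the non-rejected replicates of one sample, and returns nothing when too few replicates remain. The function name, the `min_replicates` parameter, the reading of "exceeds" as strictly greater, and the use of the sample standard deviation are assumptions made for this example; they are not taken from The Unscrambler®.

```python
from statistics import mean, median, stdev


def summarize_replicates(values, statuses, min_replicates):
    """Summarize the non-rejected replicates of one sample for one variable.

    values:   replicate measurements
    statuses: "Good" or "Rejected", one entry per replicate
    Returns None when the number of kept replicates does not exceed
    min_replicates ("exceeds" is read as strictly greater, an assumption).
    """
    kept = [v for v, s in zip(values, statuses) if s == "Good"]
    if len(kept) <= min_replicates:
        return None
    return {
        "n_replicates": len(kept),  # the optional first output column
        "mean": mean(kept),
        "std": stdev(kept) if len(kept) > 1 else 0.0,
        "median": median(kept),
        "range": max(kept) - min(kept),
    }


summary = summarize_replicates(
    [1.0, 2.0, 3.0, 10.0],
    ["Good", "Good", "Good", "Rejected"],
    min_replicates=2,
)
```

In this example the rejected replicate (10.0) is left out, so the statistics are based on three values only.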

4.15. Plot

4.15.1 Plot menu

The Plot menu has two modes. In Editor mode, which comes with the matrix editor, it offers generic plots of the data; in Viewer mode it gives, for each analysis, a list of plots related to that analysis.

The plot interpretations chapter provides more detailed information for generic plots.

Editor mode

Plot – Line

The Line plot displays one or more data vectors. When plotting from the Editor, mark the

row(s) or variable(s) (Columns) to be plotted; one sample/variable gives a one-dimensional

plot; specifying a range adds several line plots.

Ranges for samples as well as variables can be defined from the Edit menu with Edit - Define Range; see using define range.

For more information see the line plot documentation.

Plot – Bar

The Bar plot displays data vectors as bars.

For more information see the bar plot documentation.

Plot – Scatter

The Scatter plot shows two data vectors plotted against each other.

When plotting from the Editor, select the two rows or variables (columns) to be plotted

before using the Plot command.

For more information see the scatter plot documentation.

The 3-D Scatter plot shows three data vectors plotted against each other.

When plotting from the Editor, mark the three samples or variables to be plotted before

using the Plot command.

For more information see the 3-D scatter plot documentation.


Plot – Matrix

In this plot, a two-dimensional matrix is visualized. The plot is useful to get an overview of

the data before starting any analyses, as obvious errors in the data and outliers may be seen

at once. One may also want to take a look at this plot before deciding whether to scale or

transform the data for analysis.

For more information see the matrix plot documentation.

The Normal Probability plot shows the deviation from an assumed normal distribution of

the data vector. It is not possible to plot more than one row or column at a time in this plot.

Select the sample or variable to be plotted and use Plot – Normal Probability.

For more information see the normal probability plot documentation.

Plot – Histogram

This plot displays the distribution of the data points in a data vector, as well as the normal

distribution curve. A histogram gives useful information for exploring raw data. The height of

each bar in the histogram shows the number of elements within the value limits of the bar.

For more information see the histograms documentation.

The Multiple scatter plot shows a matrix of 2-D scatter plots for comparing several variables

in a flat view.

For more information see the multiple scatter plot documentation.

Viewer mode

After running an analysis, the Plot menu for the Viewer mode will change to a list of

available plots.

See the respective analysis method chapters for how to use and interpret these plots.

4.16. Tasks

4.16.1 Tasks menu

This menu is divided into three main groups of actions: Transform, Analyze and Predict.

Tasks – Transform

The Tasks – Transform option allows one to transform samples or variables so that the data have properties that are more suitable for analysis and easier to interpret. Bilinear models, e.g. PCA and PLS, basically assume linear data. The transformations should therefore result in a more symmetric distribution of the data and, if there are nonlinearities, a more linear behavior.

The Unscrambler® offers many spectral pretreatments like derivatives, smoothing,

normalization, and standard transformations. All these can be found under Tasks –

Transform.


There is also a Compute_General function to transform data using basic elementary and

trigonometric mathematical expressions, and the matrix calculator, which has options for

linear algebra, matrix operations and reshaping of data.

For more information and a list of available transformations, see the documentation for each transformation.

Tasks – Analyze

The Tasks – Analyze option provides multivariate analysis options consisting of:

Univariate statistics

Statistical tests

Multivariate Curve Resolution (MCR)

Cluster analysis

Principal Component Regression (PCR)

Partial Least Squares Regression (PLSR)

Support Vector Machine Regression (SVR)

L-PLSR

Linear Discriminant Analysis (LDA)

Support Vector Machine (SVM) classification

Analyze design matrices

Tasks – Predict

The Tasks – Predict option provides a means of applying a model to new samples for prediction, projection or classification.

Projection

Project new samples to determine similarity with samples in a PCA, PCR or PLSR

model.

Regression

Predict unknown samples from regression models.

Prediction

SVM Prediction

Classification

Classification of unknowns by applying SIMCA, LDA, or SVM models.

SIMCA classification

LDA classification

SVM classification


4.17. Tools

4.17.1 Tools menu

or Ctrl + Shift + M

Open an existing experimental design for modifications.

See the modify design dialog documentation.

or Ctrl + M

The Matrix calculator is used to perform simple linear algebra functions like matrix

multiplication, addition, division, inverse etc. and to reshape, append or combine two

matrices.

See the matrix calculator dialog documentation.

Tools – Report…

or Ctrl + R

A tool to create reports as PDF documents with plots and data.

See the report generator dialog documentation.

This command displays the audit trail for the active project. The audit trail is a log of actions

by a user, showing a date and time stamp for the actions.

See the audit trail dialog documentation.

Please refer to the plug-in specific help documentation for these add-on options. Contact CAMO Software for more details.

Tools – Options…

This dialog can be used to change the appearance of the data editor or viewer, as well as

other options in The Unscrambler®. Default numeric formats and plot settings can be

defined here.

See the options dialog documentation for details.


The audit trail provides a record of the actions performed by different users. Audit trails are required for maintaining data integrity and are a requirement of Good Manufacturing Practice (GMP) and the US FDA’s 21 CFR Part 11 requirements for electronic signatures.

Caution: Audit trails are not a substitute for well-documented work.

For each operation, The Unscrambler® keeps track of:

Date

Time Zone

Time

User name

Action.

The types of actions that are tracked in the audit trail include:

- Creation of the project

- Import of data

- Transformation: compute functions, smoothing, MSC, derivative, etc.

- Formatting: sorting, delete

- Analysis: statistics, PCA, regression, prediction, etc. with detailed model settings.

Audit trail dialog

In Non-Compliance mode, the audit trail can be emptied by selecting the Empty button in

the dialog.

The audit trail can be disabled from the Tools - Options under the General tab.

When in Compliance Mode, the Audit Trail cannot be emptied. It can only be saved as a non-editable PDF document for further printing, if desired.

The Audit Trail for Compliance Mode is shown below. Also, in Tools - Options the Audit Trail

cannot be disabled in Compliance Mode.

Audit Trail in Compliance Mode


Matrix calculator is used for simple linear algebra like matrix multiplication, addition,

division, inverse, etc. and matrix shaping. The options available are:

Binary operations: Arithmetic operations on two matrices

Reshape a single matrix

Combine two matrices

The calculator tool should be used only with matrices that are purely numeric. Columns containing missing values are kept out of the calculation, as are columns with text or category entries. Whether an operation is possible on the remaining matrix contents depends on the usual dimension requirements of that matrix operation.

See also the Compute_General transform that can do calculations on samples and variables

using basic mathematical expressions.

Matrix calculator dialog


Unary operations are arithmetic operations computed on a single matrix.

The Moore–Penrose inverse of an arbitrary matrix (including singular and rectangular) has

many applications in statistics, prediction theory, control system analysis, curve fitting and

numerical analysis.

In mathematics, and in particular linear algebra, the pseudoinverse A+ of an m × n matrix A is

a generalization of the inverse matrix.

A common use of the pseudoinverse is to compute a ‘best fit’ (least squares) solution to a

system of linear equations that lacks a unique solution. The pseudoinverse is defined and


unique for all matrices whose entries are real or complex numbers and can be calculated

using the singular value decomposition.

In linear algebra, the singular value decomposition (SVD) is an important factorization of a

rectangular real or complex matrix, with many applications in signal processing and

statistics. Applications which employ the SVD include computing the pseudoinverse, least

squares fitting of data, matrix approximation, and determining the rank, range and null

space of a matrix.
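The link between the SVD, the pseudoinverse and least squares fitting described above can be sketched with NumPy. This is an illustration only, not the code used by the Matrix calculator; the matrix `A`, the vector `b` and the tolerance rule are assumptions chosen for the example.

```python
import numpy as np

# Overdetermined system A x ~ b: three equations, two unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Pseudoinverse: invert only the singular values above a small tolerance.
tol = max(A.shape) * np.finfo(float).eps * s.max()
s_inv = np.array([1.0 / v if v > tol else 0.0 for v in s])
A_pinv = Vt.T @ np.diag(s_inv) @ U.T

# Least squares ("best fit") solution and numerical rank of A.
x = A_pinv @ b
rank = int(np.sum(s > tol))
```

The same pseudoinverse is returned by `np.linalg.pinv(A)`, which can be used to check the hand-built version.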

QR decomposition

QR decomposition (also called QR factorization) of a matrix allows for the solution of linear systems of equations.

It is a decomposition of the matrix into an orthogonal matrix (Q) and a right triangular matrix

(R). QR decomposition is the basis for a particular eigenvalue algorithm, the QR algorithm.
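As a minimal sketch of how a QR decomposition can be used to solve a linear system, the NumPy example below factors a small matrix and then solves R x = Qᵀb by back substitution. The matrix and right-hand side are arbitrary example values, and the code is not meant to represent the internal implementation of The Unscrambler®.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Q has orthonormal columns, R is upper (right) triangular, and A = Q @ R.
Q, R = np.linalg.qr(A)

# A x = b  becomes  R x = Q^T b, solved from the last row upwards.
y = Q.T @ b
x = np.zeros_like(y)
for i in range(len(y) - 1, -1, -1):
    x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
```

Because R is triangular, no explicit matrix inverse is needed.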

Element-by-element operations

Array arithmetic operations that are carried out element by element on one matrix.

X’X

Product of the transpose of X with X.

1./X

Reciprocal of the individual matrix elements.

X.*X

Element-by-element product of X with itself, i.e. the square of the elements of X.
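For readers who want to check these single-matrix operations outside the program, the NumPy equivalents are sketched below. The example matrix is arbitrary, and the mapping from the dialog notation to NumPy operators is an assumption made for illustration.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [4.0, 5.0]])

gram = X.T @ X        # X'X : transpose of X times X (matrix product)
reciprocal = 1.0 / X  # 1./X: reciprocal of each element
squared = X * X       # X.*X: element-by-element product of X with itself
```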

Two matrix operations

Binary operations are arithmetic operations computed on the data and an operand, as defined by the rules of linear algebra:

Addition: X+Y

Subtraction: X-Y

Multiplication: X*Y

Matrix division: X*inv(Y)

Element by element division: X/Y

The calculations that are possible depend on dimensionality of the matrices X and Y that

have been selected in the scope.

Addition, subtraction and the Hadamard (element-by-element) product require X and Y to have the same number of rows and columns; alternatively, Y can be a row or column vector whose length matches the corresponding dimension of X.
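The dimension rule for element-by-element operations can be illustrated with NumPy, whose broadcasting is assumed here to behave in a comparable way for row and column vectors. The matrices are example values only.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])           # 2 x 3
Y_same = np.ones((2, 3))                  # same shape as X
y_row = np.array([[10.0, 20.0, 30.0]])    # 1 x 3 row vector
y_col = np.array([[1.0], [2.0]])          # 2 x 1 column vector

added = X + Y_same         # same number of rows and columns
minus_row = X - y_row      # row vector applied to every row of X
hadamard_col = X * y_col   # column vector applied to every column of X
```

An operand with incompatible dimensions, for example a 3 x 2 matrix, is rejected with an error.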

The X and Y matrices in the calculations should not be confused with inputs and outputs of a

model.

Reshape matrix

Change dimensions of a two-dimensional matrix.

One can rearrange the elements of a matrix to change the number of rows and columns.

This is especially useful when importing data where a matrix has been stored as a one-

dimensional list of values.
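A short NumPy sketch of reshaping a one-dimensional list of values into a matrix follows. Whether the values are filled row by row or column by column in The Unscrambler® is not stated here, so both orders are shown; the data are arbitrary.

```python
import numpy as np

values = np.arange(1, 13)                   # 12 values stored as a list
by_rows = values.reshape(3, 4)              # 3 x 4, filled row by row
by_cols = values.reshape(3, 4, order="F")   # 3 x 4, filled column by column
```

The number of rows times the number of columns must equal the number of values; otherwise the reshape is not possible.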


A user can combine matrices with either of the two options:

Augment X|Y: column-wise combination of matrices, e.g. 4x2 and 4x2 gives 4x4

Append Y to X: row-wise combination of matrices, e.g. 4x2 and 4x2 gives 8x2

Augment requires X and Y to have the same number of rows. Append requires X and Y to

have the same number of columns.
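The two ways of combining matrices correspond to horizontal and vertical stacking. The NumPy sketch below uses two arbitrary 4 x 2 matrices; it illustrates the dimension requirements and is not the program's own code.

```python
import numpy as np

X = np.zeros((4, 2))
Y = np.ones((4, 2))

augmented = np.hstack([X, Y])  # Augment: same number of rows, columns side by side
appended = np.vstack([X, Y])   # Append: same number of columns, rows of Y below X
```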

These are binary operations in the shaping tab available only when the Binary operand box is

checked. This requires that the values be numeric. If there are columns of non-numeric data,

they will be kept out of the calculation. If there are missing values in either matrix, the rows

(columns) containing them will be kept out of the calculation.

This menu option allows the user to define user preferences for the viewer, general and

editor settings, to change the appearance and performance of The Unscrambler®.

General

Select temporary folder

This is the location where The Unscrambler® stores temporary results during

calculations. These files will be removed when exiting the application.

Use audit trail


Use this option to enable/disable the audit trail. Note: This option is not active when

the program is installed in Compliance Mode.

Prompt user to view plots

When checked, user will be prompted to view the model plots when opening a

project, after training a model and after predictions. This option will be unchecked if

the ‘Do not ask me again’ option is selected in the View Plots dialog.

Viewer

These options allow a user to set the default appearance properties of plots at the

application level. The settings can still be customized and changed at the plot level by editing

the properties for a given plot.

The following are properties that can be set from the Viewer:

Antialiasing

Use this option to set antialiasing in all analysis-generated plots.

Point label visible

Use this option to make point labels visible by default on plots. Point labels can be toggled on/off from a plot.

Line plot point visible

Use this option to make the points visible by default on line plots. The points can be toggled on/off from a plot.

Point size

Use this option to set the default size of points. This can be changed for individual plots under Properties.

Line size

Use this option to set the default line size. This can be changed for individual plots under Properties.


Use this option to set the default size of points when applying sample grouping. This can be changed for individual plots in the Sample Grouping dialog.

Crosshair axes color

Use this option to set the default color for plot axes. This can be changed for

individual plots under Properties.

Editor

These options allow a user to set the default properties of worksheet view at the global

level. This option will be available only when a data matrix is present in the project.

General

This tab provides the settings for defining the maximum number of categories

(default - 50, maximum - 100000), maximum times to undo stack (default - 10,

maximum - 5000) and file size to disable preview (default - 10 MB)

Format

This tab provides the settings for Numeric and Date time display format.

Color

This tab provides the settings for color of Row header, Column header, Category and

Matrix name.

Font

This tab provides the font settings for Row header, Column header and Matrix name

The Report Generator is a tool to generate customized reports.


To access the Report Generator, select Tools – Report…. The Report generator dialog

appears and gives access to all matrices and plots in the current project. Add plots and

matrices in the field Included in report to create a customized report.

To add a matrix use the Data tables field and:

Either select a data matrix that is in the Navigator as a node from the drop-down list

Or select one from an analysis using the Select result matrix button

To add a plot, select one in the Available plots list and move it to Included in report with the

right arrow.

Generate Report Dialog


At the bottom of the dialog are three tabs where the user can choose settings for the

security, report content, and page setup.

Security

Passwords can be enabled to limit the access for editing and viewing the report. The

user can highlight password protected editing of reports.

Printing, editing, copying, or annotating can be disabled for added security.


Content

Under the content tab the user can select to append notes, and/or use the editor

format for numbers.

Report Generator Content


Page Setup

On the Page Setup tab, a user can define the paper size (A2, A3, A4, letter, legal),

and orientation (portrait or landscape).

Report Generator Page setup

Save the report and close the dialog using the appropriate buttons.

All reports will be saved in PDF format with a file name, and in a location given by the user.

4.18. Help

4.18.1 Help menu

The help menu provides access to help topics and licensing-related information in The

Unscrambler®.

Help – Contents

or F1

Open help viewer for browsing.

See the How to use help documentation.

Help – Search

Ctrl+F1

Open help viewer for searching.


Change the current license of The Unscrambler® by typing in a new activation key. Use this

feature for instance to upgrade from a trial installation to a full version of The Unscrambler®.

See the modify license dialog documentation.

Manage user profiles.

See the user setup dialog documentation.

Help – About

Shows:

License holder and activation key

Addresses to CAMO Software offices

Additional information such as build number and date

A list of all upgrades and plugins installed

The System Info button will open the “Windows System Information” utility.

Use this dialog to activate or modify a license for The Unscrambler®.

Note that this requires certain privileges and may, in regulated environments, require the

intervention of a system administrator.

Press the Obtain button to request the activation key from the CAMO Software web site.

The activation key will be sent by email.

Contact a sales representative by phone or fax if the computer is not connected to the

Internet. Note that the machine ID shown in this dialog would be required.


Company name and Email address fields become active when the activation key is for a

time-limited or perpetual license.

Contact details can be found at http://www.camo.com/contact

From version 10.2 of The Unscrambler®, the User Setup is only available in the Non-Compliant mode of operation. For details of the Compliant and Non-Compliant modes of operation, consult the installation guide or refer to the following sections:

Login

Compliance

Users are recommended to create a login and identification, which will not only secure their

work with The Unscrambler®, but provide valuable information to keep track of actions

taken on data, through the audit trail, where the user name is logged with any action.

Use the menu option Help - User Setup… to access the dialog.

User setup dialog

The above image shows an example of a completed setup. Enter the pertinent information

in the provided fields and then click Save.

The following is a brief explanation of the fields,

User Name

This is the name that will be shown in the login dialog each time the program is

started.

First Name


Last Name

The surname of the user.

Initial

Usually the first letters of the first and last names entered.

Location

Here a user can enter the site/geography/company name associated with the

license.

Password Management

By checking the Password required at login option, the user will be required to enter a valid user name and password to use the software.

The functions of this option are listed below:

Enter a Password

A user is required to enter a password of any size and detail into this field.

Re-enter Password

This field requires the user to confirm the password, so that the two password entries are consistent. If they are not, the following warning will be displayed:

Password mismatch warning

Security Question

Select a question to answer from a list of pre-defined questions.

Answer

Enter the answer to the question here.

If a password is forgotten, it can be retrieved provided the answer to the security question is known. See the section on Login for more details.

Contact CAMO Software for information about how to register more than one user.

Contact details can be found at http://www.camo.com/contact


5. Import

5.1. Importing data

This section describes how to import data from supported instruments and software utilities

into The Unscrambler®.

The Unscrambler® can import the following data formats:

Symbol Vendor


The Unscrambler® X

The Unscrambler® 9.8 and earlier versions1


NetCDF

JCAMP-DX

Matlab data files

Instruments

Brimrose

OPUS (Bruker Optics)

CLASS-PA & SpectrOn (Guided Wave)

Indico (ASD)

NSAS (FOSS NIRSystems)

OMNIC™ (Thermo)

Varian

PerkinElmer

RapID

DeltaNu

VisioTec

Interface protocols

Databases

Other interfaces such as OPC and MyInstrument are supported. Contact CAMO Software for

details. http://www.camo.com/contact

Choose which kind of file format to import from the File – Import Data submenu, select the

files to import and click OK.

Dialogs differ according to the type of file and the amount of user input required, allowing

the user to select which matrices to import. It also provides an option to preview data

before import.

File formats are recognized based on the file name extension. If the file(s) to be imported do not have the expected extension, the extension may have to be changed manually in a file manager.

Files can also be imported by dragging them from the file manager and dropping them on

The Unscrambler® application window.


Instead of going via the File – Import Data menu, data can be imported by using drag and

drop or copy and paste. Simply select the file/data in another Windows application like Excel

and drag it into the project navigator or the workspace of The Unscrambler®.

One can select whether to insert the data as columns or rows. The columns or rows are

appended at the end of the existing data table.

One may also overwrite the existing data in the Editor. The area that is going to be

overwritten is marked by a frame.

The file names are given in glob notation: “*” means any number of characters, “?” any single character, and “[ABC]” any one of A, B or C.
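The glob patterns can be tried out with Python's standard `fnmatch` module, which follows the same three rules. The file names below are made up for the example.

```python
from fnmatch import fnmatchcase

names = ["scan1.txt", "scan12.txt", "scanA.txt", "scanD.txt", "notes.csv"]

# "*" matches any number of characters
any_txt = [n for n in names if fnmatchcase(n, "*.txt")]
# "?" matches exactly one character
one_char = [n for n in names if fnmatchcase(n, "scan?.txt")]
# "[ABC]" matches one of the listed characters
abc_only = [n for n in names if fnmatchcase(n, "scan[ABC].txt")]
```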

5.2. ASCII

5.2.1 ASCII (CSV, text)

Type of data

Array

Software

ASCII (American Standard Code for Information Interchange) is a character encoding

scheme and the de-facto file standard supported by many applications.

File name extension

*.csv, *.txt, *.*

How to use it

ASCII, CSV (character separated values) and tabular text are common names for essentially

the same format: Data saved as a plain text file.

The Unscrambler® supports the following ASCII formats:


Files with the comma used for decimal point

Tab delimited files

Space delimited files

Custom string used as delimiter, e.g. 1.4**4.5**6.7**8.9 (“**” is given as the custom separator)
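To make the delimiter variants concrete, the following Python sketch splits one line of text on a given separator and optionally treats the comma as the decimal point. The function `parse_line` and its parameters are hypothetical names introduced for this illustration; they do not correspond to anything in The Unscrambler®.

```python
def parse_line(line, separator, decimal_comma=False):
    """Split one text line on `separator` and convert the fields to floats."""
    fields = line.split(separator)
    if decimal_comma:
        # "1,4" is read as 1.4 when the comma is the decimal point
        fields = [f.replace(",", ".") for f in fields]
    return [float(f) for f in fields]


custom = parse_line("1.4**4.5**6.7**8.9", "**")
tabbed = parse_line("1,4\t4,5\t6,7", "\t", decimal_comma=True)
```

Note that a file using the comma as decimal point cannot also use the comma as separator, which is why such files are typically tab or semicolon delimited.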

ASCII files with different formats can be imported into The Unscrambler® through the File –

Import Data – ASCII menu. Single file or batch import is allowed.

Single file import

When a single text file (e.g. .txt, .csv, …) is selected for import, the following dialog is used.

ASCII import dialog

Data delimiters

Numbers may be delimited by different characters in different ASCII files. Specify which

delimiter is used in the file to be imported, in the field Separator. The choices are

Comma


Semicolon

Space

Tab

Custom

Note: Carriage Return, Line Feed and Tabulation are not among the available

delimiters in the dialog. They are default item delimiters, and will automatically be

recognized as such. Do not specify them in the Custom field!

There is an additional list of check box options below:

Interpret double quotes such that separators within double quotes are not

recognized as such

Treat consecutive separators as one

Consider multiple identical separator characters as one.

Normally used for tabular text files that have been aligned into columns using

spaces.

Data Type

There are three options available for data import

Auto - The Unscrambler® will import individual columns as text or numeric data based on the values in the first row.

Numeric - The Unscrambler® will import all columns as numeric. Cells with non-

numeric content will be lost.

Text - The Unscrambler® will import the entire table as text data type.

Individual variables can be converted to other data formats after import using Edit – Change

Data Type.

Skip Rows

This option allows a user to skip a predefined number of header rows during the import, using the number spin box.

Preview

This option allows a user to turn on/off a preview of the tabular data before import.

Headers

One can add multiple rows or columns as headers.

Sample and/or variable names can be selected using the Headers options; multiple columns

and rows can be selected for variable ID and sample ID, up to a maximum of 5 headers.

The user can select rows and columns from the data preview table while importing. One can

import all of a table, or just portions of it.

Note: If names are not enclosed in quotes in the ASCII file, they should not contain

any spaces if “space” is selected as the separator. (See Separators above.)

Missing data

Any text string entries in a numeric column will be imported as empty or missing data.


Make sure that Treat consecutive separators as one is unchecked when importing ASCII files

that have empty entries for missing data, such as:

s4,0.618,,0.6022
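The effect of the Treat consecutive separators as one option on such a line can be sketched in Python. The helper names `split_fields` and `to_number` are hypothetical; the sketch only mimics the described behavior and is not the importer itself.

```python
import re


def split_fields(line, separator=",", merge_consecutive=False):
    """Split a line into fields, optionally merging runs of the separator."""
    if merge_consecutive:
        return re.split(f"(?:{re.escape(separator)})+", line)
    return line.split(separator)


def to_number(field):
    """Return a float, or None (missing) for empty or non-numeric text."""
    try:
        return float(field)
    except ValueError:
        return None


line = "s4,0.618,,0.6022"
kept = split_fields(line)                             # empty field preserved
merged = split_fields(line, merge_consecutive=True)   # empty field lost
values = [to_number(f) for f in kept[1:]]
aligned = split_fields("1.0   2.0  3.0", " ", merge_consecutive=True)
```

With the option unchecked, the empty entry stays in its column and is imported as a missing value; with the option checked, the later values would shift one column to the left. Merging is useful, however, for text aligned into columns with spaces, as in the last line.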

Batch import

Often spectrometers output spectra in individual files, such that each file contains a single

spectrum (with or without headers). A selection of such single spectrum text-files can be

imported in a single step in The Unscrambler®, simply by selecting multiple files to open. A

simplified dialog is used for batch import.

Batch import dialog

Each spectrum is imported and appended to the previous spectra row-wise. If spectra are

given as a single row in the files, this means that each spectrum will become a single row in

the imported data table. If spectra are given column-wise (i.e. separated by carriage

return/newline), they should be transposed using the Transpose the data before import

check-box.

The sample file-names are included in a row-header in the imported table.
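The batch import described above can be pictured with the following Python sketch, which reads several single-spectrum text files and stacks them row-wise, using the file names as row headers. The functions `read_spectrum` and `batch_import` are hypothetical, and the sketch simply flattens each file, so that row-wise and column-wise files both end up as one row; in the program this is controlled by the Transpose the data before import check-box.

```python
import tempfile
from pathlib import Path


def read_spectrum(path, separator=","):
    """Read one single-spectrum text file into a flat list of floats.

    Works for a spectrum stored as one delimited row or as one value per line.
    """
    values = []
    for line in Path(path).read_text().splitlines():
        values.extend(float(f) for f in line.split(separator) if f.strip())
    return values


def batch_import(paths, separator=","):
    """Return (row_headers, table) with one row per file."""
    headers, table = [], []
    for path in paths:
        spectrum = read_spectrum(path, separator)
        if table and len(spectrum) != len(table[0]):
            raise ValueError(f"{path}: expected {len(table[0])} values")
        headers.append(Path(path).name)  # file name becomes the row header
        table.append(spectrum)
    return headers, table


# Demonstration with two temporary files: one row-wise, one column-wise.
with tempfile.TemporaryDirectory() as folder:
    row_file = Path(folder, "a.txt")
    row_file.write_text("1.0,2.0,3.0\n")
    col_file = Path(folder, "b.txt")
    col_file.write_text("4.0\n5.0\n6.0\n")
    headers, table = batch_import([row_file, col_file])
```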

See section on single file import above for general import options.

5.3. BRIMROSE

5.3.1 Brimrose

Type of data/instrument

NIR

Data dimensions

Multiple spectra

Instrument/hardware

Snap!32 v2.03 (BFF3)

Snap!32 v3.01 (BFF4)

Vendor

Brimrose

File name extension

*.dat


How to use it

This option allows for the import of BFF3 and BFF4 data from Brimrose instrument files. The

BFF3 file is created from Snap!32 v2.03 while the BFF4 file is created from Snap!32 v3.01.

One or several Brimrose files (BFF3 or BFF4) can be imported into a project in The

Unscrambler®.

Select the files to import from the file list in the Brimrose Import dialog or use the Browse

button to display a list of available files. The different files must have the same number of X-

variables to allow simultaneous import.

Brimrose Import

The source files may contain one or more samples per file; multiple selections allow several

samples to be imported at the same time.

Multiple selections

Select one or more files to import by checking the check box next to each file, or by using

Auto select matching spectra.

The contents of all the selected spectra will be merged to create one data matrix during import.

Deselect all

Clear the current selection by unselecting all samples.


Preview spectra

Check to review a plot of selected spectra before importing.

Sample naming…

Include sample names or sample numbers in the resulting data table.

Sample names will only be imported if they are present in the source file.

The Auto select matching spectra preview option allows the automatic selection of all data file(s) with the same wavelength ranges as the current selection. A screenshot of the Brimrose Import dialog with the auto select option chosen is provided below.

Once Auto select matching spectra has been checked it will select only those files that have

the same number of variables.

Sorting data

The file name, number of samples, number of X-variables, wavelengths for the first and last

X-variables, and step (increase in wavelength), are displayed for each file.


Step is the increment in wavelength (or wave number) between two successive variables.

The following relationship should be true:
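The relationship itself is not reproduced in this text. Assuming evenly spaced X-variables, it would presumably take the following form, where the symbols are introduced here for illustration: n_X is the number of X-variables, and λ_first and λ_last are the wavelengths of the first and last X-variables.

```latex
\lambda_{\mathrm{last}} = \lambda_{\mathrm{first}} + (n_X - 1) \times \mathrm{Step}
```

For example, 101 variables starting at 1000 with a step of 2 would end at 1200.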

The data table resulting from the import can be sorted based on any of these columns in the

file list: Click on a column header to set sort order, and a second time to reverse the sort

order.

Preview

Preview spectra displays a line plot of the files that have been selected for import.

5.4. Bruker

5.4.1 OPUS from Bruker

Type of data/instrument

FT-IR, FT-NIR, Raman

Data dimensions


Single spectra

Instrument/hardware

—

Software

OPUS

Vendor

Bruker

File name extension

*.0x, *.1

How to use it

One or several spectra from OPUS data files generated by Bruker instruments using OPUS

software can be imported. The import supports 2-D spectral files. When multiple spectra are

contained in a file, the preference is to import the normalized spectrum. However, if a file contains a single spectrum (sample or reference alone), then that spectrum will be imported. Data

files containing 3-D spectra are not supported.


In the OPUS Import dialog box, one can choose a folder where OPUS files are stored. A list of

OPUS files from which data can be imported is then displayed.

Note: Multiple files that vary in their spectral range and resolution cannot be

imported together.

Select the files to import from the file list in the dialog OPUS Import or use the Browse

button to get a list of available files. The different files must have the same number of X-

variables to allow simultaneous import.

OPUS Import


Multiple selections

Select one or more files to import by checking the check box next to each file, or by using

Auto select matching spectra.

The contents of all the selected spectra will be merged to create one data matrix during import.

Deselect all

Clear the current selection by unselecting all samples.

Preview spectra

Check to review a plot of selected spectra before importing.

Sample naming…

Include sample names or sample numbers in the resulting data table.

Sample names will only be imported if they are present in the source file.

Interpolate


Checking the Interpolate option allows the import of data with different starting and ending points, provided the number of points is the same in all sets to be imported.

When the % button is selected, the following dialog appears allowing a user to set

the Tolerance for allowing data with different start or end points to be imported.

Interpolate Tolerance Dialog

The Auto select matching spectra preview option provides automatic selection of all data

file(s) with the same wavelength ranges as the current selection. This dialog is used for

import of spectral data from instruments with OPUS file format. A screenshot of the OPUS

Import dialog with the auto select option chosen is given below.

Once Auto select matching spectra has been checked, the files in the list having the same

number of variables will be selected.

Use the Interpolate option to import data with different start or end points.
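How spectra with slightly different start or end points might be brought onto a common axis can be sketched with linear interpolation in NumPy. This is a hypothetical illustration: the function `align_spectrum`, the linear interpolation method, and the interpretation of the tolerance as a percentage of the reference span are all assumptions, not a description of the OPUS import algorithm.

```python
import numpy as np


def align_spectrum(ref_axis, axis, values, tolerance_pct=1.0):
    """Linearly interpolate `values` from `axis` onto `ref_axis`.

    Both axes are assumed to be increasing. Raises ValueError if the point
    counts differ, or if the start/end points deviate from the reference by
    more than tolerance_pct percent of the reference span (an assumed
    reading of the Tolerance setting).
    """
    ref_axis, axis, values = map(np.asarray, (ref_axis, axis, values))
    if ref_axis.size != axis.size:
        raise ValueError("number of points differs")
    limit = abs(ref_axis[-1] - ref_axis[0]) * tolerance_pct / 100.0
    if abs(axis[0] - ref_axis[0]) > limit or abs(axis[-1] - ref_axis[-1]) > limit:
        raise ValueError("start or end point outside tolerance")
    # np.interp does not extrapolate: points outside `axis` take the edge value.
    return np.interp(ref_axis, axis, values)


ref = np.linspace(1000.0, 2000.0, 5)   # reference axis: 1000, 1250, ..., 2000
shifted = ref + 5.0                    # same number of points, offset by 5
spectrum = 0.001 * shifted             # a straight line, easy to check
aligned = align_spectrum(ref, shifted, spectrum, tolerance_pct=1.0)
```

Inside the overlapping range the interpolated values follow the original line exactly; the first point lies outside the shifted axis and therefore takes the nearest edge value.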


Sorting data

The file name, number of X-variables, wavelengths for the first and last X-variables are

displayed for each file.

The data table resulting from the import can be sorted based on any of these columns in the

file list: Click on a column header to set sort order, and a second time to reverse the sort

order.

Preview

Preview spectra displays a line plot of the files that have been selected for import.

5.5. DataBase

5.5.1 Databases

Type of data

Array

Software

How to use it

This feature allows a user to import data from a wide selection of databases that are

ODBC/ADO compliant.

Data can be imported from a database into a project in The Unscrambler®.

Since there are many possible database platforms and the data structure may be complex,

the user must go through several tabs in order to specify the import:

Provider: Database provider to import from

Connection: Server address and user authentication

Advanced: Network settings

All: Initialization properties

Note: The Data Link Properties dialog is a standard Windows dialog. Depending on the local language setup, this dialog may be displayed in a language other than English. The name of the dialog and the text of the fields will then be different, but the layout and meaning of all fields will be the same as described hereafter. For additional information, click Help; this will start the Microsoft help system related to the current sheet in the Data Link Properties dialog.

The next two sections describe the standard stages to go through in order to establish a

connection from The Unscrambler® to a database.

In the Provider tab of the Data Link Properties dialog, select the database provider to

import from.

Data Link Properties, Provider sheet

In the Connection sheet of the Data Link Properties dialog, locate the desired database from

the proper server and specify the security settings for logging on to the database.

Data Link Properties, Connection sheet

Specify the source of data: choose between the following options:

Use data source name

Select from the list, or type, the ODBC data source name (DSN) to access. More sources can be added through the ODBC Data Source Administrator. Refresh the list by clicking Refresh.

Use connection string

Type or build an ODBC connection string instead of using an existing DSN.

Enter information to log on to the server: type the User name and Password to use for authentication when logging on to the data source. Ticking the Blank password box enables the specified provider to return a blank password in the connection string. Tick Allow saving password to allow the password to be saved with the connection string.

Enter the initial catalog to use: type in the name of the catalog (or database), or

select from the drop-down list.

Once everything is specified, press Test Connection to check whether contact with the

desired database has been successfully established. If the connection fails, ensure that the

settings are correct. For example, spelling errors and case sensitivity can cause failed

connections.
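
For readers unfamiliar with ODBC connection strings, the following Python sketch shows how such a string is typically assembled from attribute/value pairs separated by semicolons. It is an illustration only: the helper function is hypothetical, and the DSN, driver, server, catalog and credentials are placeholder values. DSN, DRIVER, SERVER, DATABASE, UID and PWD are common ODBC attribute names, but the attributes actually required depend on the driver in use.

```python
def build_connection_string(**attributes):
    """Join ODBC attribute/value pairs into 'KEY=value;KEY=value' form."""
    parts = []
    for key, value in attributes.items():
        if value is None:
            continue  # skip attributes that were not supplied
        text = str(value)
        # Values containing ';' are wrapped in braces so they are read as one value.
        if ";" in text and not text.startswith("{"):
            text = "{" + text + "}"
        parts.append(f"{key.upper()}={text}")
    return ";".join(parts)


# Placeholder example using an existing data source name (DSN).
dsn_string = build_connection_string(dsn="LabData", uid="analyst", pwd="secret")

# Placeholder example naming a driver, server and initial catalog instead of a DSN.
driver_string = build_connection_string(
    driver="{SQL Server}", server="labserver", database="Spectra"
)

print(dsn_string)
print(driver_string)
```

Because attribute names and values must be spelled exactly as the driver expects, a misspelled name in such a string is a frequent cause of a failed Test Connection.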

Go to the Advanced tab to choose network settings, set the connection timeout, and set access permissions.

Data Link Properties Advanced Tab

The All tab is provider-specific and displays only the initialization properties required by the

selected OLE DB provider.

Data Link Properties All Tab

To edit a value, select it, and click the Edit Value… button, which opens the dialog where a

property can be changed.

From the List of tables, select the data table to access. The List of fields to the right is then

updated accordingly.

Select database tables

Press the Next button to preview the data and proceed to complete the import.

Preview data before import

The data types will be detected for individual columns and imported as numeric values or

text.
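
The idea of per-column type detection can be illustrated with a short Python sketch. The rule used here is an assumption made for illustration (a column is numeric only if every non-empty value parses as a number); the manual does not document the rule the program applies, and the function name and sample rows are hypothetical.

```python
def detect_column_type(values):
    """Return 'numeric' if every non-empty value parses as a float, else 'text'."""
    seen_value = False
    for value in values:
        if value is None or str(value).strip() == "":
            continue  # empty cells do not decide the type
        seen_value = True
        try:
            float(value)
        except (TypeError, ValueError):
            return "text"
    return "numeric" if seen_value else "text"


# Made-up rows as they might be returned from a database table.
rows = [
    ("batch-01", "12.5", 3),
    ("batch-02", "13.1", 4),
    ("batch-03", "", 5),
]

# Transpose the rows into columns and classify each column.
column_types = [detect_column_type(column) for column in zip(*rows)]
print(column_types)
```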

5.6. DeltaNu

5.6.1 DeltaNu

Type of data/instrument

Raman spectrometer

Data dimensions

single vector spectrum or multiple spectra in an array

Instrument/hardware

NuSpec software

Pharma-ID Raman spectrometer

Vendor

DeltaNu

File name extension

*.dnu, *.lib

How to use it

This option allows for the import of data files generated by DeltaNu Raman spectrometers, such as the Pharma-ID, using the NuSpec software. A file may contain a single spectrum or multiple spectra. The file name extension is typically .dnu or .lib, but files with other extensions can also be imported.

From the File – Import Data menu, select DeltaNu. The DeltaNu dialog box displays a list of

files from which one can import data generated using the NuSpec software from DeltaNu. If

necessary, click the Browse button to access files from a different folder.

DeltaNu import

Multiple selections are possible, by checking the box next to more than one file. The

selected samples must be of the same size (variables must match).

Multiple selections

Select one or more files to import by checking the check box next to each file, or by using

Auto select matching spectra.

The contents of all the selected spectra will be merged to create one data matrix during

import.

Deselect all

Clear the current selection by unselecting all samples.

Preview spectra

Check to review a plot of selected spectra before importing.

Sample naming…

Include sample names or sample numbers in the resulting data table.

Sample names will only be imported if they are present in the source file.

The Auto select matching spectra preview option allows the automatic selection of all data

file(s) with the same wavelength ranges as the current selection. This dialog is used by

spectral data imports from instrument formats such as DeltaNu, GRAMS, OPUS, etc.

Sorting data

The file name, number of X-variables, wavelengths for the first and last X-variables, and step

(increase in wavelength), are displayed for each file.

The data table resulting from the import can be sorted based on any of these columns in the

file list: Click on a column header to set sort order, and a second time to reverse the sort

order.

Preview

Preview spectra displays a line plot of the files that have been selected for import.

5.7. Excel

5.7.1 Microsoft Excel spreadsheets

Type of data

Array (spreadsheet)

Software

Excel (part of Microsoft Office)

Vendor

Microsoft

File name extension

*.xls, *.xlt, *.xlsx, *.xlsm

How to use it

Data in Excel Workbooks from Microsoft Excel 97 and newer can be imported.

Binary Excel 2007 workbooks with file name extension .xlsb are not supported.

The Excel Workbook files must have the file name extensions .xls or .xlsx to be

recognized by The Unscrambler®.

Note: The Unscrambler® supports the OOXML format (.xlsx file name extension)

with more than 255 columns.

Note: Users should remove any formatting (particularly borders) from spreadsheets before importing into The Unscrambler®. To avoid data type recognition problems on import, make sure there are no empty cells in the first row of values.

From the menu choose File – Import Data – Excel… to select an Excel file to open. Once a file

has been selected the Excel Preview dialog opens. An Excel workbook may contain several

worksheets. Select the worksheet that contains the matrix to be imported from the drop-

down list Select sheet or named range.

Once the sheet or named range is selected, the data preview window will open. The screenshot below shows the Excel preview window, which enables the user to select the desired data sheet, the headers, and the rows and columns of data to import.

Excel Preview

All ranges that have been defined with names in the selected Excel sheet are listed under Range names. Multiple row and column headers can be specified in headers, up to a maximum of 5 headers.

The sheet range is updated automatically if a range name is selected. The range can also be

entered manually, specifying the Rows and Columns, e.g. 2:1. All cells lying within this

rectangle are then imported.

Select the appropriate ranges as described above for the data values from the selection

option, as well as for the rows/sample and columns/variable names, if relevant.

Columns and rows can be removed from the import by selecting them within the preview

grid and pressing Del on the keyboard.

Data type

If the worksheet contains non-numeric values or a mixture of numeric and non-numeric

values, they can be imported. The radio button Auto can be selected to detect the data

format in the Excel spreadsheet and maintain that on import. If all the data are non-numeric,

they can be imported as text by selecting the radio button text. If the spreadsheet has a mix

of text and numeric values, and one data type is selected, only data of that type will be

imported.

Skip lines

If there are rows of data at the top of the spreadsheet that you do not want to import, you

can use the Skip lines option to enter the number of lines from the top to skip.
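
The effect of Skip lines, together with the earlier advice to avoid empty cells in the first row of values, can be sketched in Python as follows. This is an illustration on plain lists, not a description of how the program reads Excel files; the function name and the sample rows are hypothetical.

```python
def prepare_rows(rows, skip_lines=0):
    """Drop the first skip_lines rows and report empty cells in the first kept row."""
    kept = rows[skip_lines:]
    empty_columns = []
    if kept:
        empty_columns = [
            index for index, cell in enumerate(kept[0])
            if cell is None or str(cell).strip() == ""
        ]
    return kept, empty_columns


# Made-up sheet contents: two title rows followed by the data rows.
rows = [
    ["Report title", None, None],
    ["Exported by instrument software", None, None],
    [1.2, "", 3.4],
    [2.0, 1.0, 4.1],
]

kept, empty_columns = prepare_rows(rows, skip_lines=2)
print(len(kept), empty_columns)
```

A non-empty list of column indices flags a first row of values that may cause data type recognition problems on import.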

5.8. GRAMS

5.8.1 GRAMS from Thermo Scientific

Type of data

Array

Data dimensions

Multiple spectra, constituents

Software

GRAMS

Vendor

Thermo Scientific (formerly Galactic)

File name extension

*.spc, *.cfl

How to use it

This format is from GRAMS, a software package developed by Galactic (now part of Thermo

Scientific), and available for data from many different instruments.

The data are stored in two different file types. Spectra are stored in binary files with the .spc file name extension, and constituents are stored in ASCII files with the .cfl file name extension. The two file types are connected so that if a .cfl file is imported into The Unscrambler® both spectra and constituents are read. If a .spc file is imported, the spectra are read, and accompanying Y values can also be imported with them.

“X-values” (usually wavelengths) in .spc files are imported as X-variable names.

Constituents in .cfl files are imported as Y-variables. “Y-values” are imported as separate

column sets with the name of the Y values for the columns.

Some .spc files contain a log block. This may include file names and sample numbers. To

import these, one can select Sample naming… and designate whether to use one, both or

none of these fields.

The binary part of the log block (which usually contains the imaginary part of complex

spectral data) is not imported, nor is the ASCII part of the log.

One or several GRAMS .spc files can be imported into a project in The Unscrambler®.

Select the files to import from the file list in the GRAMS Import dialog box or use the Browse

button to obtain a list of available files. The different files must have the same number of X-

variables and the same contents in the Y-matrix to allow simultaneous import.

GRAMS Import

The source files may contain one or more samples per file (i.e. single spectra or multifiles¹);

multiple selections allow one to import several samples with the same number of variables

at the same time. The dialog will include details about the files that are eligible for import. It

will show the number of samples per file, the number of X variables, number of Y variables,

and the starting and ending X variables.

Multiple selections

Select one or more files to import by checking the check box next to each file, or by using

Auto select matching spectra.

The contents of all the selected spectra will be merged to create one data matrix during

import. If the data files also include Y values, these will also be imported.

Deselect all

Clear the current selection by unselecting all samples.

Preview spectra

Check to review a plot of selected spectra before importing.

Sample naming…

Include sample names or sample numbers in the resulting data table.

Sample names will only be imported if they are present in the source file.

Interpolate

Checking the Interpolate option allows the import of data with different starting and ending points, provided the number of points is the same in all sets to be imported.

When the % button is selected, the following dialog appears, allowing the user to set the Tolerance for importing data with different start or end points.

Interpolate Tolerance Dialog

The Auto select matching spectra preview option allows the automatic selection of all data file(s) with the same wavelength ranges as the current selection. A screenshot of the GRAMS Import dialog with the auto select option chosen is provided below.

Once the Auto select matching spectra option has been checked it will select only those files

that have the same number of variables as the first selected file.

Use the Interpolate option to import data with different start or end points.

Sorting data

The file name, the number of samples, the number of X-variables, and the wavelengths of the first and last X-variables are displayed for each file.

The data table resulting from the import can be sorted based on any of these columns in the

file list. Click on a column header to set sort order, and a second time to reverse the sort

order.

Preview

Preview spectra displays a line plot of the files that have been selected for import.

¹ Multifiles are a specific kind of GRAMS file that has multiple spectra in a single file, as opposed to a single spectrum per file.

5.9. GuidedWave

5.9.1 CLASS-PA & SpectrOn from Guided Wave

Type of data/instrument

spectrometer (UV, UV-vis, NIR)

Data dimensions

Single spectra, constituents

Instrument/hardware

CLASS-PA, SpectrOn

Vendor

Guided Wave

File name extension

*.asc, *.scn, *.autoscan, *.gva

How to use it

This option allows one to import data from Guided Wave instruments. The data files typically have the extension .asc, .scn, .autoscan, or .gva, but other extensions are possible because the file type is not strictly defined by the extension.

From the File – Import Data menu, select CLASS-PA & SpectrOn. The Guided Wave dialog

box displays a list of files from which one can import CLASS-PA & SpectrOn data. If

necessary, click the Browse button to access files from a different folder.

CLASS-PA & SpectrOn import

Multiple selections are possible, by checking the box next to more than one file. The

selected samples must be of the same size (variables must match).

Multiple selections

Select one or more files to import by checking the check box next to each file, or by using

Auto select matching spectra.

The contents of all the selected spectra will be merged to create one data matrix during

import.

Deselect all

Clear the current selection by unselecting all samples.

Preview spectra

Check to review a plot of selected spectra before importing.

Sample naming…

Include sample names, sample numbers or timestamps in the resulting data table.

Sample names will only be imported if they are present in the source file.

Interpolate

Checking the Interpolate option allows the import of data with different starting and ending points, provided the number of points is the same in all sets to be imported.

When the % button is selected, the following dialog appears, allowing the user to set the Tolerance for importing data with different start or end points.

Interpolate Tolerance Dialog

Y-variables

Constituents may also be imported by checking the following options:

Import Y-variables

Import Predicted Y-variables

The Auto select matching spectra preview option allows the automatic selection of all data file(s) with the same wavelength ranges as the current selection. This dialog is used by spectral data imports from instrument formats such as CLASS-PA & SpectrOn, GRAMS, OPUS, etc. A screenshot of the Guided Wave Import dialog box with the auto select option chosen is given below.

Use the Interpolate option to import data with different start or end points.

Sorting data

The file name, number of X-variables, wavelengths for the first and last X-variables, and step

(increase in wavelength), are displayed for each file.

The data table resulting from the import can be sorted based on any of these columns in the

file list: Click on a column header to set sort order, and a second time to reverse the sort

order.

Preview

Preview spectra displays a line plot of the files that have been selected for import.

5.10.1 Interpolate functionality

It is common, particularly with Fourier Transform (FT) spectrometers, that data collected on different instruments of the same make have slightly different starting and ending wavenumbers, even though they were collected at the same resolution.

When data is imported into The Unscrambler®, the import dialog relies on three important pieces of information:

The starting value of the spectra

The ending value of the spectra

The number of points in the spectra

If there is a mismatch in any of these values, there are two possible scenarios:

If the number of points in the spectra does not match, a matrix cannot be formed because the spectra do not have the same column dimension.

If the start or end points do not match, a matrix again cannot be formed; however, if the differences between the values are small, interpolation can be used to reconcile them.

The Interpolation function used in the Import menus is different from that found in Tasks - Transform (which may be useful for trying to match data from two sets collected at different resolutions).

Find out more about the Interpolate Transform here.

Data Imports Supporting Interpolation

The following file imports support the interpolate functionality in The Unscrambler® import

dialog boxes.

JCAMP-DX

Thermo Galactic GRAMS

OPUS (Bruker Optics)

CLASS-PA & SpectrOn

Indico (ASD)

OMNIC™ (Thermo)

Varian

PerkinElmer

Functionality

When a file import supporting interpolation is selected, the Interpolate checkbox is present in the import dialog.

The % button opens the Tolerance dialog box, which has a slider bar for setting how far beyond the limits of the reference spectrum the interpolation may extend.

Tolerance Dialog

Any points that lie within +/- the set percentage tolerance of the starting point will be

included in the import.
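
The tolerance rule above can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the algorithm used by the program: the tolerance is taken as a percentage of the reference starting point, spectra are resampled onto the reference axis by linear interpolation, and values outside the measured range are held at the nearest measured value. The function names and the three-point spectra are made up for illustration.

```python
from bisect import bisect_left


def within_tolerance(reference_start, start, percent):
    """True if start lies within +/- percent of the reference starting point."""
    return abs(start - reference_start) <= abs(reference_start) * percent / 100.0


def interpolate_to_axis(x, y, new_x):
    """Linearly interpolate (x, y) onto new_x; x must be strictly increasing."""
    result = []
    for target in new_x:
        if target <= x[0]:
            result.append(y[0])        # hold the first value below the range
        elif target >= x[-1]:
            result.append(y[-1])       # hold the last value above the range
        else:
            i = bisect_left(x, target)  # x[i - 1] < target <= x[i]
            x0, x1 = x[i - 1], x[i]
            y0, y1 = y[i - 1], y[i]
            result.append(y0 + (y1 - y0) * (target - x0) / (x1 - x0))
    return result


# Made-up data: same number of points, axis shifted by one unit.
reference_x = [4000.0, 4004.0, 4008.0]
other_x = [4001.0, 4005.0, 4009.0]
other_y = [1.0, 2.0, 3.0]

accepted = within_tolerance(reference_x[0], other_x[0], percent=0.1)
resampled = interpolate_to_axis(other_x, other_y, reference_x) if accepted else None
print(accepted, resampled)
```

After resampling, every accepted spectrum shares the column headers of the reference spectrum, so the spectra can be stacked into one matrix.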

Example

Nine spectra were collected on three different Bruker spectrometers using 8 wavenumber resolution. Three replicate spectra were collected on each instrument. Each spectrum consists of 1154 points; however, the starting point of each spectrum is different. Selecting the first spectrum and then checking the Auto select matching spectra box selects only the first three spectra.

To import all data into one table, check the Interpolate box and set the Tolerance to include all spectra in the set.

When the Auto select matching spectra box is reselected, all spectra are included in the import.

The data are now displayed as a node in the project navigator using the column headers of

the reference spectrum selected.

5.11. Indico

5.11.1 Indico

Type of data/instrument

—

Data dimensions

Single spectra

Software

Indico Pro 5.6 (version 6 files)

RS3 5.6 (version 7 files)

Indico Pro 6.0 (version 8 files)

Vendor

ASD Inc.

File name extension

*.asd, *.001, *.002, *.3456, etc. (any number)

How to use it

This option allows for the import of data files created with the ASD Inc. software Indico Pro and RS3. The ASD file versions currently supported are version 6 (generated by Indico Pro 5.6), version 7 (generated by RS3 5.6), and version 8 (generated by Indico Pro 6.0). Source files with the following file name extensions are supported: .asd, .001, .002, .3456, etc. (any number).

Select the files to import from the file list in the Indico Import dialog box or use the Browse

button to obtain a list of available files. The Indico Import dialog box displays a list of files

from which one may import Indico data. This includes the file names, the number of X-

variables, names of the First and Last X-variables and step size.

INDICO Import

The source files contain one sample per file; multiple selection allows for the import of

several files (samples) at the same time.

Multiple selections

Select one or more files to import by checking the check box next to each file, or by using

Auto select matching spectra.

The contents of all the selected spectra will be merged to create one data matrix during

import.

Deselect all

Clear the current selection by unselecting all samples.

Preview spectra

Check to review a plot of selected spectra before importing.

Sample naming…

Include sample names or sample numbers in the resulting data table.

Sample names will only be imported if they are present in the source file.

Interpolate

Checking the Interpolate option allows the import of data with different starting and ending points, provided the number of points is the same in all sets to be imported.

When the % button is selected, the following dialog appears, allowing the user to set the Tolerance for importing data with different start or end points.

Interpolate Tolerance Dialog

The Auto select matching spectra preview option allows the automatic selection of all data file(s) with the same wavelength ranges as the current selection. This dialog is used by spectral data imports from instrument formats such as Indico, GRAMS, OPUS, etc. A screenshot of the Indico Import dialog with the auto select option chosen is given below.

Use the Interpolate option to import data with different start or end points.

Sorting data

The file name, number of X-variables, wavelengths for the first and last X-variables, and step

(increase in wavelength), are displayed for each file.

The data table resulting from the import can be sorted based on any of these columns in the

file list: Click on a column header to set sort order, and a second time to reverse the sort

order.

Preview

Preview spectra displays a line plot of the files that have been selected for import.

5.12. JcampDX

5.12.1 JCAMP-DX

Type of data/instrument

Vector and arrays. Standard

Data dimensions

Multiple spectra, constituents

Vendor

JCAMP/IUPAC

File name extensions

*.jdx, *.dx, *.jcm

How to use it

This is a standard, portable data format defined by JCAMP to support exchange of chemical

and spectroscopic information.

It was originally a standard data format for IR, which has since been extended to

accommodate NMR, mass spec and other data, motivated by the desire to share data

irrespective of the spectrometer on which it was acquired and the need for long-term data

archival, well past the expected lifetime of current hardware and software.

Further development of JCAMP standards is now under the auspices of IUPAC.

One can import one or several JCAMP-DX files with .jdx, .dx, .jcm file name extensions

into a project in The Unscrambler®.

Select the files to import from the file list in the JCAMP-DX Import dialog box or use the

Browse button to get a list of available files.

The different files must have the same number of X-variables and the same contents in the

Y-matrix to allow simultaneous import.

JCAMP-DX Import

Multiple selections

Select one or more files to import by checking the check box next to each file, or by using

Auto select matching spectra.

The contents of all the selected spectra will be merged to create one data matrix during import.

Deselect all

Clear the current selection by unselecting all samples.

Preview spectra

Check to review a plot of selected spectra before importing.

Sample naming…

Include sample names or sample numbers in the resulting data table.

Sample names will only be imported if they are present in the source file.

Interpolate

Checking the Interpolate option allows the import of data with different starting and ending points, provided the number of points is the same in all sets to be imported.

When the % button is selected, the following dialog appears, allowing the user to set the Tolerance for importing data with different start or end points.

Interpolate Tolerance Dialog

The Auto select matching spectra preview option allows the automatic selection of all data

file(s) with the same wavelength ranges as the current selection.

Use the Interpolate option to import data with different start or end points.

Sorting data

The file name, number of samples, number of X variables, number of Y variables, and

wavelengths for the first and last X-variables are displayed for each file.

The data table resulting from the import can be sorted based on any of these columns in the

file list: Click on a column header to set sort order, and a second time to reverse the sort

order.

Preview

Preview spectra displays line plots of selected files for import.

This format is used by many spectroscopy instrument vendors, e.g. Bran+Luebbe

(IDAS/Infralyzer), NIRSystems (NSAS), Perkin Elmer, Thermo Fisher (Grams, Omnic), Bruker

(OPUS), etc.

General

JCAMP-DX files are ASCII files with file headers containing information about the data and their origin, etc., and they may contain both X-data (spectra) and Y-data (concentrations).

Only the most essential information in the JCAMP-DX file will be imported. The first title in the JCAMP-DX file will be used, and one has the additional option of also importing file names and sample numbers. There is no limit on the length of a file name. If several JCAMP-DX files are imported and saved in the same Unscrambler® file, the matrix name will be that of the first JCAMP-DX file imported.

JCAMP “X-values” (usually wavelengths) become X-variable names, while JCAMP “Y-values” become X-variable values. “Concentrations” are interpreted as Y-variables. Variable names are imported, with no limit on the number of characters. The “Sample description” is used as the sample name. Unfortunately, there are different dialects of JCAMP-DX, so in some cases information such as sample names may be lost if it was entered incorrectly in the original file.

The XYPOINTS variant demands more disk space than XYDATA.

Examples of the XYPOINTS and XYDATA formats follow.

JCAMP-DX XYPOINTS

The example below shows only one sample.

##JCAMP-DX= 4.24 $IDAS 1.40

##DATA TYPE= NEAR INFRARED SPECTRUM

##ORIGIN= Bran+Luebbe Analyzing Technologies

##OWNER= Applications Laboratory

##DATE= 92/ 6/10 $$ WED

##TIME= 1: 0: 3

##BLOCKS= 14

##SAMPLE DESCRIPTION= WHE202CH $$ 1.00

##SAMPLING PROCEDURE= DIFFUSE REFLECTION

##DATA PROCESSING= LOG(1/R)

##XUNITS= NANOMETERS

##YUNITS= ABSORBANCE

##XFACTOR= 1.0

##YFACTOR= 0.000001

##FIRSTX= 1445

##LASTX= 2348

##FIRSTY= 0.652170

##MINY= 0.552445

##MAXY= 1.258505

##NPOINTS= 19

##CONCENTRATIONS= (NCU)

(<CARBOHYDRATE>, 89.400, %)

(<PROTEIN>, 9.410, %)

##XYPOINTS= (XY..XY)

1445, 652170; 1680, 555209; 1722, 606660; 1734, 612745;

1759, 604142; 1778, 575455; 1818, 552445; 1940, 631510;

1982, 657704; 2100, 1188830; 2139, 1082772; 2180, 1008640;

2190, 999405; 2208, 951049; 2230, 978299; 2270, 1198344;

2310, 1258505; 2336, 1209149; 2348, 1153169;

##END=
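
To make the structure of the XYPOINTS listing concrete, the following Python sketch reads a shortened copy of the example above. It is a minimal illustration, not the parser used by The Unscrambler®: the function name is hypothetical, only the plain "X, Y;" pair layout shown above is handled, the text after "$$" is treated as a comment, and the NPOINTS value has been changed to match the shortened sample.

```python
def parse_xypoints(text):
    """Return (header dict, list of scaled (x, y) pairs) from a JCAMP-DX XYPOINTS block."""
    header = {}
    points = []
    in_points = False
    for raw_line in text.splitlines():
        line = raw_line.split("$$")[0].strip()  # drop '$$' comments
        if not line:
            continue
        if line.startswith("##"):
            label, _, value = line[2:].partition("=")
            label = label.strip().upper()
            in_points = label == "XYPOINTS"
            if label == "END":
                break
            header[label] = value.strip()
        elif in_points:
            # Pairs are separated by ';', X and Y within a pair by ','.
            for pair in line.split(";"):
                if pair.strip():
                    x_text, y_text = pair.split(",")
                    points.append((float(x_text), float(y_text)))
    x_factor = float(header.get("XFACTOR", 1.0))
    y_factor = float(header.get("YFACTOR", 1.0))
    return header, [(x * x_factor, y * y_factor) for x, y in points]


# Shortened copy of the example above (first four points only).
SAMPLE = """##JCAMP-DX= 4.24 $IDAS 1.40
##SAMPLE DESCRIPTION= WHE202CH $$ 1.00
##XFACTOR= 1.0
##YFACTOR= 0.000001
##NPOINTS= 4
##XYPOINTS= (XY..XY)
1445, 652170; 1680, 555209; 1722, 606660; 1734, 612745;
##END=
"""

header, points = parse_xypoints(SAMPLE)
print(header["SAMPLE DESCRIPTION"], len(points), points[0])
```

The stored integers are multiplied by YFACTOR to recover absorbance values, which is why the first stored Y-value of 652170 corresponds to the FIRSTY value of 0.652170 in the header.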

JCAMP-DX XYDATA

The example below shows only one sample.

##JCAMP-DX= 4.24 $IDAS 1.40

##DATA TYPE= NEAR INFRARED SPECTRUM

##ORIGIN= Bran+Luebbe Analyzing Technologies

##OWNER= Applications Laboratory

##DATE= 92/ 7/ 9 $$ THU

##TIME= 20:53:17

##BLOCKS= 14

##SAMPLE DESCRIPTION= COF12BUS $$ 1.00

##SAMPLING PROCEDURE= DIFFUSE REFLECTION

##DATA PROCESSING= LOG(1/R)

##XUNITS= NANOMETERS

##YUNITS= ABSORBANCE

##XFACTOR= 1.0

##YFACTOR= 0.000001

##FIRSTX= 1100

##LASTX= 2500

##FIRSTY= 0.139460

##MINY= 0.131600

##MAXY= 1.380070

##NPOINTS= 281

##CONCENTRATIONS= (NCU)

(<CARBOHYDRATE>, 89.400, %)

(<PROTEIN>, 9.410, %)

##DELTAX= 5

##XYDATA= (X++(Y..Y))

1100 139459 137435 135089 133060 131669 131599 133794 138899

1140 145740 151897 158459 167527 180800 195522 206585 216499

...

...

2460 1378929 1379632 1378464 1374972 1378929 1376837 1372945 1377632

2500 1380069

##END=
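
In the XYDATA layout each data line starts with one X-value followed by a run of Y-values, and the X-values of the remaining points follow from DELTAX. The Python sketch below expands the first data line of the example above. It is an illustration only: the function name is hypothetical, and only the plain whitespace-separated form shown above is handled, not the compressed number forms that JCAMP-DX also permits.

```python
def expand_xydata_line(line, delta_x, x_factor=1.0, y_factor=1.0):
    """Expand one 'X Y Y Y ...' line into a list of scaled (x, y) pairs."""
    fields = line.split()
    first_x = float(fields[0]) * x_factor
    return [
        (first_x + index * delta_x, float(y_text) * y_factor)
        for index, y_text in enumerate(fields[1:])
    ]


# First data line of the example above, with DELTAX = 5 and YFACTOR = 0.000001.
line = "1100 139459 137435 135089 133060 131669 131599 133794 138899"
pairs = expand_xydata_line(line, delta_x=5.0, y_factor=0.000001)

x_values = [x for x, _ in pairs]
print(x_values)
```

The eight Y-values cover 1100 to 1135 nm in steps of 5, so the next data line in the example starts at 1140.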

The appropriate parameters in this field will be written to the JCAMP exported file. More parameters can be included in the file if necessary. The user can type any information into the field, but only text in the format ##KEYWORD = ..., as listed below, will be used during export.

JCAMP keywords

Keyword       Legal values

BASELINEC=    YES or NO

APCOM=        String60

JCAMP-DX=     String

ORIGIN=       String
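
The filtering rule described above (free text is allowed, but only lines in the ##KEYWORD = ... format are used) can be sketched in Python as follows. This is an illustration, not the program's export code: the function name is hypothetical, the set of accepted keywords is limited to the four shown in the table above, and the field text is made up.

```python
import re

# Keywords taken from the table above; the full list may be longer.
ALLOWED_KEYWORDS = {"BASELINEC", "APCOM", "JCAMP-DX", "ORIGIN"}
KEYWORD_LINE = re.compile(r"^##\s*([^=]+?)\s*=\s*(.*)$")


def extract_export_parameters(field_text):
    """Keep only lines of the form '##KEYWORD = value' with a listed keyword."""
    parameters = {}
    for line in field_text.splitlines():
        match = KEYWORD_LINE.match(line.strip())
        if match and match.group(1).upper() in ALLOWED_KEYWORDS:
            parameters[match.group(1).upper()] = match.group(2).strip()
    return parameters


# Made-up field contents: one free-text line, two listed keywords, one unlisted.
field_text = """Measured on the bench instrument
##ORIGIN = Applications Laboratory
##BASELINEC = YES
##OPERATOR = someone
"""

parameters = extract_export_parameters(field_text)
print(parameters)
```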

5.13. Konica_Minolta

5.13.1 Konica_Minolta

Type of data/instrument

KONICA MINOLTA NIR spectrometer

Data dimensions

single vector spectrum or multiple spectra in an array

Instrument/hardware

Vendor

Konica_Minolta

File name extension

How to use it

This option allows for the import of data files created with a KONICA MINOLTA NIR spectrometer. The import can connect directly to the spectrometer and acquire data, and it also supports ASCII file import.

Select the ASCII files to import using the Import button in the Konica_Minolta Import dialog box.

Konica_Minolta Import

Upon selection of ASCII files, the spectrum is displayed in the dialog box as a line plot. After selecting the files, click OK to import the data.

Konica_Minolta Import

The contents of all the spectra in the dialog will be merged to create one data matrix after import.

Delete

Deletes the selected spectra.

Rename

Renames the selected spectra.

Select/DeSelect

Use the left mouse button to select or deselect spectra for viewing in the plot.

5.14. Matlab

5.14.1 Matlab

Type of data

Array

Software

Matlab

Vendor

MathWorks, Inc.

File name extension

*.mat

How to use it

MATLAB is a numerical computing environment and fourth generation programming

language.

The Unscrambler® allows for the import of data from Matlab data files created with Matlab

versions 5.x to 7.0.

The following cannot be imported from Matlab to The Unscrambler®:

Cell arrays,

Structures,

Sparse matrices.

Use the save command in Matlab, listing the variables to store after the destination file name, or use save destinationfilename to save all variables in the workspace.

This will create a Matlab formatted .mat file. For more help on using the save command, type help save in Matlab.

To import the file in The Unscrambler® select File - Import Data - Matlab. Select the

destination filename in The Unscrambler® to get the Import Matlab dialog box.

Select which variables represent the Data, Sample names and Variable names. The sample name and variable name variables must match the corresponding dimension of the data variable (for example, 5 rows and 4 columns in the figure below) or they will not be displayed in the drop-down lists with available sample and variable names.

Import Matlab dialog

Matlab variables representing sample and variable names must be character arrays.
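
The dimension rule for name variables can be sketched in Python. This is an illustration under stated assumptions, not the program's logic: the workspace is represented as a hypothetical dictionary of variable kinds and shapes, and a character array is assumed to hold one name per row, so that it needs as many rows as the data has samples (for sample names) or variables (for variable names).

```python
def name_candidates(workspace, data_name):
    """List char arrays usable as sample names and as variable names."""
    n_rows, n_cols = workspace[data_name]["shape"]
    sample_names, variable_names = [], []
    for name, info in workspace.items():
        if info["kind"] != "char":
            continue  # only character arrays qualify as names
        rows = info["shape"][0]  # assumed: one name per row of the char array
        if rows == n_rows:
            sample_names.append(name)
        if rows == n_cols:
            variable_names.append(name)
    return sample_names, variable_names


# Made-up workspace summary: a 5 x 4 data matrix and some other variables.
workspace = {
    "X": {"kind": "double", "shape": (5, 4)},
    "samples": {"kind": "char", "shape": (5, 8)},
    "wavelengths": {"kind": "char", "shape": (4, 6)},
    "notes": {"kind": "char", "shape": (1, 20)},
    "labels": {"kind": "cell", "shape": (5, 1)},
}

sample_names, variable_names = name_candidates(workspace, "X")
print(sample_names, variable_names)
```

In this sketch the cell array is skipped even though its size fits, and the single-row note is offered for neither list because its row count matches neither dimension of the data.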

5.15. MyInstrument

5.15.1 MyInstrument

Type of data/instrument

Instrument interface standard defined by Thermo Electron (formerly Galactic) and

supported by many instrument vendors.

Additional information

A MyInstrument driver provided by the specific instrument vendor and the corresponding MyInstrument add-on for The Unscrambler® are required. These modules are available separately from CAMO Software and may not be part of the standard package.

How to use it

The MyInstrument add-on for The Unscrambler® lets users acquire spectra from their spectrometers directly into The Unscrambler®. The acquisition process uses the MyInstrument standard for instrument configuration and definition of experiments in order to run scans. The functionality provided depends on the instrument. After acquisition, the spectral data are inserted directly into an editor in The Unscrambler®, one row per scan, ready for further processing or modeling. The MyInstrument add-on removes the need to acquire data with other instrument-specific software, save to a file and then import into The Unscrambler®.

Working with the MyInstrument add-on

Start a session in The Unscrambler® and use the menu item which typically has the vendor

company name followed by MyInstrument…, e.g. for a Zeiss instrument: File – Import Data –

Zeiss MyInstrument…

The next window will show the vendor specific MyInstrument control screen, e.g. for a Zeiss

instrument:

The appearance and usage of the control dialog will depend on the particular instrument

vendor. Details of using the instrument interface will be available from the manuals provided

by the instrument vendor. Using the instrument may require specific configuration and

setup procedures provided by the vendor before being able to run scans.

Sample scan result. The result may look entirely different for the instrument being used; it is provided here only as an example.

Click OK to end the scan acquisition session. The scans should now be available within The

Unscrambler® editor for subsequent processing and modeling.

5.16. NetCDF

5.16.1 NetCDF

Type of data

Open standard for array-oriented data

Developed by

University Corporation for Atmospheric Research (UCAR)

File name extension

*.cdf, *.nc

How to use it

NetCDF (network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.

What Is NetCDF?

NetCDF (network Common Data Form) is a set of interfaces for array-oriented data access

and a freely-distributed collection of data access libraries for C, Fortran, C++, Java, and other

languages. The NetCDF libraries support a machine-independent format for representing

scientific data. Together, the interfaces, libraries, and format support the creation, access,

and sharing of scientific data.

NetCDF data is:

Portable. A NetCDF file can be accessed by computers with different ways of storing

integers, characters, and floating-point numbers.

Scalable. A small subset of a large data set may be accessed efficiently.

Appendable. Data may be appended to a properly structured NetCDF file without

copying the data set or redefining its structure.

Sharable. One writer and multiple readers may simultaneously access the same

NetCDF file.

Archivable. Access to all earlier forms of NetCDF data will be supported by current

and future versions of the software.

The NetCDF software was developed by Glenn Davis, Russ Rew, Ed Hartnett, John Caron,

Steve Emmerson, and Harvey Davies at the Unidata Program Center in Boulder, Colorado,

with contributions from many other NetCDF users.

Select the files to import from the file list in the NetCDF Import dialog or use the Browse button to get a list of available files.

Select a .cdf or .nc file to import and then click Open.

NetCDF Import dialog

Sample names and variable names can be selected in the NetCDF Import dialog.

5.17. NSAS

5.17.1 NSAS

Type of data/instrument

NIR

Data dimensions

Multiple spectra, constituents

Instrument/hardware

Foss 5000, 6500, XDS

Vendor

FOSS

File name extension

*.da, *.cn, *.cal

How to use it

The NSAS file format originates from FOSS NIRSystems NIR instruments and is the format used by their DOS-based NSAS software. Files can be saved from the FOSS WINISI software and FOSS Vision software into the NSAS format.

See the technical reference for an overview of instrument parameters that The

Unscrambler® can import from NSAS data files.

NSAS data import allows the import of NIR spectral data files generated by FOSS instruments

and accompanying constituents from the NSAS file format, which have the .da and .cn file

name extensions respectively.

Select the files to import from the file list in the NSAS Import dialog or use the Browse button to get a list of available files. The different files must have the same number of X-variables and the same contents in the Y-matrix to allow simultaneous import.

NSAS Import

The source files may contain one or more samples per file; multiple selections allow several

samples to be imported at the same time.

Multiple selections

Select one or more files to import by checking the check box next to each file, or by using

Auto select matching spectra.

The contents of all the selected spectra will be merged to create one data matrix during import.

Deselect all

Clear the current selection by unselecting all samples.

Preview spectra

Check to review a plot of selected spectra before importing.

Sample naming…

Include sample names or sample numbers in the resulting data table.

Sample names will only be imported if they are present in the source file.

The Auto select matching spectra preview option provides automatic selection of all data file(s) with the same wavelength ranges as the current selection. This dialog is used to import spectral data from instruments with the NSAS file format, as well as others. A screenshot of the NSAS Import dialog with the auto select option chosen is given below.

Once Auto select matching spectra has been checked, the files in the list that have the same number of variables are selected.
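The matching rule behind Auto select matching spectra can be illustrated with a short sketch. The Python code below is not the implementation used by The Unscrambler®; the SpectrumFile record, the function name and the file values are hypothetical, and the sketch assumes that a match requires the same number of X-variables and the same first and last wavelengths.

```python
from typing import NamedTuple


class SpectrumFile(NamedTuple):
    """One row of the file list (hypothetical record for illustration)."""
    name: str
    n_xvars: int
    first_x: float
    last_x: float


def auto_select_matching(files, current):
    """Return the names of files whose X-axis matches the current selection."""
    key = (current.n_xvars, current.first_x, current.last_x)
    return [f.name for f in files if (f.n_xvars, f.first_x, f.last_x) == key]


# Illustrative values only; they are not taken from real instrument files.
files = [
    SpectrumFile("a.da", 700, 1100.0, 2498.0),
    SpectrumFile("b.da", 700, 1100.0, 2498.0),
    SpectrumFile("c.da", 1050, 400.0, 2498.0),
]
print(auto_select_matching(files, files[0]))  # ['a.da', 'b.da']
```

Comparing the axis description rather than the spectra themselves keeps the check cheap, since only the values already shown in the file list are needed.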

Sorting data

The file name, number of samples, number of X-variables, wavelengths for the first and last

X-variables are displayed for each file.

The data table resulting from the import can be sorted based on any of these columns in the

file list: Click on a column header to set sort order, and a second time to reverse the sort

order.

Preview

Preview spectra displays a line plot of the files selected for import.

This document describes the instrument parameters that can be imported from NSAS data

files. Files can be saved from the FOSS WINISI software and FOSS Vision software into the

NSAS format.

Instrument parameters from NSAS files

NSAS Data Import will read information in the NSAS data file which has no natural place in

The Unscrambler® file format into the Instrument Info block under specific keywords.

Similarly, NSAS/Vision Model Export will look for a relevant subset of these keywords and, if

found, it will place the values in the corresponding places in the NSAS/Vision Model file.

The NSAS/Vision keywords are listed below.

NSAS/Vision keywords

Keyword | Legal values

NSAS_AmpType | String: “Reflectance”, “Transmittance”, “(Reflect/Reflect)”, “(Reflect/Transmit)”, “(Transmit/Reflect)”, “(Transmit/Transmit)”, “Not used”

NSAS_CellType | String: “Standard sample cup”, “Manual”, “Web analyzer”, “Coarse sample”, “Remote reflectance”, “Powder module”, “High fat/moisture”, “Rotating drawer”, “Flow-through liquid”, “Cuvette”, “Paste cell”, “Cuvette cell”, “3 mm liquid cell”, “30 mm liquid cell”, “Coarse sample with sample dump”

NSAS_Volume | String: “1/4 full”, “1/2 full”, “3/4 full”, “Completely full”

NSAS_Math2_Type | =

NSAS_Math3_Type | =

NSAS_Math2_SegmentSize | =

NSAS_Math3_SegmentSize | =

NSAS_Math2_GapSize | =

NSAS_Math3_GapSize | =

NSAS_Math2_DivisorPoint | =

NSAS_Math3_DivisorPoint | =

NSAS_Math2_SubtractionPoint | =

NSAS_Math3_SubtractionPoint | =

1 = “N-point smooth”, 2 = “Reflective energy”, 3 = “Kubelka-Munk”, 4 = “1st derivative”, 5 = “2nd derivative”, 6 = “3rd derivative”, 7 = “4th derivative”, 8 = “Savitsky & Golay”, 9 = “Divide by wavelength”, 10 = “Fourier transform”, 11 = “Correct for reference changes”, 13 = “Full MSC”, 21 = “N-point smooth”, 22 = “1st derivative”, 23 = “2nd derivative”, 31 = “Savitzky-Golay first derivative”

5.18. Omnic

5.18.1 OMNIC

Type of data/instrument

FTIR, FT-NIR, Raman

Data dimensions

Single spectra

Instrument/hardware

Nicolet IR, Antaris, NXR

Vendor

Thermo Scientific (Nicolet)

File name extension

*.spa, *.spg

How to use it

Data generated by Thermo molecular spectroscopy instruments and related OMNIC

software.

This option allows for the import of data from OMNIC files generated by ThermoFisher

instruments and related software.

Source files with .spa or .spg file name extension are supported.

Selecting the OMNIC dialog box displays a list of files from which one can import OMNIC

data.

If necessary, click the Browse button close to the Look in: field in order to access files from a

different folder.

OMNIC Import

The source files contain one sample per file. Multiple selection allows several files (samples)

to be imported at the same time.

Multiple selections

Select one or more files to import by checking the check box next to each file, or by using

Auto select matching spectra.

The contents of all the selected spectra will be merged to create one data matrix during import.

Deselect all

Clear the current selection by unselecting all samples.

Preview spectra

Check to review a plot of selected spectra before importing.

Sample naming…

Include sample names or sample numbers in the resulting data table.

Sample names will only be imported if they are present in the source file.

Checking the Interpolate option allows the import of data with different starting and ending points, provided the number of points is the same in all sets to be imported.

When the % button is selected, the Interpolate Tolerance dialog appears, in which the user can set the Tolerance for allowing data with different start or end points to be imported.

Interpolate Tolerance Dialog

The Auto select matching spectra preview option allows the automatic selection of all data file(s) with the same wavelength ranges as the current selection. This dialog is used to import spectral data from instruments with the OMNIC file format. A screenshot of the OMNIC Import dialog with the auto select option chosen is given below.

Once the Auto select matching spectra option has been checked, the files in the list that have the same variables are selected.

Use the Interpolate option to import data with different start or end points.
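The idea behind the Interpolate option can be sketched as follows. The manual does not state how the Tolerance percentage is applied or which interpolation method is used, so the Python code below rests on two loudly stated assumptions: the tolerance is taken as a percentage of the reference wavelength range, and the spectrum is moved onto the reference axis by linear interpolation. The function names are hypothetical.

```python
def within_tolerance(ref_axis, axis, tol_percent):
    """True if start and end points differ by at most tol_percent of the
    reference range. This definition of tolerance is an assumption."""
    if len(ref_axis) != len(axis):
        return False  # the text requires the same number of points
    limit = abs(ref_axis[-1] - ref_axis[0]) * tol_percent / 100.0
    return (abs(axis[0] - ref_axis[0]) <= limit
            and abs(axis[-1] - ref_axis[-1]) <= limit)


def interpolate(axis, values, new_axis):
    """Linearly interpolate (axis, values) at new_axis.

    Both axes must be ascending. Points outside the measured range take
    the nearest end value.
    """
    out = []
    j = 0
    for x in new_axis:
        if x <= axis[0]:
            out.append(values[0])
        elif x >= axis[-1]:
            out.append(values[-1])
        else:
            # Advance to the segment [axis[j], axis[j + 1]] that contains x.
            while axis[j + 1] < x:
                j += 1
            t = (x - axis[j]) / (axis[j + 1] - axis[j])
            out.append(values[j] + t * (values[j + 1] - values[j]))
    return out


ref_axis = [1000.0, 1002.0, 1004.0]
axis = [1000.5, 1002.5, 1004.5]
values = [1.0, 2.0, 3.0]
if within_tolerance(ref_axis, axis, tol_percent=15):
    print(interpolate(axis, values, ref_axis))  # [1.0, 1.75, 2.75]
```

With a tolerance of 10 percent the same pair of axes would be rejected in this sketch, because the 0.5 unit offset exceeds 10 percent of the 4 unit range.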

Sorting data

The file name, number of X-variables, wavelengths for the first and last X-variables, and step (increase in wavelength) are displayed for each file.

Step is the increment in wavelength (or wavenumber) between two successive variables. For evenly spaced variables the following relationship should be true: Last X-variable = First X-variable + Step × (Number of X-variables - 1).

The data table resulting from the import can be sorted based on any of these columns in the

file list: Click on a column header to set sort order, and a second time to reverse the sort

order.

Preview

Preview spectra displays a line plot of the files selected for import.

5.19. OPC

5.19.1 OPC protocol

Type of data/instrument

Standard data transfer protocol

Vendor

OPC Foundation

File format information

How to use it

OPC (originally OLE for Process Control) is a non-proprietary technical specification created with the collaboration of a number of leading worldwide automation hardware and software suppliers, working in cooperation with Microsoft under the auspices of the OPC Foundation. The original standard provided specifications for process data acquisition, making possible interoperability between automation/control applications, field systems/devices and automation data between PC-based clients using Microsoft operating systems. In 2009 a new standard, OPC Unified Architecture, was developed, providing specifications for cross-platform capability.

An OPC Server is often referred to as an OPC Driver. The two terms are synonymous.

An OPC Server is a software application that acts as an API (Application Programming Interface) or protocol converter. An OPC Server will connect to a device such as a PLC, DCS, RTU, or a data source such as a database or user interface, and translate the data into a standard-based OPC format. OPC-compliant applications such as an HMI (Human Machine Interface), historian, spreadsheet, trending application, etc. can connect to the OPC Server and use it to read and write device data. The role of an OPC Server is analogous to that of a printer driver, which enables a computer to communicate with an ink jet printer. An OPC Server is based on a Server/Client architecture.

Data can be imported into The Unscrambler® via OPC. This requires a connection with an

OPC server. Begin by selecting File – Import Data – OPC… to open the OPC Dialog menu.

OPC Dialog

All configured servers on the PC will be recognized and displayed in the list of OPC servers. The user must make selections for the Computer name/IP, the OPC Server, and the OPC Group from the respective drop-down lists; these values can also be typed in directly. Once they have been selected, the available items are listed in the OPC Items list. Select an item and click GO to start retrieving data from the OPC server and populate the fields in the OPC Import dialog. Click Stop to stop the collection process; the collected data are shown in the preview.

OPC Tag - Use this option to specify the OPC tag directly. This is useful when many OPC groups and OPC items are available on the server: specifying the tag avoids the delay of listing and selecting individual OPC groups and OPC items.

Update Rate - This is the rate (in milliseconds) at which data is retrieved from the OPC Server.

Show preview - Check this option to see the last 10 rows retrieved from the OPC Server.

Set number of columns - Use this option to increase the number of columns.
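The interplay of Update Rate and Show preview can be pictured as a polling loop that keeps only the most recent rows for display. The Python sketch below is an illustration only: read_item is a hypothetical callable standing in for an OPC client read and does not belong to any real OPC library, and the loop runs a fixed number of times instead of waiting for Stop.

```python
import time
from collections import deque


def collect(read_item, update_rate_ms, n_reads, preview_rows=10):
    """Poll read_item() every update_rate_ms milliseconds, n_reads times.

    read_item is a hypothetical stand-in for an OPC client read.
    Returns (all_rows, preview), where preview holds the last
    preview_rows rows that were retrieved.
    """
    rows = []
    preview = deque(maxlen=preview_rows)  # older rows drop off automatically
    for _ in range(n_reads):
        row = read_item()
        rows.append(row)
        preview.append(row)
        time.sleep(update_rate_ms / 1000.0)
    return rows, list(preview)


counter = iter(range(100))
rows, preview = collect(lambda: [next(counter)], update_rate_ms=1, n_reads=15)
print(len(rows), preview[0], preview[-1])  # 15 [5] [14]
```

A bounded deque is used here so that the preview never grows beyond ten rows, while the full list of rows is still kept for import.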

Filled OPC Dialog

5.20. OSISoftPI

5.20.1 PI

Type of data

PI Server - real time data collection, archiving and distribution engines

How to use it

PI Import is an add-in that retrieves tags from compiled PI archives and servers and writes the data to The Unscrambler® workbook, where it can be used for regular plotting, transformation and multivariate analysis. Tags are unique storage points for the data in the PI system. Each tag is simply a single point of measurement.

Data can be imported into The Unscrambler® via OSISoft PI.

The PI Import dialog allows the user to specify and connect to an active server. Click Add to search a PI Server for tags using the Tag Search dialog. This dialog allows the user to search all connected PI Servers for tags meeting a given set of criteria, such as one or more tag attribute values. Tags can be selected using the Search option. Three search options are available in the Tag Search dialog: Basic, Advanced and Alias.

Tag Search dialog

After the tags are selected (use Ctrl key for multiple tag selection) from the search list panel

and OK is clicked, they can be seen in the Tags window of the PI Import dialog. For more

details on options available in Tag Search dialog box, click on Help.

The three sections below describe the data modes used to preview and retrieve data for the selected tags from the PI server.

The archive mode searches the archived data within a specified time range. For each tag, the values recorded in the PI data source within the specified time range are retrieved and shown in the preview list. The timestamp (for the tag specified in Tag No) can be imported either as a row header or as the first column.

Data Mode, Archive
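The archive mode amounts to filtering recorded values by a time range and laying them out with one column per tag. The Python sketch below shows that layout with the timestamp as the first column; the dictionary used as an archive, the tag names and the function name are hypothetical and do not represent the PI SDK.

```python
from datetime import datetime


def archive_rows(tag_values, start, end):
    """Build table rows [timestamp, value_tag1, value_tag2, ...].

    tag_values -- dict mapping tag name to a dict {timestamp: value};
                  a hypothetical in-memory stand-in for a PI archive.
    Only timestamps within [start, end] are kept. A tag with no value at
    a given timestamp contributes None.
    """
    tags = sorted(tag_values)
    stamps = sorted(
        {t for series in tag_values.values() for t in series if start <= t <= end}
    )
    header = ["Timestamp"] + tags
    rows = [[t] + [tag_values[tag].get(t) for tag in tags] for t in stamps]
    return header, rows


archive = {
    "TEMP": {
        datetime(2014, 1, 1, 0, 0): 20.5,
        datetime(2014, 1, 1, 1, 0): 21.0,
        datetime(2014, 1, 1, 2, 0): 21.4,
    },
    "PRES": {datetime(2014, 1, 1, 1, 0): 1.01},
}
header, rows = archive_rows(
    archive, datetime(2014, 1, 1, 0, 30), datetime(2014, 1, 1, 2, 0)
)
print(header)
for row in rows:
    print(row)
```

Importing the timestamp as a row header instead would simply mean moving the first element of each row out of the data columns.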

The polling mode retrieves fresh data using a timer-driven method for any of the three events selected. The time interval is set in seconds, and the Start Timer option starts watching for new data. For each tag, the new values recorded in the PI data source are retrieved and can be previewed in the preview list. The timestamp (for the tag specified in Tag No) can be imported either as a row header or as the first column.

Data Mode, Polling

The event-driven method retrieves fresh data based on any of the three events selected. The Start Monitoring option starts watching for new data. For each tag, the new values recorded in the PI data source are retrieved and can be previewed in the preview list. The timestamp (for the tag specified in Tag No) can be imported either as a row header or as the first column.

Data Mode, Event

The help option available in the PISDKUtility provides more details about the use of the PI-SDK configuration utility.

5.21. PerkinElmer

5.21.1 PerkinElmer

Type of data/instrument

UV-Vis, NIR, FTIR, Raman

Data dimensions

Multiple spectra

Instrument/hardware

—

Software

Spectrum 6, Spectrum 10

Vendor

PerkinElmer

File name extension

*.sp, *.spp

How to use it

One or several spectra from files generated by PerkinElmer molecular spectroscopy

instruments (FTIR, Raman and UV-vis) using Spectrum 6 and Spectrum 10 software can be

imported.

When multiple spectra are contained in a file, the preference is to import the normalized spectrum. However, if a file contains a single spectrum (sample or reference alone), that spectrum is imported.

This option supports the import of data from files generated by some PerkinElmer

instruments.

In the PerkinElmer Import dialog box, one can choose a folder where files are stored. A list

of files from which data can be imported is then displayed.

Note: Multiple files that vary in their spectral range and resolution cannot be

imported together.

Select the files to import from the file list in the dialog or use the Browse button to get a list

of available files. The different files must have the same number of X-variables to allow

simultaneous import.

PerkinElmer Import

Multiple selections

Select one or more files to import by checking the check box next to each file, or by using

Auto select matching spectra.

The contents of all the selected spectra will be merged to create one data matrix during import.

Deselect all

Clear the current selection by unselecting all samples.

Preview spectra

Check to review a plot of selected spectra before importing.

Sample naming…

Include sample names or sample numbers in the resulting data table.

Sample names will only be imported if they are present in the source file.

Checking the Interpolate option allows the import of data with different starting and ending points, provided the number of points is the same in all sets to be imported.

When the % button is selected, the Interpolate Tolerance dialog appears, in which the user can set the Tolerance for allowing data with different start or end points to be imported.

Interpolate Tolerance Dialog

Use the Interpolate option to import data with different start or end points.

The Auto select matching spectra preview option provides automatic selection of all data

file(s) with the same wavelength ranges as the current selection. This dialog is used for

import of spectral data from PerkinElmer instruments. A screenshot of the dialog with the

auto select option chosen is given below.

Once Auto select matching spectra has been checked, the files in the list having the same

number of variables will be selected.

Sorting data

The file name, number of X-variables, wavelengths for the first and last X-variables are

displayed for each file.

The data table resulting from the import can be sorted based on any of these columns in the

file list: Click on a column header to set sort order, and a second time to reverse the sort

order.

Preview

Preview spectra displays a line plot of the files selected for import.

5.22. PertenDX

5.22.1 Perten-DX

Type of data/instrument

Vector and arrays. Standard

Data dimensions

Multiple spectra, constituents

Vendor

Perten Instruments following JCAMP/IUPAC

File name extensions

*.jdx, *.dx, *.jcm

How to use it

This is a standard, portable data format defined by JCAMP and modified by Perten to support a few Perten-specific types.

It was originally a standard data format for IR, which has since been extended to

accommodate NMR, mass spec and other data, motivated by the desire to share data

irrespective of the spectrometer on which it was acquired and the need for long-term data

archival, well past the expected lifetime of current hardware and software.

Further development of JCAMP standards is now under the auspices of IUPAC.

One can import one or several Perten-DX files with .jdx, .dx, .jcm file name extensions

into a project in The Unscrambler®.

Select the files to import from the file list in the Perten-DX Import dialog box or use the

Browse button to get a list of available files.

The different files must have the same number of X-variables and the same contents in the

Y-matrix to allow simultaneous import.

Perten-DX Import

Multiple selections

Select one or more files to import by checking the check box next to each file, or by using

Auto select matching spectra.

The contents of all the selected spectra will be merged to create one data matrix during import.

Deselect all

Clear the current selection by unselecting all samples.

Preview spectra

Check to review a plot of selected spectra before importing.

Sample naming…

Include sample names or sample numbers in the resulting data table.

Sample names will only be imported if they are present in the source file.

Checking the Interpolate option allows the import of data with different starting and ending points, provided the number of points is the same in all sets to be imported.

When the % button is selected, the Interpolate Tolerance dialog appears, in which the user can set the Tolerance for allowing data with different start or end points to be imported.

Interpolate Tolerance Dialog

The Auto select matching spectra preview option allows the automatic selection of all data

file(s) with the same wavelength ranges as the current selection.

Use the Interpolate option to import data with different start or end points.

Sorting data

The file name, number of samples, number of X variables, number of Y variables, and

wavelengths for the first and last X-variables are displayed for each file.

The data table resulting from the import can be sorted based on any of these columns in the

file list: Click on a column header to set sort order, and a second time to reverse the sort

order.

Preview

Preview spectra displays line plots of selected files for import.

This format is based on the JCAMP-DX file format. For more information on JCAMP-DX, see the section on Import JCAMP File Format.

General

Perten-DX supports additional tags specific to Perten Instruments. These are:

Tag name Imported in Unscrambler as

Perten-DX file

The example below shows a sample Perten-DX file.

##TITLE=2

##INSTRUMENT S/N=1201530

##INSTRUMENT TYPE=DA7250

##SPECTROMETER S/N=SNIR2148

##JCAMP-DX=4.24

##DATATYPE= NEAR INFRARED SPECTRUM

##LONG DATE=2013-10-18T01:59:18+02:00

##SAMPLE DESCRIPTION=2

##SMOOTHED=YES

##XUNITS= Nanometers (nm)

##YUNITS= Absorbance

##CONCENTRATIONS= (NCU)

(Protein Dry basis,-9.973E+23,<unknown>)

##PERTEN-TYPES= (KV)

(Product Type, Wheat),

(Shape Type, Unknown),

(Tray Type, Large Tray. rotating)

##PERTEN-REPACK=1

##PERTEN-REPEAT=1

##PERTEN-SAMPLEINFO= (KV)

##XFACTOR= 1.0

##YFACTOR= 0.000000001

##FIRSTX= 950.00

##LASTX= 1650.00

##NPOINTS= 141

##DELTAX= 5.0

##XYDATA= (X++(Y..Y))

950.0 186225975 188992413 193629553 199835249 207323496 215294014

222310809 227316331 230163481

995.0 231218537 230973747 229930179 228344771 226101418 223436221

220348573 216993825 213526732

1040.0 210076812 206678859 203519066 200372073 197183083 193896477

190813849 187961026 185361544

1085.0 183060794 181031311 179367942 178144637 177316150 176997467

177158004 178485737 182057610

1130.0 189131917 200696556 216125124 233953784 253292157 272636547

291094037 307752989 322292848

1175.0 335720686 348497384 360603909 370580710 377233357 380561567

380739361 377437577 370749286

1220.0 361610474 351741516 342353572 334328973 327783482 322877222

319254364 316585214 314597761

1265.0 313006114 311340643 309259709 306673122 303654410 300820687

298877629 297995673 298450579

357349953 373092331 389380072

1355.0 405360164 420025538 432690507 443690839 453913399 465033895

478927915 497519241 520603469

1400.0 547701532 578341832 610554253 641977198 670671475 694941644

714033309 728135504 737936222

1445.0 744584470 748870234 751802130 753593537 754701424 754774651

753793482 752142124 750221679

1490.0 747923597 745168624 742032801 738770350 735344011 731975306

728708573 725796673 723188418

1535.0 721043949 719373104 717859979 716709549 715573447 714720046

713740590 712450919 710535970

1580.0 708248969 705216090 701261550 696380943 690796672 684905943

678981726 673139165 666952182

1625.0 661182311 655418737 649996320 644795947 640163793 636351883 0 0 0

##END= $$ 2
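The ##XYDATA= (X++(Y..Y)) block in the example can be read as follows: each data row starts with the X value of its first point, and the Y ordinates that follow are spaced ##DELTAX apart and scaled by ##YFACTOR. The Python sketch below expands one such row. It is an illustration rather than the importer used by The Unscrambler®; the function name is hypothetical and only plain whitespace-separated numbers are handled, not the compressed JCAMP encodings.

```python
def parse_xydata_line(line, delta_x, y_factor, x_factor=1.0):
    """Expand one '(X++(Y..Y))' data row into a list of (x, y) pairs.

    The first number is the X value of the first point; the remaining
    numbers are Y ordinates spaced delta_x apart. Stored values are
    multiplied by x_factor and y_factor to obtain real units.
    """
    fields = line.split()
    x0 = float(fields[0]) * x_factor
    return [
        (x0 + i * delta_x, int(raw) * y_factor)
        for i, raw in enumerate(fields[1:])
    ]


# First data row of the example above, with its wrapped continuation joined.
row = ("950.0 186225975 188992413 193629553 199835249 207323496 215294014 "
       "222310809 227316331 230163481")
points = parse_xydata_line(row, delta_x=5.0, y_factor=0.000000001)
print(len(points), points[0][0], points[-1][0])  # 9 950.0 990.0
```

The nine points of this row end at 990.0 nm, which agrees with the next row starting at 995.0. The header values are consistent in the same way: (1650.00 - 950.00) / 5.0 + 1 gives the 141 points stated in ##NPOINTS.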

5.23. RapID

5.23.1 RapID

Type of data

Array

Data dimensions

single vector spectrum

Instrument/hardware

Particle size analysers

Raman Spectrometers

Laser Induced Breakdown Spectrometers (LIBS)

Vendor

rap-ID Particle Systems

File name extension

*.txt, *.jcm

How to use it

This option allows for the import of .txt and .jcm data files from rap-ID particle size analyzers.

One or several rap-ID files (.txt or .jcm) can be imported into a project in The Unscrambler®.

Select the files to import from the file list in the RAP-ID Import dialog or use the Browse button to display a list of available files. The different files must have the same number of X-variables to allow simultaneous import.

RAP-ID Import

Multiple selections

Select one or more files to import by checking the check box next to each file, or by using

Auto select matching spectra.

The contents of all the selected spectra will be merged to create one data matrix during

import.

Deselect all

Clear the current selection by unselecting all samples.

Preview spectra

Check to review a plot of selected spectra before importing.

Sample naming…

Include sample names or sample numbers in the resulting data table.

Sample names will only be imported if they are present in the source file.

The Auto select matching spectra preview option allows the automatic selection of all data file(s) with the same wavelength ranges as the current selection. A screenshot of the RAP-ID Import dialog with the auto select option chosen is provided below.

Once Auto select matching spectra has been checked, only those files that have the same number of variables are selected.

Sorting data

The file name, number of samples and number of X-variables are displayed for each file.

The data table resulting from the import can be sorted based on any of these columns in the

file list: Click on a column header to set sort order, and a second time to reverse the sort

order.

Preview

Preview spectra displays a line plot of the files selected for import.

5.24. U5Data

5.24.1 U5 Data

File name extension

*.UNS

How to use it

Imports data files from earlier versions of The Unscrambler® (versions 3.0 - 5.5). If the file to be imported contains several matrices, a dialog pops up to let the user specify which matrices to import.

Note: The Unscrambler® recognizes the extensions: .UNS, .UNM, .UNP, and .CLA. Rename the files if they have other extensions.

Imports data files from earlier versions of The Unscrambler® (versions 3.0 - 5.5). If the file to be imported contains several matrices, all of the matrices will be available to import. The user can define which matrices to import. When multiple matrices are selected, they will be combined into a single matrix.

Select the files to import from the file list in the U5 Import dialog box or use the Browse

button to obtain a list of available files. The U5 Import dialog box displays a list of matrices

from which one may import U5 data. This includes the matrix names, the number of rows,

and the number of columns. When selecting multiple matrices, use the radio buttons at the

top to specify whether they should be combined in terms of rows or columns.

U5 Data import
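Combining several matrices in terms of rows or columns can be sketched as below. This Python example is an illustration and not the import routine itself; the function name is hypothetical, and the requirement that the matrices agree in the shared dimension is an assumption that the manual does not state explicitly.

```python
def combine(matrices, by="rows"):
    """Combine matrices (lists of rows) by stacking rows or joining columns."""
    if by == "rows":
        # Stacking rows: assumed to need the same number of columns everywhere.
        n_cols = len(matrices[0][0])
        if any(len(row) != n_cols for m in matrices for row in m):
            raise ValueError("all matrices need the same number of columns")
        return [list(row) for m in matrices for row in m]
    if by == "columns":
        # Joining columns: assumed to need the same number of rows everywhere.
        n_rows = len(matrices[0])
        if any(len(m) != n_rows for m in matrices):
            raise ValueError("all matrices need the same number of rows")
        return [sum((list(m[i]) for m in matrices), []) for i in range(n_rows)]
    raise ValueError("by must be 'rows' or 'columns'")


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(combine([a, b], by="rows"))     # [[1, 2], [3, 4], [5, 6], [7, 8]]
print(combine([a, b], by="columns"))  # [[1, 2, 5, 6], [3, 4, 7, 8]]
```

Combining by rows adds samples to the table, while combining by columns adds variables for the same samples.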

5.25. UnscFileReader

5.25.1 The Unscrambler® 9.8

Type of data

Array

Software

The Unscrambler® 9.8

Vendor

CAMO Software

File name extensions

*.??M, *.??D

How to use it

The Unscrambler® X features a new file format, but files created by versions 9.2 to 9.8 can

be imported.

Import data and model matrices from files made by versions 9.2 to 9.8 of The Unscrambler®

into the Editor.

Select a file and the imported data and plots will appear in the project navigator.

Not all plots are available for models that were created in versions of The Unscrambler® before 9.8. In such instances, it is recommended to import the data and rebuild the models.

The Unscrambler® 9.x used the file name extensions listed below to distinguish between

different data types:

The Unscrambler® 9.x files | File name extension

Statistics | .10D

PCA | .11M

Prediction | .30D

Classification | .31D

MLR | .40M

PLS1 | .41M

PLS2 | .42M

PCR | .43M

MSC | .50D

Each of the .??D files above may have the following corresponding additional files:

.??P Preference file (settings for the file when it closes)

.??T Notes file

.??W Warnings file

The Unscrambler® 9.8 introduced a merged file format combining .??[DLPTW] into one file,

.??M.
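The naming scheme above, two digits for the content type followed by one letter for the kind of file, can be decoded with a small lookup. The Python sketch below is illustrative: the type codes are taken from the table above, the role labels "Main file" and "Merged file" are wording chosen for this example, and the "L" entry for a log file is inferred from the .??[DLPTW] pattern and the mention of log files in the text.

```python
import os

# Type codes taken from the table of The Unscrambler 9.x file name extensions.
TYPE_BY_CODE = {
    "10": "Statistics",
    "11": "PCA",
    "30": "Prediction",
    "31": "Classification",
    "40": "MLR",
    "41": "PLS1",
    "42": "PLS2",
    "43": "PCR",
    "50": "MSC",
}

# Role labels; "L" (log file) is an inference, see the lead-in.
ROLE_BY_LETTER = {
    "D": "Main file",
    "M": "Merged file",
    "P": "Preference file",
    "T": "Notes file",
    "W": "Warnings file",
    "L": "Log file",
}


def describe(filename):
    """Return (content type, file role), or None if the extension does not
    follow the two-digits-plus-letter form or is not in the tables."""
    ext = os.path.splitext(filename)[1].lstrip(".").upper()
    if len(ext) != 3:
        return None
    content = TYPE_BY_CODE.get(ext[:2])
    role = ROLE_BY_LETTER.get(ext[2])
    if content is None or role is None:
        return None
    return content, role


print(describe("wheat.41M"))  # ('PLS1', 'Merged file')
print(describe("wheat.41W"))  # ('PLS1', 'Warnings file')
```

A lookup like this can help confirm that all files belonging to one result have been copied together before they are moved to another location.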

A few details to remember about the file sets that comprise each data table or saved result:

When transferring data to another place using the Windows Explorer, make sure

that all the associated physical files are copied!

Do not change the file name extensions The Unscrambler® uses. Doing so may cause problems when accessing the files from within The Unscrambler®.

The log and notes files are plain ASCII files which can be opened and viewed using a

text editor.

5.26. UnscramblerX

5.26.1 The Unscrambler® X

Type of data

Array

Software

The Unscrambler® X

Vendor

CAMO Software

File name extensions

*.unsb

How to use it

Files in the native format of The Unscrambler® X have the .unsb file name extension. This is a proprietary binary format made specifically for The Unscrambler® to provide fast and efficient storage of large data sets and multivariate models.

This option allows one to import data tables and models from another The Unscrambler® X

project file.

How to import data

Use File – Import Data – Unscrambler X…

After selecting the import target, click OK to enter the Import dialog.


5.27. Varian

5.27.1 Varian

Type of data/instrument

—

Data dimensions

Multiple spectra, constituents

Instrument/hardware

Cary UV-Vis

Software

—

Vendor

Varian, Inc.

File name extension

*.bsw

How to use it

This option allows one to import data from files generated by Varian instruments (Cary UV-Vis instruments) and related software.

Source files with the .bsw file name extension are supported.

The Varian Import dialog box displays a list of files from which one can import Varian data.

If necessary, click the Browse button close to the Look in: field in order to access files from a

different folder.

VARIAN Import

The source files may contain one or more samples per file. Multiple selections allow several

samples to be imported at the same time.

Select one or more files to import by checking the check box next to each file, or by using

Auto select matching spectra.

The contents of all the selected spectra will be merged to create one data matrix during import.

Deselect all

Clear the current selection by unselecting all samples.

Preview spectra

Check to review a plot of selected spectra before importing.

Sample naming…

Include sample names or sample numbers in the resulting data table.

Sample names will only be imported if they are present in the source file.


Checking the Interpolate option allows the import of data with different starting and ending points, provided the number of points is the same in all sets to be imported.

When the % button is selected, the following dialog appears allowing a user to set

the Tolerance for allowing data with different start or end points to be imported.

Interpolate Tolerance Dialog

Use the Interpolate option to import data with different start or end points.

The Auto select matching spectra option automatically selects all data files with the same wavelength range as the current selection. This dialog is used to import spectral data from instruments that use the Varian file format.


Once the Auto select matching spectra option has been checked it will select the files having

the same variables from the list.


Sorting data

The file name, number of samples, number of X variables, number of Y variables, and

wavelengths for the first and last X-variables are displayed for each file.

The data table resulting from the import can be sorted based on any of these columns in the

file list: Click on a column header to set sort order, and a second time to reverse the sort

order.

Preview

Preview spectra displays a line plot of the files selected for import. A screenshot of the Varian Import dialog with the preview spectra chosen is given below.


5.28. VisioTec

5.28.1 VisioTec

Type of data/instrument

Data dimensions

single vector spectrum or multiple spectra in an array

Instrument/hardware

Uhlmann VisioTec NIR inspection systems

Vendor

VisioTec

File name extension

*.ldf, *.dat

How to use it


This option allows a user to import data files created with the Uhlmann VisioTec NIR inspection systems. Source files with the following file name extensions are supported: .ldf or .dat.

Select the files to import from the file list in the VisioTec Import dialog box or use the

Browse button to obtain a list of available files. The VisioTec Import dialog box displays a list

of files from which one may import VisioTec data. This includes the file names, the number

of X-variables, names of the First and Last X-variables and step size.

VisioTec Import

The source files may contain one or many samples per file; multiple selection allows for the

import of several files (blocks of data) at the same time.

Multiple selections

Select one or more files to import by checking the check box next to each file, or by using

Auto select matching spectra.

The contents of all the selected spectra will be merged to create one data matrix during

import.

Deselect all

Clear the current selection by unselecting all samples.

Preview spectra


Sample naming…

Include sample names or sample numbers in the resulting data table.

Sample names will only be imported if they are present in the source file.


6. Export

6.1. Exporting data

This section describes how to export data from The Unscrambler®.

The Unscrambler® can export data in the following data formats:

ASCII

JCAMP-DX

NetCDF

Matlab

AMO: The Unscrambler® ASCII Model

DeltaNu

Select a format from the File – Export menu, which will open an Export dialog specific to the

given file format.

After selecting the model, or the data matrix and range to export, entering meta data and

other storage options, press OK to specify the directory and file name to save the exported

data to.

6.2. AMO

6.2.1 Export models to ASCII

The Unscrambler® ASCII-MOD file is an ASCII-based file format used to transfer models from

The Unscrambler® to compatible instruments and prediction software.

How to use it

The Unscrambler® ASCII-MOD file is an easy-to-read ASCII-based file format capable of

representing models created by The Unscrambler® and contains all information necessary

for prediction and classification.

The file format is used to transfer models to compatible instruments and prediction

software.

The files are saved with a .amo file name extension.

ASCII-MOD export dialog


Select model

A drop-down list contains all models found in the currently open project. Select the

one to export.

Type

Choose between Full and Short prediction storage, where the second is used to

achieve smaller file size when only the regression coefficients are used for

prediction.

PCs

The number of Principal Components or factors to include in the exported model.

Y-Variable

Select the Y-variables to include with the model.

Press OK and use the file dialog to select the destination directory and give a file name to

save the model.

File structure

An ASCII-MOD file contains all information necessary for prediction and classification.

The ASCII-MOD file is an easy-to-read ASCII file. The table below lists the matrices which are

found in the ASCII-MOD file, depending on the type of ASCII-MOD file and type of model.

When generating an ASCII-MOD file, one can choose between “Short” (referred to as “Mini”

in previous versions of the software) and “Full” storage. Matrices stored under these options

are indicated with ‘x’ in the table.

ASCII-MOD file matrices

Matrix name Short Full PCA Full Regr. Rows Columns

B0 x x PC (1-a) 1 row

ResXValTot x x PC (0-a)


Table of result matrices:

RMSEP, SEP, Bias, Slope, Offset, Corr, SEPcorr, ICM-Slope, ICM-Offset

Note: The contents of the columns “Rows” and “Columns” show the contents of the ASCII-MOD file, not the contents of the matrices in the main model file.

TYPE=FULL // (MINI,FULL)

VERSION=1

MODELNAME=F:\U\EX\DATA\TUTBPCA.11D

MODELDATE=10/27/95 11:41:13

CREATOR=Joe Doe

METHOD=PCA // (PCA, PCR, PLS1, PLS2)

CALDATA=F:\U\EX\DATA\TUTB.00D

SAMPLES=28


XVARS=16

YVARS=0

VALIDATION=LEVCORR // (NONE,LEVCORR,TESTSET,CROSS)

COMPONENTS=2

SUGGESTED=2

CENTERING=YES // (YES,NO)

CALSAMPLES=28

TESTSAMPLES=28

NUMCVS=0

NUMTRANS=2

TRD:DNO // ,,,,,,,complete transformation string

TRD:DSG // ,,,,,,,complete transformation string

NUMINSTRPAR=1

##GAIN=5.2

MATRICES=13

"xWeight" // (Name of 13 matrices)

"xCent"

"ResXValTot"

"ResXCalVar"

"ResXValVar"

"ResXCalSamp"

"Pax"

"Wax"

"SquSum"

"TaiCalSDev"

"xCalMean"

"xCalSDev"

"xCal"

%XvarNames

"Xvar1" "Xvar2" "Xvar3" "Xvar4"

"Xvar5" "Xvar6" "Xvar7" "Xvar8"

"Xvar9" "Xvar10" "Xvar11" "Xvar12"

"Xvar13" "Xvar14" "Xvar15" "Xvar16"

%xWeight 1 16

.1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01

.1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01

.1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01

.1000000E+01

%xCent 1 16

.1677847E+01 .2258536E+01 .2231011E+01 .2404268E+01 .2179311E+01

.2470489E+01 .2079168E+01 .1734536E+01 .1475164E+01 .1480657E+01

.1644097E+01 .1805900E+01 .1980229E+01 .1795443E+01 .1622796E+01

.1497418E+01

,,,

,,,etc.
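The listing above can be read with a few lines of code. The following is a minimal sketch of a reader, not CAMO's own implementation: the function names are hypothetical, and it assumes that header fields are written as KEY=VALUE with optional trailing `//` comments, that numeric matrices start with a `%Name rows columns` header, that values are stored row by row, and that a missing value is written as `m`. Name lists such as `%XvarNames` and the transformation lines are skipped.

```python
def parse_ascii_mod(text):
    """Split ASCII-MOD text into header fields and numeric matrices."""
    header = {}
    matrices = {}
    current = None  # name of the matrix currently being filled, if any
    for raw in text.splitlines():
        line = raw.split("//")[0].strip()  # drop trailing // comments
        if not line:
            continue
        if line.startswith("%"):
            parts = line[1:].split()
            if len(parts) == 3:  # "%Name rows columns"
                current = parts[0]
                shape = (int(parts[1]), int(parts[2]))
                matrices[current] = {"shape": shape, "values": []}
            else:  # e.g. "%XvarNames": a name list, skipped in this sketch
                current = None
        elif current is not None:
            for token in line.split():
                value = None if token == "m" else float(token)
                matrices[current]["values"].append(value)
        elif "=" in line:
            key, value = line.split("=", 1)
            header[key.strip().lstrip("#")] = value.strip()
    return header, matrices


def matrix_rows(matrix):
    """Reshape the flat value list into rows (assumes row-by-row storage)."""
    n_rows, n_cols = matrix["shape"]
    values = matrix["values"]
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]
```

After `header, matrices = parse_ascii_mod(open("model.amo").read())`, a field such as `header["METHOD"]` and a matrix such as `matrix_rows(matrices["xCent"])` are available; the file name here is only an illustration.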

Description of fields

The table below lists the data field codes used in ASCII-MOD files.

Description of fields

Field Description

VERSION Increases by one for each change of the file format after release

MODELDATE Date of creation of the model (not the ASCII-MOD file)

CREATOR Name of the user who made the model (not the ASCII-MOD file)

VALIDATION (TEST,LEV,CROSS)

SUGGESTED ... ASCII-MOD file)

CENTERING (YES,NO)

INSTRUMENT PARAM. See below

MATRICES Number of matrices in this file. One name for each matrix follows below

Transformation strings

There is one line for each transformation. The format of the line depends on the type of transformation. If a transformation needs more data, as is the case for MSC, the extra data are stored as matrices at the end of the file. These matrices are referenced by name.

Examples

A transformation named TRANS using one parameter could look like this:

TRANS:TEMP=38.8;

A MSC transformation may look something like this:

MSC:VARS=19,SAMPS=23,MEAN="ResultMatrix18",TOT=" ResultMatrix19";

Transformation strings may also contain an error status, which is the case when the MSC base has been deleted from the file before the ASCII-MOD file is made.
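The two example strings above follow a NAME:KEY=VALUE,KEY=VALUE; pattern. A minimal sketch of how such a line could be split is given here; the function name is hypothetical, and it assumes that commas never occur inside quoted values and that all values can be kept as text.

```python
def parse_transformation(line):
    """Split a transformation string into its name and its parameters."""
    text = line.strip().rstrip(";")
    name, _, parameters = text.partition(":")
    fields = {}
    for item in parameters.split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            # Remove the surrounding quotes and any padding inside them.
            fields[key.strip()] = value.strip().strip('"').strip()
    return name, fields
```

For the MSC example, the MEAN and TOT entries give the names of the matrices that hold the extra data at the end of the file.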


Transformation strings

Main Description Secondary Description

CLA Classification

PRE Prediction

STA Statistics

VAR Variable

VEC Vector

VAR Variable

IMP Import —

VAR Variable

REP Replace —

BAS Baseline


RED Reduce

TSP Transpose

USR User-Defined

Storage of matrices

Each matrix starts with a header as in this example:

%Pax 10 155

This header states that the matrix is named Pax and has 10 rows and 155 columns. The data elements follow from the next line, row by row, in the following sequence:

Pax(1,1) Pax(1,2) , , , Pax(1,xvars)

Pax(2,1) Pax(2,2) Pax(2,3) , , ,

, ,

Pax(comp,1) Pax(comp,2) , , , Pax(comp,xvars)

A missing value will simply be written as the character m.

If the calibration model was made with one Y-variable, the AMO file uses PLS1; if it was made with more than one Y-variable, the AMO file uses PLS2.

6.3. ASCII

6.3.1 ASCII export

The ASCII export option is very useful if one wants to work with the data table in another

program.

How to use it


Many other programs can read ASCII files. This export option therefore is very useful if one

wants to work with the data table in another program.

ASCII export dialog

Select the matrix and data ranges that make up the data to be exported, or use Define to

create a new range.

Options

Include headers

Specify whether sample names and variable names are to be exported by selecting them in the Include headers field. They will be placed in the first column and in the first row, respectively.

Name qualifier

String data, such as headers, may be quoted, using either double quotes ", or single

quotes '.

It is recommended to mark text with quotes and not mark numbers, because it

makes it easier for importing programs to assign correct data types to text and

numbers.

Default is ".

Numeric qualifier

Numeric data may be quoted in the same way as headers.


Default is None.

Item delimiter

Table cell entries may be delimited by different characters.

Default is ,.

String representation of missing data

Specify how missing data are to be coded in the ASCII file.

Default is m.

For compatibility with software that doesn’t have support for importing missing data

as strings, use a large negative number, such as -9.9730e+023 instead.
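A file exported with the default options can be read back in another program along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, both headers are assumed to be included, the first row is assumed to begin with a corner cell above the sample names, and the defaults described above are used (item delimiter `,`, name qualifier `"`, missing data written as `m`).

```python
import csv
import math


def read_exported_table(stream, delimiter=",", quotechar='"', missing="m"):
    """Read an exported ASCII table into names and numeric rows.

    Missing entries are returned as NaN.
    """
    reader = csv.reader(stream, delimiter=delimiter, quotechar=quotechar)
    rows = [row for row in reader if row]  # ignore empty lines
    variable_names = rows[0][1:]  # first row, after the corner cell
    sample_names = []
    data = []
    for row in rows[1:]:
        sample_names.append(row[0])  # first column holds the sample name
        data.append([
            math.nan if cell.strip() == missing else float(cell)
            for cell in row[1:]
        ])
    return variable_names, sample_names, data
```

Typical use is `with open("table.csv", newline="") as f: names, samples, data = read_exported_table(f)`, where the file name is only an example.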

6.4. DeltaNu

6.4.1 DeltaNu

The DeltaNu file is a model file format developed for use with the DeltaNu Pharma-ID Raman spectrometers. It contains all the necessary information for projection and classification. PCA models created in The Unscrambler® X can be exported to this file format. Such models are compatible with DeltaNu Raman instrumentation for real-time projections.

The files are saved with a .dnub file name extension.

How to use it

To export a PCA model to the DeltaNu format, go to File – Export – DeltaNu… and the following dialog will appear.

DeltaNu export dialog

Select model

A drop-down list contains all models found in the currently open project. Select the

one to export. Only PCA models are supported in the DeltaNu format.

PCs

The number of Principal Components to include in the exported model. The default

value given is the optimal number of PCs for the model. It is recommended to export

a model with the optimal number of PCs. To export the model with a different

number of PCs use the drop-down list to choose a different number of PCs.


Press OK and use the file dialog to select the destination directory and give a file name to

save the model.

6.5. JCampDX

6.5.1 JCAMP-DX export

How to use it

The JCAMP-DX format is read by many instrument software packages. This file format requires that

the X-part of the data have numerical names, e.g. wavelengths, wavenumbers, retention

times, etc.

JCAMP-DX export dialog: Select data

Select the matrix and data ranges that make up the data to be exported, or use Define to

create a new range.

Metadata

Then, in the File Info tab, enter information related to the JCAMP-DX file as a whole. Here

one must choose between two JCAMP-DX formats: XYPoints and XYData. XYData requires

that the distance between each variable is the same throughout the whole X-Variable Set.

XYData produces smaller file sizes than XYPoints.
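Whether XYData is an option for a given data set can be checked by testing the spacing of the X-variable names. The sketch below is illustrative only: the function name and the tolerance are assumptions, and the exact test applied by The Unscrambler® is not documented here.

```python
import math


def suggest_jcamp_format(x_values, rel_tol=1e-6):
    """Suggest XYData for evenly spaced X-variables, otherwise XYPoints.

    rel_tol is an assumed tolerance for rounding in the variable names.
    """
    steps = [b - a for a, b in zip(x_values, x_values[1:])]
    if not steps:
        return "XYPoints"  # a single point has no spacing to compare
    uniform = all(math.isclose(step, steps[0], rel_tol=rel_tol) for step in steps)
    return "XYData" if uniform else "XYPoints"
```

For wavelengths 1100, 1102, 1104 the spacing is constant, so XYData can be used; if one step differs, only XYPoints applies.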

JCAMP-DX export dialog: File info


Title

Name of the data set

Origin

Can be the name of the lab, client name, batch number, or location where data

came from.

Owner

Name of the person conducting the experiment or the analysis.

Enter information related to the samples in the Samples Info tab. This information is saved

with each sample.

JCAMP-DX export dialog: Sample info


Sample names

Select either Use sample name from data table or Use text to specify manually

Sampling procedure

Details on how the data was collected.

Data processing

List the transformations applied to prepare the data.

Data type

Select appropriate value from the drop-down list.

X units

Select appropriate value from the drop-down list.

Y units

Select appropriate value from the drop-down list.

Click OK to save the file.

6.6. Matlab

6.6.1 Matlab export

The Unscrambler® provides the capability to export data tables to Matlab including sample

names (row headings in The Unscrambler®) and variable names (column names in The

Unscrambler®).

How to use it


Matlab export dialog

Select the matrix and data ranges that make up the data to be exported, or use Define to

create a new range.


Options

Select whether sample and variable names should be exported. If this option is selected then

these names are stored in separate arrays within the export file as normally done in Matlab.

Select Use Compression to use gzip-compression for arrays stored to the Matlab file. This

will reduce the file size.

The exported data is saved as filename.mat, where “filename” represents the name

entered for the file on saving.

To load the converted file, type load filename in the Matlab command window. If the data

are exported without sample and variable names, the filename.mat file contains one

variable called “Matrix” that contains The Unscrambler® worksheet data.

Sample and variable names

If the data are exported with sample and variable names, the file contains 2

additional arrays: “ObjLabels” and “VarLabels”.

“ObjLabels” contains row (sample) names.

“VarLabels” contains column (variable) names.

Both are character arrays.

Missing Value Conversion

Missing values in a worksheet in The Unscrambler® are converted to the number -9.9730e+023.

Converting category variables

Category variables are converted into integers.

Note: The array names (“Matrix”, “VarLabels”, and “ObjLabels”) are the same in

each exported file from The Unscrambler®. Thus, if several converted files are

loaded into Matlab, rename the variables in Matlab after each load command or

they will be overwritten by subsequent import operations.
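When the exported matrix is processed outside The Unscrambler®, the missing value code described above usually has to be turned back into a proper missing value. The following sketch shows one way to do this in Python, assuming the matrix has already been loaded as a list of rows; the function name and the comparison tolerance are assumptions, since the stored number may carry more digits than the printed -9.9730e+023.

```python
import math

# Missing value code used on export, as printed in this manual.
MISSING_CODE = -9.9730e+23


def restore_missing(matrix, code=MISSING_CODE, rel_tol=1e-4):
    """Return a copy of the matrix with the missing value code replaced by NaN."""
    return [
        [math.nan if math.isclose(value, code, rel_tol=rel_tol) else value
         for value in row]
        for row in matrix
    ]
```

Ordinary data values are left unchanged, because no realistic measurement lies within the tolerance of the code.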

6.7. NetCDF

6.7.1 NetCDF export

How to use it

NetCDF (Network Common Data Format) is a set of software libraries and machine-

independent data formats that support the creation, access, and sharing of array-oriented

scientific data.

Upon choosing File – Export – NetCDF… an export dialog will open:

Select the matrix and data ranges that make up the data to be exported, or use Define to create a new range.

Metadata

In the field Global Attributes, enter all other relevant details:

Data set origin

Can be the name of the lab, client name, batch number, or location where data

came from.

Equipment ID

Can be the product name, product number, serial number, or IP address of the

instrument used.

Equipment manufacturer

Name of the instrument vendor.

Equipment type

Type of instrument used, e.g. NIR.

Operator name

Name of the person conducting the experiment or the analysis.

Experiment date time

Date and time of the data collection. It is suggested to enter the date according to

the ISO 8601 standard, e.g. 2010-01-27T09:55:41+0100.

All attributes are optional. It is generally recommended to add metadata to files for better

file search results.
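A date and time in the suggested form can be produced programmatically before it is typed or pasted into the Experiment date time field. This is a small sketch using the Python standard library; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def experiment_timestamp(moment):
    """Format a time-zone-aware datetime as in the example above."""
    if moment.tzinfo is None:
        raise ValueError("attach a time zone so the UTC offset can be written")
    return moment.strftime("%Y-%m-%dT%H:%M:%S%z")


# The example from the text: 27 January 2010, 09:55:41, one hour ahead of UTC.
example = datetime(2010, 1, 27, 9, 55, 41, tzinfo=timezone(timedelta(hours=1)))
print(experiment_timestamp(example))  # prints 2010-01-27T09:55:41+0100
```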


6.8. UnscFileWriter

6.8.1 Export models to The Unscrambler® v9.8

The Unscrambler® 9.8 file is the previous file format and models in this format contain all

the necessary information for prediction and classification. Models (PCA, MLR, PCR and PLS)

created in The Unscrambler® X can be exported to this previous file format using the File

writer plug-in. Such models are compatible with OLUP and OLUC 9.8 software for real-time

classification and prediction.

How to use it

Model files (MLR, PCR, PLSR and PCA) can be exported to The Unscrambler® 9.8 format using

the File Writer plug-in.

Some methods and features that were not available in Unscrambler® 9.8 cannot be

exported. These include:

Models registered with the following pretreatments

Orthogonal Signal Correction (OSC)

Correlation Optimized Warping (COW)

Weights

Deresolve

Quantile Normalization

Basic ATR correction (Spectroscopic transformation)

Models with cross validation based on category variable

The following classification models

Linear Discriminant Analysis (LDA, PCA-LDA)

Support Vector Machine Classification (SVM-C)

SIMCA classification

Support Vector Machine Regression (SVM-R)

Prediction, classification or projection results from The Unscrambler® X

The Unscrambler® 9.x used the file name extensions listed below to distinguish between

different data and model types:

The Unscrambler® 9.x files File name extension

PCA .11M

MLR .40M

PLS1 .41M

PLS2 .42M

PCR .43M


Unscrambler export dialog

Available models

A drop-down list contains all models found in the currently open project that can be

exported. Select the one to export.

Model Information

This contains details about the model selected

Notes

The time the chosen model was created is given here, along with any other

information that has been added to the Notes section of the chosen model. Users

may also add additional information in the Notes section, which will be available in

the exported model.

Save model with components

Use the components box to select the correct number of components for saving the

model in 9.8 format. The set number of components for the model will be displayed

and used by default.

Save as micro model

This check box allows the user to save the model in the 9.8 micro format.

Press OK and use the file dialog to select the destination directory and give a file name to

save the model.


7. Plots

7.1. Line plot

A line plot displays a single series of numerical values with a label for each element. The plot

has two axes:

The horizontal axis shows the labels, in the same physical order as they are stored in

the source file;

The vertical axis shows the scale for the plotted numerical values.

As a Curve

A curve linking the successive points is more relevant for studying a profile, and when the labels displayed on the horizontal axis are ordered in some way (e.g. PC1, PC2, PC3).

Line Plot: Curve display for following a batch evolution

With Symbols

Symbols produce the same visual impression as a 2-D scatter plot (see Scatter Plot),

and are therefore not recommended.

Line plot: symbol display


Several series of values which share the same labels can be displayed on the same line plot.

The series are then distinguished by means of colors.

Line plot: 2 series with curve display

A bar plot displays a single series of numerical values with a label for each element. The plot

has two axes:

The horizontal axis shows the labels, in the same physical order as they are stored in

the source file;

The vertical axis shows the scale for the plotted numerical values.

Bar plot of a series


Several series of values which share the same labels can be displayed on the same bar plot.

The series are then distinguished by means of colors, and an additional layout is possible:

accumulated or stacked bars. Accumulated bars are relevant if the sum of the values for

series1, series2, etc. has a concrete meaning (e.g. total production or composition).

Two layouts of a bar plot for two series of values: Bars and Accumulated Bars


A 2-D scatter plot displays two series of values which are related to common elements. The

values are shown indirectly, as the coordinates of points in a 2-dimensional space: one point

per element.

As opposed to the line plot, where the individual elements are identified by means of a label

along one of the axes, both axes of the 2-D scatter plot are used for displaying a numerical

scale (one for each series of values), and the labels may appear beside each point.


A regression line visualizing the relationship between the two series of values


Plot statistics, including among others the slope and offset of the regression line

(even if the line itself is not displayed) and the correlation coefficient.
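The slope, offset and correlation coefficient mentioned above can be reproduced outside the program with ordinary least squares. The sketch below assumes the regression line is the least-squares fit of the vertical-axis series on the horizontal-axis series and that the correlation is the Pearson coefficient; the manual does not spell out the formulas, so treat this as an illustration only.

```python
import math


def regression_statistics(x, y):
    """Return (slope, offset, correlation) for y regressed on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x  # the line passes through the means
    correlation = sxy / math.sqrt(sxx * syy)
    return slope, offset, correlation
```

For points that lie exactly on y = 2x + 1, the function returns a slope of 2, an offset of 1 and a correlation of 1.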

A 3-D scatter plot displays three series of values which are related to common elements. The

values are shown indirectly, as the coordinates of points in a 3-dimensional space: one point

per element.

A 3-D scatter plot


All the plots can be customized. This is done from the Properties dialog, which is accessed by right-clicking the plot and selecting the Properties menu.

When selecting the Properties menu, the Plot properties dialog appears.

Each of the following items can be modified:

Axis X, its gridlines and axis labels

The visibility, the title with its font and position, the scale - both its appearance

(logarithmic or reversed) and its labels - and origin can be modified on the X axis.

The axis label rotation can also be set in this menu.

Properties Axis X


Access to the same possibilities as the Axis X and its gridlines.

Appearance

Four different items can be customized from this menu and its sub-menu:

Background

Header: title, color, font, visibility, color of the background

Legend: title, color, font, visibility, color of the background

Plot Area: Chart area, color, font, visibility, borders, surface

Properties Appearance


For the Header and Legend the text can be edited. One can customize the name,

such as only having part of the name displayed, the font and the color.

Properties Header

Graphic Objects

It is possible to include some graphical objects in the plot such as line, arrow,

rectangle, ellipse and text. Each of those objects can be configured in terms of color,

thickness and font if necessary.

3-D scatter plots can be enhanced by:

Addition of vertical lines

They “anchor” the points and can facilitate the interpretation of the plot.

A 3-D Scatter plot displayed with anchors


To add vertical lines, click on More (see section below on Additional Options).

Rotation

The plot can be rotated so as to show the relative positions of the points from a

more relevant angle; this can help detect clusters. Click on the plot and move it with

the cursor in the appropriate direction.

A 3-D Scatter plot after rotation


The axes can be interchanged in the plot using the arrows on the toolbar. If more than three columns are selected, the axes can be changed from the drop-down lists next to the axis arrows on the toolbar.

Additional options

Click on More to access more options for 3D scatter plots.

Scroll through the Gallery, Data and 3D-View options to customise the appearance of 3D Scatter Plots. These features are described in the following.

3D Scatter plot gallery

Select from the gallery of plots to obtain the desired appearance of the plot.

3-D Scatter plot data


3-D Scatter plot 3-D view properties dialog

The rotation, perspective, and axis scales can be changed under the 3-D view tab.


The matrix or surface plot can be seen as the 3-dimensional equivalent of a line plot to

display a whole table of numerical values with a label for each element along the 2

dimensions of the table. The plot has up to three axes:

The first two show the labels, in the same physical order as they are stored in the

source file;

The vertical axis shows the scale for the plotted numerical values.

Depending on the layout, the third axis may be replaced by a color code indicating a range of values, thus making the surface plot essentially a contour plot or a map plot when looking at it straight from above. To change the layout, right-click the plot and select Plot type for a shortcut to predefined layouts, or select Properties to customize 3-D plots and make changes to the axes, legends, etc.

The Plot type submenu

The points can either be represented individually, or summarized according to one of the

following layouts:

Surface

It shows the table as a 3-D landscape.

Matrix plot with a landscape display

Contour

The contour plot has only two axes. A few discrete levels are selected, and points

(actual or interpolated) with exactly those values are shown as a contour line. It

looks like a geographical map with altitude lines;

Matrix plot with a contour display


This option is accessible from Plot type – Contour, or the Properties of the plot:

Surface plot menu

Map

On a map, each point of the table is represented by a small colored square, the color

depending on the range of the individual value. The result is a completely colored

rectangle, where zones sharing close values are easy to detect. The plot looks a bit

like an infrared picture.


This option is accessible from Plot type – Map, or from the Properties of the plot, where the option is Scatter chart, zoned, 2D projection.

Scatter plot menu

Bars

This option gives roughly the same visual impression as the landscape plot if there

are many points, otherwise the “surface” appears more rugged.

Matrix plot with a 3-D bar display


Bar plot menu

3-D-Scatter is also accessible via this Properties menu, see 3-D scatter plot for help on that

plot.


A histogram summarizes a series of numbers without actually showing any of the original

elements. The values are divided into ranges (or “bins”), and the elements within each bin

are counted.

The plot displays the ranges of values along the horizontal axis, and the number of elements

as a vertical bar for each bin.

Histograms are used to plot the data distribution, and often for density estimation: estimating the probability density function of the underlying variable. The total area of a histogram used for probability density is always normalized to 1. If the lengths of the intervals on the x-axis are all 1, then a histogram is identical to a relative frequency plot.
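The binning and the normalization to unit area described above can be sketched in a few lines. The function below is an illustration, not the program's own algorithm: it assumes equal-width bins spanning the minimum to the maximum of the data, with the maximum placed in the last bin.

```python
def histogram(values, n_bins):
    """Return bin edges, counts per bin, and densities with total area 1."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        raise ValueError("all values are identical; no bins can be formed")
    counts = [0] * n_bins
    for value in values:
        # The largest value would land just past the last bin, so clamp it.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    # Dividing by (number of elements * bin width) makes the bar areas sum to 1.
    densities = [count / (len(values) * width) for count in counts]
    return edges, counts, densities
```

When the bin width is 1, each density equals the relative frequency of its bin, which is the special case mentioned in the text.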

A statistics table can be added to the plot by clicking the button. This will print the

number of data elements as well as the distribution statistics Skewness (i.e. asymmetry),

Kurtosis (i.e. flatness), Mean, Variance and the Standard Deviation (SDev).
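The statistics listed above can be computed as in the following sketch. The exact estimators used by The Unscrambler® are not stated in this section, so the choices here are assumptions: variance with an n - 1 denominator, moment-based skewness, and kurtosis reported as excess kurtosis (0 for a normal distribution).

```python
import math


def distribution_statistics(values):
    """Return element count, mean, variance, SDev, skewness and kurtosis."""
    n = len(values)
    mean = sum(values) / n
    # Central moments with denominator n, used for the shape statistics.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return {
        "n": n,
        "mean": mean,
        "variance": variance,
        "sdev": math.sqrt(variance),
        "skewness": m3 / m2 ** 1.5,      # 0 for a symmetric series
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis (assumed convention)
    }
```

Values reported by the program may differ slightly if it applies small-sample corrections to skewness and kurtosis.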

It is possible to redefine the number of bins, to improve or reduce the smoothness of the

histogram, using the drop-down list Bars.


The histogram is one of the seven basic tools of quality control, which also include the

Pareto chart, check sheet, control chart, cause-and-effect diagram, flowchart, and scatter

diagram.

The normal probability plot is a graphical technique for normality testing: assessing whether

or not a data set is approximately normally distributed.

The data are plotted against a theoretical normal distribution in such a way that the points

should form an approximate straight line. Departures from this straight line indicate

departures from normality. Each element of the series is represented by a point. A label can

be displayed beside each point to identify the elements.

This type of plot enables a visual check of the probability distribution of the values.

Normal distribution


If the points are close to a straight line, the distribution is approximately normal

(Gaussian).

Normal probability plot showing a series following a Normal distribution

If most points are close to a straight line but a few extreme values (low or high) are

far away from the line, these points are outliers. In the example below sample 50

looks like an outlier.

Normal probability plot showing a series following Normal distribution with an

outlier

If the points are not close to a straight line, but determine another type of curve, or

clusters, the distribution is not normal.
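The coordinates behind a normal probability plot can be computed as follows. This is a sketch under stated assumptions: the plotting positions (i - 0.5)/n are one common convention and may not be the one used by The Unscrambler®, and the function name is hypothetical.

```python
from statistics import NormalDist


def normal_probability_points(values):
    """Pair each sorted value with its theoretical standard normal quantile."""
    n = len(values)
    ordered = sorted(values)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, value in enumerate(ordered):
        probability = (i + 0.5) / n  # assumed plotting position
        points.append((value, standard_normal.inv_cdf(probability)))
    return points
```

If the series is approximately normal, the resulting pairs fall close to a straight line; a point far from the line at either end corresponds to a possible outlier, as described above.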


This plot displays several scatter plots. A maximum of five variables at a time are used and

scatter plots for each pair of variables are shown above the diagonal. The variables are

indicated on the diagonal and can be changed from the list.

Multiple scatter plot structure

             Variable 1            Variable 2            Variable 3

Variable 1   Name of variable 1    Variable 1 and 2      Variable 1 and 3

Variable 2   1 and 2               Name of variable 2    Variable 2 and 3

Variable 3   1 and 3               2 and 3               Name of variable 3

The colors of the panels on the lower diagonal are an indicator of the correlation. Positive

correlation is indicated in shades of blue while negative values are shown in shades of red.

This plot helps in quickly identifying relationships between variables and allows one to

choose variables to examine in greater detail.

It is especially useful, for example, for detecting which variables are responsible for discriminating between sample groups.

Access the Multiple Scatter plot from the menu Plot - Multiple Scatter

Plot - Multiple Scatter menu


Multiple Scatter plot Scope

Once the variables are selected, click OK and the plot will appear in the viewer.

Multiple scatter plot

If more than four variables have been selected for the multiple scatter plot, others can be

displayed by choosing them from the drop-down list on the diagonal of the plots.

Variable drop-down list menu


A table plot is nothing more than results arranged in a tabular format, displayed in a

graphical interface which optionally allows for resizing and sorting the columns of the table.

Although it is not a “plot” as such, it allows tabulated results to be displayed in the same

viewer system as other plots.

Example of table plot: Table of Correlation

A few analysis results require this format, because it is the only way to get an

interpretable summary of complex results. A typical example is Analysis of Variance

(ANOVA); some of its individual results can be plotted separately as line plots, but

the only way to get a full overview is to study 4 or 5 columns of the table

simultaneously.

Standard graphical plots like line plots, 2-D scatter plots, matrix plots, etc. can be

displayed numerically to facilitate the exportation of the underlying numbers to

another graphical package, or a worksheet.

To do so, use the option View Numerical accessible in two ways: from a right click

on the plot and from the View menu.

View Numerical option from a Right click on the plot and from the View menu


This is an ad-hoc category which groups all plots that do not fit into any of the other

descriptions.

Some of these plots are an adaptation of existing plot types, with an additional

enhancement, while other plots have been developed to answer specific needs.

Mean and standard deviation plot

For instance, “Means” can be displayed as a line plot. However, it is often useful to include

standard deviations (SDev) in the same plot. The most relevant way to do so is to display the

Means as vertical bars and display SDev as an error bar on top of each Mean vertical bar.

This is what has been done in the special plot “Mean and SDev”.

Special plot: Mean and SDev
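The numbers behind such a plot can be sketched in a few lines. The function name and the input layout (one list of values per variable) are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption; the manual does not state which estimator is used.

```python
from statistics import mean, stdev


def mean_and_sdev(table):
    """Return {variable name: (mean, standard deviation)} for each variable."""
    # stdev() is the sample standard deviation (n - 1 denominator); assumed here.
    return {name: (mean(values), stdev(values)) for name, values in table.items()}
```

Each mean gives the height of a vertical bar and the matching standard deviation gives the length of the error bar drawn on top of it.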


This plot presents the levels of a design variable that have significantly different effects on a

response variable in a graphical way which gives an immediate overview.

Special plot: Multiple Comparisons

The Predicted with deviation plot shows the predicted value as well as the possible

deviation. It directly indicates how much trust to place in the results. The deviations

are estimated as a function of the global model error, the sample leverage, and the sample

residual X-variance. A large deviation indicates that the sample used for prediction is not

similar to the samples used to make the calibration model. This is a prediction outlier: check

its values for the X-variables.

Special plot: Predicted with deviation


In order to compare different results it can be useful to plot them in the same plot instead of

two separate plots.

Two separate plots


Access to Add Data…

To add data to a plot, it is necessary to access the Add Data… menu. This is

available when creating a custom layout. Begin by going to Insert - Custom Layout. When a

plot is displayed after formatting the custom layout, the Add Data option is accessible from

a right click on a plot displayed in the workspace.

Access Add Data… menu

The following dialog box opens.

Add Data… dialog box


Matrix

Use the drop-down list if the data are in a data matrix and use the select result

matrix button if the data are in an analysis result.

Rows and Cols

Use the drop-down list if the subset is already defined and use the Define button if it

has to be defined.

It is possible to customize a plot by adding text, lines and drawings to it.

To do this use the Draw toolbar:


In order to remove drawing objects from plots, you can use either the Edit - Undo option (or

toolbar button), or you can select the drawing object using the mouse pointer and click the

keyboard Delete button.

In an interactive analysis it can be very useful to mark some samples in e.g. a Scores plot to

create a new range. To do so, right click on the plot with the marked samples and select the

option Create Range

Create Range Dialog


Sample Selection : Select whether the marked or unmarked samples (or both)

should be extracted from the model, and give the ranges informative names. By

default the marked and unmarked sample ranges will be named Outliers and Good

Samples, respectively.

Create Range : The new range will be created based on one or more data tables

available in the project navigator. All data tables with the correct number of rows

will be listed in this frame. Use the radio buttons to define whether a new data table

should be created or if the ranges should be added to existing tables. As an

additional quality control it is possible to list only data tables with matching sample

names. A yellow warning sign next to a table indicates that the sample names are

missing or non-matching.

Mean and standard deviation, PCA scores, regression coefficients: all these results from

various types of analyses are originally expressed as numbers. Their numerical values are

useful, e.g. to compute predicted response values. However, numbers are seldom easy to

interpret as such.

Furthermore, the purpose of most of the methods implemented in The Unscrambler® is to

convert numerical data into information. It would be a pity if numbers were the only way to

express this information!

Thus visualization tools are provided for representation of the main results of the methods

available in The Unscrambler®. The best and most concrete way, the one which will help

one to get a real feeling for the results, is the following:

A plot!

Most often, a well-chosen picture conveys a message faster and more efficiently than a long

sentence, or a series of numbers. This also applies to raw data – displaying them in a smart

graphical way is already a big step towards understanding the information contained in

numerical data.

However, there are many different ways to plot the same numbers! The trick is to use the

most relevant one in each situation, so that the information which matters most is

emphasized by the graphical representation of the results.

Numbers arranged in a series or a table can have various types of relationships with each

other, or be related to external elements which are not explicitly represented by the

numbers themselves. Plotting is a way of seeing the structure. The chosen plot has to reflect

this internal organization, so as to give an insight into the structure and meaning of the

numerical results.

According to the possible cases of internal relationships between the series of numbers, The

Unscrambler® provides seven main types of plots for graphical representation of data:

Line plot

Bar plot

Scatter plot

3-D scatter plot

Matrix plot

Histograms


Multiple scatter plot

In addition, to cover a few special cases, two more kinds of representations are provided:

Table plot

Special plot

Formatting plot appearance

Adding text and drawings

Grouping samples

Plotting results from several matrices

Saving and copying a plot

A plot displays some information as points, bars or lines. Those items are displayed

according to their coordinates and values.

It is possible to access this information by pointing at the item. It is also possible to mark the

item for further use.

Specific plots for each analysis

When performing an analysis there are some plots that will summarize the information

better than others.

In The Unscrambler® there is a list of predefined plots for each analysis. This list can be

accessed through one of the following:

Navigator

A shortcut to the most important plots can be given in the Plots sub-node of a model

in the project navigator. The plots are displayed if the right-click model menu option

‘Show Plots’ is toggled on, and can be hidden by using the ‘Hide Plots’ option.

Plot node under a PCA analysis in the navigator


The plot menu changes for each analysis, providing an extensive list of the available

plots.

Plot menu specific to the PCA analysis

The plot menu there is named after the method, for example PCA, and it

provides the full list of available plots.

Plot menu from a right click on a plot from a PCA analysis


Interpreting plots

To get specific information on all the available plots for each analysis, see the specific Plot

sections under respective methods.

Design of Experiments

Descriptive statistics

Statistical tests

Principal Component Analysis (PCA)

Multiple Linear Regression (MLR)

Principal Components Regression (PCR)

Partial Least Squares Regression (PLS)

L-shaped PLS Regression (L-PLS)

Multivariate Curve Resolution (MCR)

Cluster analysis

Projection

SIMCA

Prediction

The objective with this function is to select subsets of samples to evenly cover the

multivariate space, as originally described by Kennard and Stone 1969. The starting point for

this option is a score plot. This document describes the functionality of the Kennard-Stone

Sample Selection dialog as implemented in The Unscrambler® X.
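For orientation, the classical max-min form of the Kennard-Stone procedure can be sketched as below. This is a generic illustration working on score coordinates with Euclidean distances; it is not taken from The Unscrambler® X, and the function name and input layout are assumptions.

```python
from math import dist


def kennard_stone(scores, n_select):
    """Return the indices of the selected samples, in order of selection."""
    n = len(scores)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist(scores[pair[0]], scores[pair[1]]),
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    while len(selected) < n_select:
        # For each candidate, find the distance to its nearest selected sample,
        # then pick the candidate for which that distance is largest.
        best = max(
            sorted(remaining),
            key=lambda i: min(dist(scores[i], scores[s]) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

With the one-dimensional scores `[(0, 0), (1, 0), (10, 0), (5, 0), (4.9, 0)]` and three samples requested, the sketch picks the two end points first and then the sample at 5, which fills the largest gap.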

User Dialog

The user dialog is found by right clicking in a score plot from PCA, PCR or PLS

regression, and then under the option Mark select Kennard-Stone Sample Selection.


It is also possible to enter the dialog from the icon in the Mark Toolbar

Kennard-Stone Sample Selection

Function and description of functionality

Number of samples

Number of calibration samples to select with the K-S algorithm. The default is 15.

Number of components

Here the number of components to use for selection is given. The default is the optimal number as found in the model.

Pre-Select samples - Include already marked samples

When selected, any marked samples in the score plot will be included in the calibration sample set in addition to what is identified with the K-S Sample selection.

Pre-Select samples - Manually pre-select samples

Opens the Select samples dialog window for selecting samples to be included in the calibration sample set from the data matrix.

Select validation samples

calibration samples will be created as a validation set using the Double Kennard-Stone sample selection algorithm.

Augment set with boxcar samples

Works only for PCR and PLSR models. When checked, the initial calibration set from K-S will be augmented with samples to produce a more uniform distribution of response values. Additional options are available for setting number of bins for boxcar samples and number of samples to select from the sample selection. This option will be disabled if Select validation samples is checked.

Create row set as new matrix

When selected, the samples will be extracted into a new matrix, with KS-Calibration and optionally KS-Validation row sets added.

Create row set in selected matrix(es)

When selected, Calibration and optionally Validation row sets will be added to selected, matching matrices.

Allow mis-matching samples names

While not checked, only matrices with identical sample names in the same order will be listed. An exclamation mark is shown for the matrices where the sample names do not match.

The figure below shows the score plot after specifying 15 samples for calibration and

validation. The calibration samples are marked with green rectangles and the validation

samples with orange triangles.

The score plot with marked calibration and validation samples

When the option to create the sample set in selected Matrices is chosen, the matrices will

be added in the project navigator as shown below:


If the option to Create row set as new matrix has been chosen, a matrix with the name of

the X matrix from the scores plot will be created with KS appended to the matrix name.

7.16. Marking

It is often useful to mark some samples or variables in a plot to:

Recalculate with modification on those samples or variables (Downweight, exclude,

include only)

There are several toolbar buttons available to mark a sample or a variable in a plot. The

Mark functions can also be accessed from the Edit - Mark menu, or by right-clicking in a plot

and selecting Mark

The Edit - Mark menu

One by one

This option enables one to use the cursor to select an item to mark by clicking on it.

Rectangular

This option allows several grouped samples to be selected at the same time. The

cursor is transformed into a pointer that will allow the user to define the top left

corner and the bottom right corner of the rectangle.

Samples marked with rectangle option


The different types of Markings can be accessed from Edit-Mark.. or from toolbar shortcuts.

Lasso

This option activates the cursor to be used to define a special area. All samples

inside the area will be marked. To define the area click on the contour of the area to

be defined and maintain the click while defining the contour of the area. When the

click is released the selection is done.

Samples marked with lasso

Automatically mark samples uniformly throughout the data.

For more information see the Select evenly distributed samples documentation.

Kennard-Stone Sample Selection…

Automatically mark representative samples using the Kennard-Stone sample

selection algorithm, or use the double Kennard-Stone to extract both calibration and

validation samples.

For more information see the Kennard-Stone sample selection documentation.


This option is available only if:

Uncertainty test was enabled.

Mark outliers

Add outliers to the current selection. These outliers are based on the warning limits

associated with a given analysis on the Warning Limits tab.

Unmark all

This option is used to remove a previous selection.

Reverse marking

When some items are selected in a plot and one would like to select the unselected

items, i.e. invert the current selection, the button Reverse marking can be used.

7.16.2 How to create a new range of samples or variables from the marked items

Once some samples / variables are selected in a plot it is possible to create a new range

including them. To do so right click on the plot with the selected items and select the option

Create Range.

Menu create range

For all raw data plots and for model plots of variables (e.g. PCA loadings), the new range

appears under the corresponding data table node with the default name “RowRange” or

“ColumnRange”.

New range created


When a sample range is created from within a model scores plot, a dialog is opened to allow

sample extraction into a new or existing data table. See the extract samples documentation

for details.

Once some samples / variables are selected in a plot it is possible to perform a new analysis

based on the same parameters as previously used, including a modification affecting the

selected samples or/and variables.

Select the analysis in the project navigator and right click. Select the Recalculate option.

Menu recalculate

With Marked…

This option allows the user to perform recalculation using the marked/selected

samples or variables for further analysis, the rest are kept out.

Without Marked…

The marked samples or/and variables are not included in the analysis, the

unselected samples or/and variables are.


The marked variables are downweighted. See more information about downweight.

The other variables keep their original weight.

With UnMarked Downweighted…

The unmarked variables are downweighted. See more information about

downweight. The other variables keep their original weight.

With New Data

Additional data can be added to an analysis using this option. This will open a new

dialog from which the new data are selected. These new data can be appended to

the original data or original data in the matrix can be overwritten for the new

analysis.

Add data set

In addition to the general information available about the whole plot, one may also display

specific details regarding one particular point. This is done as follows:

Rest the cursor close to a data point: the point number is displayed.

Click on the point: a small box containing point number, point name and point

coordinates is displayed as shown in the figure below.

Point details


All the plots can be customized. This is done from the properties dialog which is accessed by

a right click on the plot and the selection of the Properties menu.

When selecting the Properties menu, the Plot properties dialog appears.

Each of the following items can be modified:

Axis X and its gridlines

The visibility, the title with its font and position, the scale - both its appearance

(logarithmic or reversed) and its labels - and origin can be modified on the X axis.

The axis label rotation can also be set in this menu.

Properties Axis X


Access to the same possibilities as the Axis X and its gridlines.

Appearance

Five different items can be customized from this menu:

Background

Header: title, color, font, visibility, color of the background

Legend: title, color, font, visibility, color of the background

Point Label: color, font, visibility

Axis Label: title, color, font, visibility, borders

Properties Appearance

For the Point Label and Axis Label the text can be edited. One can customize the

name, such as only having part of the name displayed. For this option use the drop-

down list in Label layout - Show.

Properties: Point Label


Graphic Objects

It is possible to include some graphical objects in the plot such as line, arrow,

rectangle, ellipse and text. Each of those objects can be configured in terms of color,

thickness and font if necessary.

Properties Appearance

Chart properties

It is possible to further customize the chart properties by selecting More, which will

open up the Chart properties dialogue. Here one can define simple or complex chart

types from the options in the chart gallery. Further selection of chart properties can

be made, and the chart previewed.

Chart Properties


All the plots can be customized. This is done from the properties dialog which is accessed by

a right click on the plot and the selection of the Properties menu.

When selecting the Properties menu, the Plot properties dialog appears.

Each of the following items can be modified:

Axis X, its gridlines and axis labels

The visibility, the title with its font and position, the scale - both its appearance

(logarithmic or reversed) and its labels - and origin can be modified on the X axis.

The axis label rotation can also be set in this menu.

Properties Axis X


Access to the same possibilities as the Axis X and its gridlines.

Appearance

Three different items can be customized from this menu:

Background

Header: title, color, font, visibility, color of the background

Legend: title, color, font, visibility, color of the background

Plot Area: Chart area, color, font, visibility, borders, surface

Properties Appearance

For the Header and Legend the text can be edited. One can customize the name,

such as only having part of the name displayed, the font and the color.


Graphic Objects

It is possible to include some graphical objects in the plot such as line, arrow,

rectangle, ellipse and text. Each of those objects can be configured in terms of color,

thickness and font if necessary.

Properties Graphic Objects

Chart properties

It is possible to further customize the chart properties by selecting More, which will

open up the 3D Chart properties dialogue. Here one can define the chart types from

the options in the chart gallery.

Chart Properties


Additional options of a 3-D plot can be changed from the tab in the properties dialog. In the

Data tab, the layout of the data can be changed.

3-D Scatter plot data properties dialog

The rotation, perspective, and axis scales can be changed under the 3-D view tab.

3-D Scatter plot 3-D view properties dialog


This dialog opens when clicking on the predefined plot “Response Surface” or when clicking

in the Plot - Response Surface menu when regression results are opened.

It contains four fields:

Y Variable

This is the response variable to be plotted. Use the drop-down list to select one.

Factor

This is only for PLS and PCR but not for MLR. Select the optimal number of factors to

be used. This affects the Beta coefficients and thus the response surface.

X Variable - 1

The predictor variable to be used in the first direction.

X Variable - 2

The predictor variable to be used in the second direction.
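How the regression coefficients shape the surface can be illustrated with a small sketch that evaluates a model on a grid of the two selected X variables. The model form (intercept, two linear terms and an optional interaction term), the function name and the coefficient values used in the usage note are all hypothetical; the manual does not specify the model behind the plot, and any other predictors are assumed to be held fixed and absorbed into the intercept.

```python
def response_surface(b0, b1, b2, x1_values, x2_values, b12=0.0):
    """Predicted response on a grid: y = b0 + b1*x1 + b2*x2 + b12*x1*x2.

    Returns one row per value of x2 and one column per value of x1.
    """
    return [
        [b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2 for x1 in x1_values]
        for x2 in x2_values
    ]
```

Calling `response_surface(1.0, 2.0, 3.0, [0, 1], [0, 1], b12=0.5)` returns a 2 x 2 grid of predicted values. Changing the number of factors in a PLS or PCR model changes the coefficients, and therefore every value on this grid.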


Access Save Plot… menu

A plot can be saved from the Save Plot… menu. It is accessible from a right click on a plot

displayed in the workspace.

Save Plot… menu

The following dialog box opens.

Save As… dialog box

Select where the plot should be stored in the field Save in.


Enter a name for the plot in the field File name and select a format.

Types of format

There are six possible graphics file formats available for compatibility with many needs:

EMF

Use the EMF format which is vector graphics whenever possible. Vector graphics can

be scaled and will give the best quality.

Compatibility: EMF support is often limited to Microsoft applications. When sending

the plot graphics file for instance by email, a recipient may encounter problems

viewing and reusing it.

PNG

The second choice is PNG, which is raster graphics, and does not look as good when

enlarged.

This format is most suitable for web publishing and email.

This will generally result in smaller files than the following formats.

Compatibility: 5-10 year old applications may not support this image format.

Select one of the above formats. The following formats are also raster graphics, each having

its limitations. They are included only for compatibility.

GIF

Limited to 256 colors.

JPEG

Lossy compression that will give artifacts. (JPEG is best suited for photographic

images.)

TIFF

Will produce larger files.

BMP

Will produce larger files.

Available image formats

It is possible to copy either one plot or all plots displayed in the workspace.

Copy one plot

The Copy menu is available from two places:

From right click on a plot

Right click on a plot and select Copy.

Copy from right click


Go to the Edit menu and select Copy.

Copy from Edit menu

The shortcut Ctrl+C is a fast way to copy a plot.

Copy all plots

The Copy All menu is also accessible from a right click on a plot displayed in the workspace.

After pasting, the plots that were displayed on the workspace will be shown without

borders.


Pasting plots

Depending on the application to be used there may be different options such as the shortcut

Ctrl+V or from an Edit menu.

When creating a plot, it is necessary to define the scope of the plot in terms of:

Samples (row range),

Variables (column range).

A common dialog appears when selecting any of the plotting options from Plot:

Line

Bar

3D Scatter

Matrix

Histogram

Normal Probability

Multiple Scatter

Define the row and column ranges from predefined ranges using the drop-down list.

To use new ranges, click on the icon that looks like a matrix to access a matrix from the project

navigator, and on Define to access the Define Range dialog.

Plot scope dialog


To use data that are part of a results matrix, use the select result matrix button to

choose the desired results matrix.

This tool allows users to automatically select a representative subset of the samples in any

plot of samples. The selection can be used to create a range.

Evenly distributed samples dialog

Min/Max

Selects the samples most separated in the data set.

A number of extreme samples will be picked out for each PC, according to the

specification in the right column in the table below the method choice. It will be

labeled Number of min/max, and for each min/max selected, two extreme samples

are marked (max and min value). Thus, setting the number to 2 will mark a total of

four samples.

Classes

The samples will be divided into a number of classes for each PC. One pair of

extreme samples (max and min value) will be picked out for each PC, according to a

user’s specification in the right column in the list below the Methods field. It will be


labeled Number of classes, and for each class, two extreme samples are marked.

Thus, setting the number to 2 will mark a total of four samples.

Then, in the list below the method choice, specify the number of PCs (listed in the left

column) for which to mark samples, and how many (listed in the right column). No samples

are marked for PCs with 0 in the right column, i.e., in the above figure, only PC 1 is marked.
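The Min/Max rule described above can be sketched as follows. The function name and the data layout (one row of scores per sample, one count per PC) are hypothetical, and the sketch only follows the description in the text: for a count of k on a given PC, the k lowest and the k highest scoring samples are marked.

```python
def mark_min_max(scores, counts):
    """Return the sorted indices of samples marked by the Min/Max rule.

    scores: one row per sample, one column per PC.
    counts: number of min/max to mark for each PC (0 means the PC is skipped).
    """
    marked = set()
    for pc, k in enumerate(counts):
        if k == 0:
            continue  # no samples are marked for PCs with 0
        order = sorted(range(len(scores)), key=lambda i: scores[i][pc])
        marked.update(order[:k])   # k lowest scores on this PC
        marked.update(order[-k:])  # k highest scores on this PC
    return sorted(marked)
```

With a count of 2 on PC 1 and 0 on PC 2, four samples are marked, as stated in the text. A sample that is extreme on several PCs is only marked once, so the total can be smaller than the sum over PCs.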

When a plot is displayed in the view pane, it is possible to modify this view by several scaling

options:

Full-screen

To view a plot in full-screen mode select it by clicking on it and use the Full-screen

button .

The plot will be expanded in full-screen mode. To come back to the usual view in the

view pane, right click on the expanded plot.

Zoom-in

To zoom in on a displayed plot, the zoom-in being done in the center area, there are

two options:

Zoom-out

To zoom out of a displayed plot, the zoom-out being done from the center area, there

are two options:

Frame-scale

To zoom in a special area it is more convenient to define the area to zoom-in with a

rectangle. To access this functionality use the Frame-scale button .

A cross will appear, which is to be used to define the area to zoom into. A dotted

rectangle will appear around the defined frame and when releasing, the zoom will

be performed.

Defining the frame to zoom-in


Move

It is possible to move inside the plot itself. To do so use the keyboard: Ctrl+Shift.

Auto-scale

To come back to the original view of the plot defined by The Unscrambler® use the

Auto-scale button

For Matrix and 3D-Scatter there are two ways to zoom-in:

Using the mouse wheel, will zoom the points and bars within the cube

Using Ctrl+Left mouse drag up and down, will zoom the cube itself

From the viewer one can drag the four-pin view to other sizes by choosing the center + sign

to view.


8. Design of Experiments

8.1. Experimental design

Experimental design is a strategy to gather empirical knowledge, i.e. knowledge based on

the analysis of experimental data and not on theoretical models. It can be applied when

investigating a phenomenon in order to gain understanding or improve performance.

Building a design means carefully choosing a small number of experiments that are to be

performed under controlled conditions.

Learn about the concepts and methods of experimental design in the Introduction to Design

of Experiments section.

Learn how to use the Design of Experiments tools offered by The Unscrambler®:

Modify or extend an existing design using Tools – Modify/Extend Design…

Analyze the experimental results using Tasks – Analyze – Analyze Design Matrix…

Interpret the analytical results

The aim of multivariate data analysis is to extract the maximum amount of information from

a data table. The data can be collected from various sources or designed with a specific

purpose in mind.

DoE basics

Why use experimental design?

What is experimental design?

Investigation stages and design objectives

Screening

Factor Influence Study

Optimization

Available designs in The Unscrambler®

Types of variables in experimental design

Design vs. non-design variables

Continuous vs. category variables

Mixture variables

Process variables

Designs for unconstrained screening situations

Full-factorial designs

Fractional-factorial designs

Plackett-Burman designs

Designs for unconstrained optimization situations

Central composite designs

Box-Behnken designs

Designs for constrained situations

Mixture designs

Axial designs: Screening of mixture components

Simplex-centroid designs: Optimization of mixtures


D-optimal designs

Designs with simple linear constraints

Non-simplex mixture designs

Process/mixture designs

Types of samples in experimental design

Sample order in a design

Blocking

Extending a design

Building an efficient experimental strategy

Analyze results from designed experiments

Simple data checks and graphical analysis

Analysis Of Variance (ANOVA)

Checking the adequacy of the model

Analysis of effects using classical methods

Response surface analysis using classical methods

Limitations of ANOVA

Analysis with PLS Regression

When data are missing or experimental conditions have not been reached

Advanced topics for unconstrained situations

Advanced topics for constrained situations

Why use experimental design?

When collecting new data for multivariate modeling, one should pay attention to the

following criteria:

Focusing: collect only the information that is really needed.

Obtain historical data (from a database, from plant records, etc.). However such

data may be biased by changes occurring during the period between acquisition and

analysis. It is anyhow a good start to get some general trends and ideas.

Collect new data: record measurements directly from the production line, for

example, make observations in fish farms, process development lab, formulation

lab, etc. This will ensure that the data apply to the system being studied today (not

another system, three years ago). However most processes tend to be kept under

tight control and variation is minimal. This may lead to problems finding enough

variability to develop a model.

Run specific experiments by disturbing (exciting) the system being studied. Thus the

data will encompass more variation than is to be naturally expected in a stable

system running as usual.

Design experiments in a structured, mathematical way. By choosing symmetrical

ranges of variation and applying this variation in a balanced way among the

variables being studied, one will end up with data where effects can be studied in a


simple and powerful way. With designed experiments there is a better possibility of

testing the significance of the effects and the relevance of the whole model.

data analysis because it generates “structured” data tables, i.e. data tables that contain an

important amount of structured variation. This underlying structure will then be used as a

basis for multivariate modeling, which will guarantee stable and robust models.

More generally, careful sample selection increases the chances of extracting useful

information from the data. When one has the possibility to actively perturb the system

(experiment with the variables), these chances become even greater. The critical part is to

decide which variables to change, the intervals for this variation, and the pattern of the

experimental points.

What is experimental design?

Experimental design is a strategy to gather empirical knowledge, i.e. knowledge based on

the analysis of experimental data and not on theoretical models. It can be applied when

investigating a phenomenon in order to gain understanding or improve performance.

Building a design means carefully choosing a small number of experiments that are to be

performed under controlled conditions. There are four interrelated steps in building a

design:

Define the objective of the investigation: e.g. “better understand” or “sort out

important variables” or “find the optimum conditions”.

Define the variables that will be controlled during the experiment (design variables),

and their levels or ranges of variation.

Define the variables that will be measured to describe the outcome of the

experimental runs (response variables), and examine their precision.

Choose among the available standard designs the one that is compatible with the

objective, number of design variables and precision of measurements, and has a

reasonable cost.

Most of the standard experimental designs can be generated in The Unscrambler® once the

experimental objective, the number (and nature) of the design variables, the nature of the

responses and the economical number of experimental runs have been defined. Generating

such a design will provide the user with the list of all experiments to be performed in order

to gather the required information to meet the objectives.

Depending on the stage of the investigation, the amount of information to be collected and

the resources that are available to achieve the goal, it is important to choose an adequate

design among those available in The Unscrambler®. The following describes the most

common standard designs for dealing with the various data types and situations described

above.

Screening

When starting a new investigation or a new product development, there is usually a large

number of potentially important variables. At this stage, the main objective of the

experimental work is to find out which are the most important variables. This is achieved by

including many variables in the design, and roughly estimating the effect of each design

variable on the responses with the help of a screening design. The variables which have

“large” effects can be considered as important. The isolated effects of single variables are

known as main effects and the purpose of screening designs is to isolate these only. There

are several ways to judge the importance of a main effect, for instance significance testing or

use of a normal probability plot of effects.

Some screening designs are capable of estimating interaction effects. These occur when the

effect of changing one variable depends on the level of other variables in the study. Some

variables may be important even though they do not seem to have an impact on the

response by themselves. The reason is that the presence of interaction effects may mask

otherwise significant main effects.
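
The masking described above can be illustrated with a small sketch. The two-variable design, the response formula and the helper `effect` are hypothetical and chosen only for illustration: the response depends purely on the A*B interaction, so both main effects come out as zero even though both variables matter.

```python
import itertools

# Hypothetical 2x2 full factorial in coded units (-1 = low, +1 = high).
runs = list(itertools.product([-1, 1], repeat=2))

# Hypothetical response driven purely by the A*B interaction.
response = [10 + 3 * a * b for a, b in runs]

def effect(contrast, y):
    """Mean response where contrast is +1 minus mean response where it is -1."""
    high = [yi for c, yi in zip(contrast, y) if c == 1]
    low = [yi for c, yi in zip(contrast, y) if c == -1]
    return sum(high) / len(high) - sum(low) / len(low)

effect_a = effect([a for a, _ in runs], response)
effect_b = effect([b for _, b in runs], response)
effect_ab = effect([a * b for a, b in runs], response)

print(effect_a, effect_b, effect_ab)
```

A design that estimates main effects only would wrongly classify both variables in this example as unimportant.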

Models for screening designs

The user must choose the adequate form of the model that relates response variations to

variations in the design variables. This will depend on how precisely one wants to screen the

potentially influential variables and describe how they affect the responses. The

Unscrambler® contains two standard choices:

The simplest form is a linear model. Choosing a linear model will allow one to

investigate main effects only, with a possible check for curvature effects;

To study the possible interactions between several design variables, one will have to

include interaction effects in the model in addition to the linear effects.

When building a mixture or D-optimal design, one must choose a model form explicitly,

because the adequate type of design depends on this choice. For other types of designs, the

model choice is implicit in the design that has been selected.

Factor Influence Study

After an initial screening design has been performed and a number of important variables

have been isolated, a Factor Influence study can be performed using full factorial, or high

resolution fractional factorial designs. These are used to further study the main effects of

the variables, but also, they are used to investigate interactions of various orders: two factor

interactions involve two design variables, three factor interactions involve three variables

etc. The importance of an interaction can be assessed with the same tools as for main

effects.

Design variables that have an important main effect are important variables. Variables that

participate in an important interaction, even if their main effects are negligible, are also

important variables. The models generated in a factor influence study usually perform well

as predictive models and form the basis for optimization designs.

Optimization

At a later stage of investigation, when the variables that are important are already known,

one may wish to study the effects of these variables in more detail. Such a purpose will be

referred to as optimization. At the analysis stage this is also referred to as response surface

modeling.

Objectives of optimization

Optimization designs actually cover quite a wide range of objectives. They are particularly

useful in the following cases:

Maximizing a single response, i.e. to find out which combination of design variable

levels leads to the maximum value of a specific response, and what this maximum

response is.

Minimizing a single response, i.e. to find out which combination of design variable

levels leads to the minimum value of a specific response, and what this minimum is.

Finding a stable region, i.e. to find out which combination of design variable levels

corresponds to a specific target response, with the added criterion that small

deviations from those settings would cause negligible change in the response value.

Finding a compromise between several responses, i.e. to find out which combination

of design variable levels leads to the best compromise between several responses.

Describing response variations, i.e. to model response variations inside the

experimental region as precisely as possible in order to predict what will happen if

the settings of some design variables were changed in the future.

Models for optimization designs

The underlying idea of optimization designs is that the model should be able to describe a

response surface which has a minimum or a maximum inside the experimental range. To

achieve that purpose, linear and interaction effects are not sufficient. An optimization model

should also include quadratic effects, i.e. square effects, which describe the curvature of a

surface.

A model that includes linear, interaction and quadratic effects is called a quadratic model.
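
As a reference, a quadratic model in k design variables is commonly written as follows. The symbols are notation assumed here, not taken from the manual: y is the response, x_i are the design variables in coded units, the b terms are regression coefficients, and e is the residual.

```latex
y = b_0 + \sum_{i=1}^{k} b_i x_i
        + \sum_{i<j} b_{ij} x_i x_j
        + \sum_{i=1}^{k} b_{ii} x_i^2 + e
```

The first sum holds the linear effects, the second the two-factor interactions, and the third the quadratic (square) effects that describe curvature.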

The designs with their fields of application and the allowed number of design variables are

listed below.

Available types of experimental design

| Type of design | Screening | Factor influence | Optimization | Field of use | Number of design variables |
|---|---|---|---|---|---|
| Full Factorial Design | X | X | | […] low number of design variables independently from each other, including interaction terms. The only design that allows for categorical variables with 3 or more levels | 2 - 9 |
| Fractional Factorial Design | X | X | | Depending on the number of variables, choose to study lower order effects independently from each other, or create a screening design aimed at finding the most important main effects among many | 3 - 13 |
| Plackett-Burman Design | X | | | […] to fractional factorial […] effects only. Complex […] interaction effects | 8 - 35 |
| Central Composite Design | | | X | […] levels of the design variables by adding a few more experiments to a full factorial design. All design variables must be continuous | 2 - 6 |
| Box-Behnken Design | | | X | An alternative to central composite designs, when the optimum response is not located at the extremes of the experimental region and when previous results from a factorial design are not available. All design variables must be continuous | 3 - 6 |
| D-Optimal Design | X | X | X | […] have multilinear constraints, and design is not orthogonal. Analysis usually by Partial Least Squares Regression | 2 - 9 |
| Axial (Mixture) Design | X | | | Contains mixture variables only, design region is simplex. Only linear (first order) effects can be found | 3 - 20 |
| Simplex-Lattice (Mixture) Design | X | X | X | Contains mixture variables only, design region is simplex. Tuneable lattice degree (order) | 3 - 6 (9 if linear only) |
| Simplex-Centroid (Mixture) Design | | | X | Contains mixture variables only, design region is simplex | 3 - 6 |

A D-Optimal design will be used with mixture variables if the experimental region is not a

simplex, or if there is a combination of mixture and process variables in the design. The

design region is often non-simplex when upper limit constraints are added to some of the

mixture components.

This section introduces the nomenclature of variable types used in The Unscrambler®. Most

of these names are commonly used in the standard literature on experimental design;

however, the use made of these names may differ somewhat between software packages or

fields. Therefore it is recommended that the user reads this section before proceeding to

more details about the various types of designs.

Design vs. non-design variables

In The Unscrambler®, all variables appearing in the context of designed experiments can be

categorized as either design or non-design variables.

Design variables

Performing designed experiments is based on controlling the variations of the variables that

are being investigated to study their effects. Such variables with controlled variations are

called design variables, or factors.

In The Unscrambler®, a design variable is completely defined by:

Its name;

Its type: continuous or category;

Its constraints: mixture, linear;

Its levels.

Response variables

Response variables are one type of non-design variable: they are the measured output variables

that describe the outcome (usually a quality attribute) of the experiments. These variables are

often the target of an optimization.

Non-controllable variables

This second type of non-design variable refers to variables that can be monitored and may

have an influence on the response variables, but that cannot be controlled or reliably fixed at

a given value. Examples are the air humidity or the temperature in a plant.

Continuous vs. category variables

All variables have a pre-defined format or data type, and this format defines how the

variables are treated numerically and how they should be interpreted.

Continuous variables

All variables that have numerical values and that can be measured quantitatively are called

continuous variables. Note that this definition also covers discrete quantitative variables,

such as counts. It reflects the implicit use which is made of these variables, namely the

modeling of their variations using continuous functions.

Examples of continuous variables are: temperature, concentrations of ingredients (e.g. in %),

pH, length (e.g. in mm), age (e.g. in years), number of failures in one year, etc.

The variations of continuous design variables are usually set within a predefined range,

which goes from a lower level to an upper level. Those two levels have to be specified when

defining a continuous design variable. More levels between the extremes may be specified if

the values are to be studied more specifically.

If only two levels are specified, the other necessary levels will be computed automatically.

This applies to center samples (which use a mid-level, halfway between lower and upper),

and axial (star) samples in optimization designs (which use extreme levels outside the

predefined range).

Category variables

In The Unscrambler®, all non-continuous variables are called category variables. Their levels

can be named, but not measured quantitatively. Examples of category variables are: color

(Blue, Red, Green), type of catalyst (A, B, C, D), place of origin (Africa, The Caribbean Islands,

…), etc.

Binary variables are a special type of category variables that have only two levels

(sometimes referred to as dichotomous). Examples of binary variables are: use of a catalyst

(Yes/No), recipe (New/Old), type of electric power (AC/DC), type of sweetener (Artificial/

Natural), etc.

For each category variable, the user must specify all levels. The number of levels can range

from 2 to 20.

Note: Since there is a kind of quantum jump from one level to another (there is no

intermediate level in between), center samples cannot be defined for category

variables. If there is a mix of category and continuous variables in the design, center

samples are defined for all continuous variables at each level of the category

variables.

Mixture variables

When performing experiments where some ingredients are mixed according to a recipe, one

may be in a situation where the amounts of the various ingredients cannot be varied

independently from each other. In such a case, one will need to use a special kind of design

called a Mixture design, and the design variables are called mixture variables (or mixture

components).

An example of a mixture situation is blending concrete from the following three ingredients:

cement, sand and water. If the percentage of water in the blend is increased by 10%, the

proportions of one of the other ingredients (or both) will have to be reduced so that the

blend still amounts to 100%.

However, there are many situations where ingredients are blended, which do not require a

mixture design. For instance in a water solution of four ingredients whose proportions do

not exceed a few percent, one may vary the four ingredients independently from each other

and just add water at the end as a “filler”. Therefore it is important to carefully consider the

experimental situation before deciding whether the recipe being followed requires a mixture

design or not!

Process variables

In a mixture situation, one may also want to investigate the effects of variations in some

other design variables which are not themselves a component of the mixture. Such variables

are called process variables in The Unscrambler®, and these are analyzed using a D-optimal

design.

Typical process variables are: temperature, stirring rate, type of solvent, amount of catalyst,

etc.

The Unscrambler® provides three classical types of screening designs for unconstrained

situations:

Full-factorial designs for a number of design variables usually between 2 and 5

(maximum 9); the design variables may be two-level continuous or category with 2

to 20 levels.

Fractional-factorial designs for any number of two-level design variables (continuous

or category) between 3 and 13.

Plackett-Burman designs for any number of two-level design variables (continuous

or category) between 8 and 35.

Full-factorial designs

Full-factorial designs combine all defined levels of all design variables. For instance, a full-

factorial design investigating one two-level continuous variable, one three-level category

variable and one four-level category variable will include 2x3x4=24 experiments (excluding

center points).

Among other properties, full-factorial designs are perfectly balanced, i.e. each level of every

design variable is studied an equal number of times in combination with every level of the

other design variables.
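
The 2x3x4 example and the balance property can be checked with a short sketch. The variable names and level labels below are hypothetical; only the level counts come from the text.

```python
import itertools
from collections import Counter

# Hypothetical variables: names and levels are illustrative only.
levels = {
    "temperature": [20, 40],             # two-level continuous variable
    "catalyst": ["A", "B", "C"],         # three-level category variable
    "supplier": ["W", "X", "Y", "Z"],    # four-level category variable
}

# A full factorial combines every level of every variable.
design = list(itertools.product(*levels.values()))
print(len(design))  # 24 = 2 x 3 x 4

# Balance: every (temperature, catalyst) pair occurs equally often.
pair_counts = Counter((t, c) for t, c, _ in design)
print(set(pair_counts.values()))  # {4}
```

The same count check could be repeated for the other pairs of variables; center points are left out, as in the text.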

Full-factorial designs include enough experiments to allow use of a model with all

interactions included. This can be very beneficial if the number of design variables is low,

however it comes at the price of having to perform a high number of experiments if more

than a few variables are included. In this case, a fractional factorial design should be

considered.

Note: In theory a full factorial design can accommodate any number of levels also

for continuous variables, and such a design could be used for optimization. Because

central composite and Box-Behnken designs are much more economical than a 3-

level (or higher) full-factorial design, only two levels are allowed for continuous

variable factorial designs in The Unscrambler.

Fractional-factorial designs

In the specific case where there are only two-level variables (continuous with lower and

upper levels, and/or binary variables), one can define fractions of full factorial designs that

enable the investigation of as many design variables as the chosen full-factorial designs with

fewer experiments. These “economic” designs are called fractional factorial designs.

Given that a full-factorial design suitable for the investigation has already been defined, a

fractional design might be set up by selecting half the experimental runs of the original

design. For instance, one might try to study the effects of three design variables with only 4

($2^{3-1}$) instead of 8 ($2^3$) experiments. Larger factorial designs admit fractional designs with a

higher degree of fractionality, i.e. even more economical designs, such as investigating nine

design variables with only 16 ($2^{9-5}$) experiments instead of 512 ($2^9$). Such a design can be

referred to as a $2^{9-5}$ fractional design; its degree of fractionality is 5. This means that one

investigates nine variables at the usual cost of four (thus saving the cost of five).

In order to better understand the principles of fractionality, the example below shows how a

fractional factorial is built in a concrete case: computing the half-fraction of a full

factorial with four variables ($2^{4-1}$).

In the following tables, the design variables are named A, B, C, D, and their lower and upper

levels are coded – and +, respectively.

First, the full factorial design is built with only variables A, B, C (2³), as shown below:

Full-factorial design 2³

Experiment A B C

1 – – –

2 + – –

3 – + –

4 + + –

5 – – +

6 + – +

7 – + +

8 + + +

In the table below additional columns are generated, which are computed from the products

of the original three columns A, B, C. These additional columns represent the interactions

between the design variables.

Full-factorial design 2³ with interaction columns

Experiment A B C AB AC BC ABC

1 – – – + + + –

2 + – – – – + +

3 – + – – + – +

4 + + – + – – –

5 – – + + – – +

6 + – + – + – –

7 – + + – – + –

8 + + + + + + +

The above design table is an example of an orthogonal table, i.e. the effects of all columns

(main effects and interactions) can be estimated independently of one another.
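
The interaction columns and the orthogonality property can be reproduced with a minimal sketch, assuming the coding -1 for the lower level and +1 for the upper level. The helper `dot` is introduced here for illustration only.

```python
import itertools

# 2^3 full factorial in coded units; A varies fastest, as in the table above.
runs = [(a, b, c) for c, b, a in itertools.product([-1, 1], repeat=3)]

# Main-effect and interaction columns, built as element-wise products.
columns = {
    "A": [a for a, b, c in runs],
    "B": [b for a, b, c in runs],
    "C": [c for a, b, c in runs],
    "AB": [a * b for a, b, c in runs],
    "AC": [a * c for a, b, c in runs],
    "BC": [b * c for a, b, c in runs],
    "ABC": [a * b * c for a, b, c in runs],
}

def dot(u, v):
    """Sum of element-wise products of two columns."""
    return sum(x * y for x, y in zip(u, v))

# Orthogonality: the dot product of any two distinct columns is zero.
orthogonal = all(
    dot(columns[p], columns[q]) == 0
    for p, q in itertools.combinations(columns, 2)
)
print(orthogonal)
```

A zero dot product between two columns means that the estimate of one effect is not influenced by the other.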

In the table below, the column representing the highest degree of interaction (the ABC

interaction) is assigned to the variable, D, as it is assumed that the ABC interaction is

negligible:

Fractional factorial design $2^{4-1}$

Experiment A B C D

1 – – – –

2 + – – +

3 – + – +

4 + + – –

5 – – + +

6 + – + –

7 – + + –

8 + + + +

This new design allows the main effects of the four design variables to be studied

independently of each other; but what about their interactions? The table below shows all

of the two-factor interactions calculated after setting D = ABC.

Fractional-factorial design $2^{4-1}$ with interaction columns

Experiment A B C D AB = CD AC = BD BC = AD

1 – – – – + + +

2 + – – + – – +

3 – + – + – + –

4 + + – – + – –

5 – – + + + – –

6 + – + – – + –

7 – + + – – – +

8 + + + + + + +

This table shows that each of the last three columns is shared by two different interactions

(for instance, AB and CD share the same column).
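
The construction with D = ABC and the shared interaction columns can be sketched as follows, again assuming -1/+1 coding. The helper `column` is a hypothetical name used only in this example.

```python
import itertools

# Half-fraction 2^(4-1): start from the 2^3 design in A, B, C (A fastest)
# and generate the fourth column from the generator D = ABC.
design = [
    (a, b, c, a * b * c)
    for c, b, a in itertools.product([-1, 1], repeat=3)
]

def column(word):
    """Interaction column for a word such as 'AB': product of its factors."""
    index = {"A": 0, "B": 1, "C": 2, "D": 3}
    result = []
    for run in design:
        value = 1
        for letter in word:
            value *= run[index[letter]]
        result.append(value)
    return result

# Each pair of two-factor interactions below shares one column.
for left, right in [("AB", "CD"), ("AC", "BD"), ("BC", "AD")]:
    print(left, right, column(left) == column(right))
```

Because the two columns in each pair are identical, no analysis of these eight runs can separate the two interactions.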

Confounding

Unfortunately, as the above example shows, there is a price to be paid for saving on the

experimental costs! “He who invests less, will also harvest less”.

In the case of fractional factorial designs, this means that if one does not use the full-

factorial set of experiments, it is not possible to study the interactions as well as the main

effects of all design variables. This happens because of the way those fractions are built,

using some of the resources that would otherwise have been devoted to the study of

interactions, to study main effects of more variables instead.

This side effect of using fractional designs is called confounding. Confounding means that

some effects cannot be studied independently of each other.

For instance, in the above example, each two-factor interaction is confounded with another

two-factor interaction. The practical consequences are the following:

All main effects can be studied independently of each other, and independently of

the interactions;

If the objective is to study the interactions themselves, using this specific design will

only enable one to detect whether at least one of the confounded interactions is

important. The experiments will not allow one to decide which are the important

ones. For instance, if AB (confounded with CD, “AB=CD”) turns out as significant, one

will not know whether AB or CD (or a combination of both) is responsible for the

observed effect.

The list of confounded effects is called the confounding pattern of the design.
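
A confounding pattern can be derived from the defining relation of the design. The sketch below assumes the half-fraction built with D = ABC, whose defining relation is I = ABCD; the function `multiply` is a hypothetical helper, not part of The Unscrambler®.

```python
def multiply(word_a, word_b):
    """Product of two effects: letters appearing in both cancel (X*X = I)."""
    return "".join(sorted(set(word_a) ^ set(word_b)))

# Defining relation of the 2^(4-1) design built with D = ABC: I = ABCD.
defining_word = "ABCD"

effects = ["A", "B", "C", "D", "AB", "AC", "AD"]
pattern = {effect: multiply(effect, defining_word) for effect in effects}
for effect, alias in pattern.items():
    print(f"{effect} = {alias}")
```

Each main effect is paired with a three-factor interaction and each two-factor interaction with another two-factor interaction, which matches the shared columns in the table above.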

Resolution of a fractional factorial design

How well a fractional-factorial design avoids confounding is expressed through its resolution.

The three most common cases are as follows:

Resolution III designs: Main effects are confounded with two-factor interactions.

Resolution IV designs: Main effects are free of confounding with two-factor

interactions, but two-factor interactions are confounded with each other.

Resolution V designs: Main effects and two-factor interactions are free of

confounding with each other, however some two-factor interactions are

confounded with three-factor interactions.

Definition: In a resolution R design, effects of order k are free of confounding with all effects

of order less than R-k.
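
Equivalently, the resolution is the length of the shortest word in the defining relation. The following sketch computes it from a list of generator words; the function names and the second example (D = AB, E = AC) are assumptions made for illustration.

```python
from itertools import combinations

def multiply(word_a, word_b):
    """Product of two effects: letters appearing in both cancel (X*X = I)."""
    return "".join(sorted(set(word_a) ^ set(word_b)))

def resolution(generator_words):
    """Length of the shortest word in the full defining relation."""
    words = set()
    for size in range(1, len(generator_words) + 1):
        for subset in combinations(generator_words, size):
            product = ""
            for word in subset:
                product = multiply(product, word)
            words.add(product)
    return min(len(word) for word in words)

print(resolution(["ABCD"]))        # generator D = ABC
print(resolution(["ABD", "ACE"]))  # generators D = AB and E = AC
```

The first design is of resolution IV and the second of resolution III, so in the second one main effects are confounded with two-factor interactions.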

In practice, before deciding on a particular factorial design, it is important to check its

resolution and its confounding pattern to make sure that it fits the experimental objectives!

Examples of factorial designs

A screening situation with three design variables is illustrated in the two examples below:

Options for screening design with three design variables

Full factorial (left) and fractional factorial (right) designs illustrated. The design points are

marked red. The points in the fractional factorial design are selected so as to cover the

maximum volume of the design space.

Plackett-Burman designs

If the experimental objective is to study the main effects only, and there are many design

variables to investigate (e.g. > 10), Plackett-Burman (PB) designs may be the solution. They

are very economical, since they require only one to four more experiments than the number

of design variables.

Plackett–Burman designs (Plackett and Burman, 1946) are experimental designs developed

while the authors were working in the British Ministry of Supply. Their goal was to find

experimental designs for investigating the dependence of a measured quantity on a

number of independent variables (factors), each taking L levels. The designs were developed

in such a way as to minimize the variance of the estimates of these dependencies using a

limited number of experiments. Interactions between the factors were considered

negligible. The solution to this problem is to find an experimental design in which each

combination of levels for any pair of factors appears the same number of times. A complete

factorial design would satisfy this criterion, but the idea was to find smaller designs. An

example of a PB design is provided below.

Plackett–Burman design for 12 runs and up to 11 two-level factors

Run A B C D E F G H J K L

1 + − + − − − + + + − +

2 + + − + − − − + + + −

3 − + + − + − − − + + +

4 + − + + − + − − − + +

5 + + − + + − + − − − +

6 + + + − + + − + − − −

7 − + + + − + + − + − −

8 − − + + + − + + − + −

9 − − − + + + − + + − +

10 + − − − + + + − + + −

11 − + − − − + + + − + +

12 − − − − − − − − − − −

For the case of two levels (L=2), Plackett and Burman used the construction of Paley (Paley,

1933) for generating orthogonal matrices whose elements are all either 1 or -1 (Hadamard

matrices). Paley’s method could be used to find such matrices of N rows for most N equal to

a multiple of 4. In particular, it worked for all such N up to 100 except N = 92. If N is a power

of 2, however, the resulting design is identical to a fractional factorial design. In The

Unscrambler® the maximum limit of N is 36, which can accommodate n = N-1 = 35 design

variables (main effects). If there are fewer than N-1 effects to estimate, a subset of the

columns of the matrix is used.
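
The 12-run table above can be rebuilt from its first row by cyclic shifts, and the pairwise balance criterion can then be checked directly. This is a minimal sketch that only reproduces the table shown here; it does not describe how The Unscrambler® generates its designs.

```python
from itertools import combinations
from collections import Counter

# First row of the 12-run table above, used as the generator row.
first_row = [1, -1, 1, -1, -1, -1, 1, 1, 1, -1, 1]

# Rows 1-11 are successive cyclic right-shifts; row 12 is all low levels.
rows = [first_row[-i:] + first_row[:-i] for i in range(11)]
rows.append([-1] * 11)

def pair_counts(col_a, col_b):
    """Count how often each level combination of two factors occurs."""
    return Counter((row[col_a], row[col_b]) for row in rows)

# Every level combination of every pair of factors should occur 3 times.
balanced = all(
    len(pair_counts(a, b)) == 4 and set(pair_counts(a, b).values()) == {3}
    for a, b in combinations(range(11), 2)
)
print(balanced)
```

With fewer than 11 design variables, the same check applies to whichever subset of columns is kept.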

The price to pay for estimating all these effects in a minimum number of runs is the very

complex confounding patterns of Plackett-Burman designs. Main effects are often partially

confounded with several interactions, and these designs should therefore be used very

carefully.

The Unscrambler® provides two classical types of optimization designs:

Central composite designs for 2 to 6 continuous design variables;

Box-Behnken designs for 3 to 6 continuous design variables.

Central composite designs (CCD) are extensions of two-level full factorial designs. A CCD

enables a quadratic model to be fitted by including new levels in addition to the regular

lower and upper levels.

A CCD consists of three types of experiments:

Factorial (cube) samples are experiments which combine the regular lower and

upper levels of the design variables; they are the “factorial” part of the design;

Center samples are replicates of the experiment for which all design variables are at

their mid-level;

Axial (star) samples are located such that they extend beyond the factorial levels of

the design for one factor at a time, all other design variables being at their

mid-level. These samples are specific to CCD designs.

Properties of a CCD

The properties of the simplest CCD, with two design variables, are shown below.

Central composite design with two design variables

From the figure it can be seen that each design variable has five levels: 1) low axial, 2) low

factorial, 3) center, 4) high factorial, and 5) high axial. Low factorial and high factorial are the

lower and upper levels that are specified when defining the design variable.

The four factorial samples are located at the corners of a square (or a cube if there

are three variables, or a hypercube if there are more);

The center samples are located at the center of the square;

The four axial samples are located outside the square; by default, their distance to

the center is set to ensure rotatability (see below).
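
The three groups of samples can be generated in coded units with a short sketch. The function name, the default of three center samples and the use of -1/+1 for the lower and upper levels are assumptions; the axial distance `alpha` is left as a parameter.

```python
import itertools

def central_composite(k, alpha, n_center=3):
    """Coded-unit CCD: 2^k cube points, 2k axial points, n_center centers."""
    cube = [list(p) for p in itertools.product([-1.0, 1.0], repeat=k)]
    axial = []
    for i in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[i] = sign * alpha  # one factor beyond the cube, others at mid-level
            axial.append(point)
    center = [[0.0] * k for _ in range(n_center)]
    return cube, axial, center

cube, axial, center = central_composite(k=2, alpha=2 ** 0.5)
print(len(cube), len(axial), len(center))  # 4 4 3
```

For two design variables this gives the four corners of the square, four axial points outside it, and the replicated center.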

Because we do not know the position of the response surface optimum, we try to ensure

that the prediction error is the same for any point at the same distance from the center of

the design. This property is called rotatability, as the design axes can be rotated around the

origin without influencing the variance of the predicted response. This implies that the

information carried by any design point will have equal weight on the analysis, i.e. the design

points will have equal leverage. This property is important if one wants to achieve uniform

quality of prediction in all directions from the center. The distance that ensures rotatability

is given by $2^{k/4}$, k being the number of factors.

A spherical design is one in which all factorial and axial points have the same distance from

the origin. The 2- and 4-factor rotatable designs are also spherical designs (distance given by

$k^{1/2}$).
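
The two distances can be compared numerically. The sketch below assumes coded units, with the factorial points at -1 and +1, and a full two-level factorial as the cube part of the design; `rotatable_alpha` is a hypothetical helper name.

```python
import math

def rotatable_alpha(k):
    """Axial distance 2^(k/4) for a CCD built on a full 2^k factorial."""
    return 2 ** (k / 4)

for k in range(2, 7):
    alpha = rotatable_alpha(k)
    cube_distance = math.sqrt(k)  # distance of a cube corner from the origin
    spherical = math.isclose(alpha, cube_distance)
    print(k, round(alpha, 3), round(cube_distance, 3), spherical)
```

The two distances coincide only for k = 2 and k = 4, which is why only those rotatable designs are also spherical.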

Types of CCD

Circumscribed central composite design (CCC)

This general type is the one described in the previous section, with factorial points

defined at the lower and upper levels and with axial points outside of these ranges.

Faced central composite design (CCF)

If for some reason one cannot use levels outside the factorial range, one can tune

the axial point distances down such that these points lie at the center of the cube

faces. This is called a faced central composite design (CCF). CCF designs are not

rotatable.

Inscribed central composite design (CCI)

Another way to keep all experiments within the pre-defined range is to use an axial

sample distance that ensures rotatability, but to shrink the entire design such that

the axial points fall on the pre-defined levels. This will result in a smaller investigated

range, but will guarantee a rotatable design. This is called an inscribed central

composite design (CCI).
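
The difference between the three types can be summarized by where the factorial and axial points sit relative to the user-defined range. The sketch below uses coded units in which -1 and +1 are the defined lower and upper levels, and assumes the rotatable axial distance for a full factorial cube; the function name is hypothetical.

```python
def ccd_levels(kind, k):
    """Return (factorial level, axial level) in coded units, where +-1 marks
    the user-defined lower and upper levels of each design variable."""
    alpha = 2 ** (k / 4)  # rotatable axial distance for a full 2^k cube
    if kind == "CCC":
        return 1.0, alpha          # axial points outside the defined range
    if kind == "CCF":
        return 1.0, 1.0            # axial points on the cube faces
    if kind == "CCI":
        return 1.0 / alpha, 1.0    # whole design shrunk to fit the range
    raise ValueError(f"unknown design type: {kind}")

for kind in ("CCC", "CCF", "CCI"):
    print(kind, ccd_levels(kind, k=2))
```

In the CCI case the ratio between axial and factorial distance is unchanged, which is why rotatability is preserved while the investigated range shrinks.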

Efficiency of the CCD

Depending on the constraints of the experiments and the accuracy to achieve, select the

appropriate CC design using the following table:

Central composite design: constraints and accuracy

| Design | Number of levels | Uses points outside high and low levels | Accuracy of estimates |
|---|---|---|---|
| Circumscribed | 5 | Yes | […] |
| Inscribed | 5 | No | […] design space |
| Faced | 3 | No | […] for pure quadratic coefficients |

Box-Behnken designs

Box-Behnken designs are not built on a factorial basis, but they are nevertheless good

optimization designs for second order models.

In a Box-Behnken design, all design variables have three levels: low cube, center, and high

cube. Each experiment combines the extreme levels of two or three design variables with

the mid-levels of the others. In addition, the design includes a number of center samples.

The properties of Box-Behnken designs are the following:

The actual range of each design variable is low cube to high cube, which makes it

easy to handle;

All non-center samples are located on a sphere, achieving rotatability for the 4-

factor design, and almost rotatability for the designs with 3, 5, or 6 factors.
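
For three design variables, the construction can be sketched as follows. The example assumes coded units (-1, 0, +1) and the pairwise construction only; center samples are omitted and the function name is hypothetical.

```python
import itertools
import math

def box_behnken_3():
    """Three-factor Box-Behnken points (coded units), without center samples."""
    points = []
    for i, j in itertools.combinations(range(3), 2):
        for a, b in itertools.product([-1, 1], repeat=2):
            point = [0, 0, 0]
            point[i], point[j] = a, b  # two factors at extremes, the third at mid-level
            points.append(tuple(point))
    return points

points = box_behnken_3()
radii = {math.sqrt(sum(x * x for x in p)) for p in points}
print(len(points), radii)
```

All twelve non-center points are at the same distance from the center, i.e. they lie on a sphere, and none of them is a corner of the cube.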

| Design | Number of levels | Uses points outside high and low levels | Accuracy of estimates |
|---|---|---|---|
| Box-Behnken | 3 | No | […] uncertainty on the edge of the design area |

A central composite design for three design variables is shown here:

Central composite design with three design variables

The figure below shows the Box-Behnken design drawn in two different ways. In the left

drawing one can see how it is built, while the drawing to the right shows how the design is

rotatable.

Box-Behnken design

This chapter introduces “tricky” situations in which classical designs based upon the factorial

principle do not apply. Here two related cases will be discussed:

General constraints in which the allowed levels of a design variable depend on the

levels of one or more of the other design variables: linear constraints;

The special case of mixture situations, in which the levels of all design variables sum

to a fixed, total amount.

Each of these situations will then be described extensively in the following sections.

Note: Understanding the sections that follow requires basic knowledge about the

purposes and principles of experimental design. If the principles of experimental

design are unfamiliar, the user is strongly urged to read about it in the previous

sections (see What Is Experimental Design?) before proceeding with this section.

Mixture designs

A simple mixture design example

We will start describing the mixture situation by using an example.

A product development specialist has a specific problem to solve related to the optimization

of a pancake mix. The mix consists of the following ingredients: wheat flour, sugar and egg

powder. It will be sold in retail units of 100 g, to be mixed with milk for reconstitution of

pancake batter.

The product developer has learned about experimental design, and tries to set up an

adequate design to study the properties of the pancake batter as a function of the amounts

of flour, sugar and egg in the mix. She starts by plotting the region that encompasses all

possible combinations of those three ingredients, and soon discovers that it has a distinct

shape.

The pancake mix experimental region

The reason, as you may have guessed, is that the mixture always has to add up to a total of

100 g. This is a special case of multilinear constraint, which can be written with a single

equation:

$$\text{Flour} + \text{Sugar} + \text{Egg} = 100\ \text{g}$$

This is called the mixture constraint: the amounts of all mixture components always have to

sum to 100% of the total product. This means that if you know the amounts of flour and

sugar in the mix, the amount of egg can be deduced by subtraction from 100%. In other

words, even if there are three mixture components, only two of them can be varied

independently at any time. The practical consequence is that the mixture region defined by

three ingredients is not a three-dimensional region! It is contained in a two-dimensional

surface called a simplex.

A simplex is a generalization of a triangle in possibly higher dimensions. If there are N

mixture components, the dimensionality of the simplex is N-1. For instance, for 4 mixture

components, the simplex is a tetrahedron. There is a special class of designs called mixture

designs which are based on regular simplexes.

Designs based on a simplex

Since the region defined by the three mixture components in the previous example is a two-

dimensional surface, we cannot use a factorial design to analyze the design region. Rather,

the design region is given below.

The pancake mix simplex

This simplex contains all possible combinations of the three ingredients flour, sugar and egg.

One can see that it is completely symmetrical. One could substitute egg for flour, sugar for

egg and flour for sugar in the figure, and still get exactly the same shape.

Classical mixture designs, first introduced by Scheffé (1958), take advantage of this symmetry.

They include a varying number of experimental points, depending on the purposes of the

investigation. But whatever this purpose and whatever the total number of experiments,

these points are always symmetrically distributed, so that all mixture variables play equally

important roles.

These designs thus ensure that the effects of all investigated mixture variables will be

studied with the same precision. This property is equivalent to the properties of factorial,

central composite or Box-Behnken designs for non-constrained situations.

The figure below shows two examples of classical mixture designs.

Two classical designs for three mixture components

The first design is very simple. It contains three vertices (pure mixture components), three

edge centers (binary mixtures) and only one ternary mixture, the centroid. The second

design contains more points, spanning the mixture region regularly in a triangular lattice

pattern. It contains all possible combinations (within the mixture constraint) of five levels of

each ingredient. It is similar to a five-level full factorial design - except that many

combinations, such as “25%, 25%, 25%” or “50%, 75%, 100%”, are excluded because they

are outside the simplex.
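The second design can be reproduced with a few lines of code. The following is a minimal sketch, assuming three components with the five levels 0, 25, 50, 75 and 100%; the function name `simplex_lattice` is hypothetical and is not part of The Unscrambler®. It enumerates the five-level full factorial and keeps only the combinations that satisfy the mixture constraint.

```python
from itertools import product

def simplex_lattice(n_components, degree):
    """Return all blends (in %) on a regular lattice of the given degree.

    Each component takes the levels 0, 100/degree, ..., 100, and only
    combinations that sum to 100% (the mixture constraint) are kept.
    """
    blends = []
    for counts in product(range(degree + 1), repeat=n_components):
        if sum(counts) == degree:  # integer form of "sums to 100%"
            blends.append(tuple(100.0 * c / degree for c in counts))
    return blends

lattice = simplex_lattice(3, 4)      # five levels: 0, 25, 50, 75, 100
full_factorial_size = 5 ** 3         # all combinations without the constraint
print(len(lattice), "of", full_factorial_size, "combinations lie on the simplex")
for blend in lattice:
    print(blend)
```

Working with integer counts rather than percentages avoids rounding problems when testing whether a blend sums to 100%.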

Simplex with different boundaries

This example, taken from John A. Cornell’s reference book “Experiments With Mixtures” (Cornell, 1990), illustrates how additional constraints are sometimes useful in practical situations.

A fruit punch is to be prepared by blending three types of fruit juice: watermelon, pineapple

and orange. The purpose of the manufacturer is to use their large supplies of watermelons

by introducing watermelon juice, of little value by itself, into a blend of fruit juices.

Therefore, the fruit punch should contain at least 30% of watermelon juice. Pineapple and

orange have been selected as the other components of the mixture.

The manufacturer decides to use design of experiments to find the combination of fruit

juices that scores highest in a consumer preference survey. The ranges of variation selected

for the experiment are as follows:

Ranges of variation for the fruit punch design

Ingredient Low High Centroid

The resulting experimental design has a number of features that makes it very different from

a factorial or central composite design.

First, the ranges of variation of the three variables are not independent. Since watermelon

has a lower level of 30%, the high level of pineapple cannot exceed 100 - 30 = 70% (in which

case the orange content would be 0%). The same holds true for orange.

The second feature concerns the levels of the three variables for the point called the

“centroid”: these levels are not halfway between “low” and “high”, they are closer to the

low level. The reason is, once again, that the blend has to add up to a total of 100%.
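The position of the centroid can be checked numerically. The sketch below assumes that pineapple and orange have a lower level of 0% (this is implied by, but not stated in, the text above), so that the constrained region is a triangle whose vertices are the three blends listed in the code. The helper `centroid` is a hypothetical name used for illustration only.

```python
def centroid(points):
    """Average a list of blends component by component."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

# Vertices of the constrained region as (watermelon, pineapple, orange) in %.
# Assumption: watermelon >= 30%, pineapple >= 0%, orange >= 0%.
vertices = [(100.0, 0.0, 0.0), (30.0, 70.0, 0.0), (30.0, 0.0, 70.0)]

c = centroid(vertices)
print([round(x, 1) for x in c], "sum =", round(sum(c), 1))

# Midpoints of the individual ranges, for comparison.
midpoints = ((30 + 100) / 2, (0 + 70) / 2, (0 + 70) / 2)
print(midpoints, "sum =", sum(midpoints))
```

The midpoints of the three ranges add up to more than 100%, so they do not describe a feasible blend, whereas the centroid of the vertices does.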

Since the concentrations of the ingredients cannot vary independently of each other, these

variables cannot be handled in the same way as the design variables encountered in a

factorial design. Whenever the ranges of the mixture components result in a simplex design region, a selection of classical mixture designs is available instead. One example of a mixture design for the optimization of Cornell’s fruit punch is shown below. The design region remains a simplex even though the lower boundary of watermelon juice has been increased.

Design for the optimization of fruit punch

In a screening situation, the primary objective is to study the main effects of each of the

mixture components. The main effect of an input variable is the change occurring in the

response variable when the input varies from low to high, all experimental conditions being

otherwise comparable.

In a factorial design, the levels of the design variables are combined in a balanced way, so

that one can follow what happens to the response value when a particular design variable

goes from low to high. It is possible to compute the main effect of that design variable

without regard to the remaining factors, because its low and high levels have been

combined with the same levels of all the other design variables.

In a mixture situation, this is no longer possible, as demonstrated in the previous figure.

While 30% watermelon can be combined with e.g. (70% P, 0% O) or (0% P, 70% O), 100%

watermelon can only be combined with (0% P, 0% O).

To find a solution to this problem the concept of “otherwise comparable conditions” must

be adapted to the constrained mixture situation. To screen what happens when watermelon

varies from 30% to 100%, this variation must be compensated in such a way that the mixture

still adds up to 100%, without disturbing the balance of the other mixture components. This

is achieved by moving along an axis where the proportions of the other mixture components

remain constant. In practice such mixtures are easily achieved by starting with the low level

of the component in question while having equal proportions of the remaining

components. Subsequent addition of the first component to the mix would correspond to

moving up the axis. This is illustrated for the watermelon example in the figure below.

Studying variations in the proportion of watermelon

Mixture designs with points along the axes of the simplex are called axial designs. They are

best suited for screening purposes because they capture the main effect of each mixture

component in a simple and economical way.

An axial design in four components is represented in the next figure. It can be seen that

several points are located inside the simplex: they are mixtures of all four components. Only

the four corners, or vertices (containing the maximum concentration of an individual

component) are located on the surface of the experimental region.

A four-component axial design

Each axial point is placed halfway between the overall centroid of the simplex (25%, 25%,

25%, 25%) and a specific vertex. Thus the path leading from the centroid (“neutral”

situation) to a vertex (100% of a single component) is well described with the help of the

axial point.

In addition, end points can be included; they are located on the surface of the simplex,

opposite a vertex (they are marked by crosses on the figure). They contain the minimum

concentration of a specific component. When end points are included in an axial design, the

whole path leading from minimum to maximum concentration is studied. The figure above, “Design for the optimization of fruit punch”, is an example of a three-component axial design where end points have been included.
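The construction of an axial design can be sketched as follows for an unconstrained simplex, where each component ranges from 0 to 100%. The function name `axial_design` and the inclusion of the overall centroid as a design point are assumptions made for this illustration; the code is not taken from The Unscrambler®.

```python
def axial_design(n_components, include_end_points=True):
    """Build an axial design (in %) for an unconstrained simplex."""
    n = n_components
    centroid = tuple(100.0 / n for _ in range(n))
    design = {"centroid": [centroid], "vertices": [], "axial": [], "end": []}
    for i in range(n):
        vertex = tuple(100.0 if j == i else 0.0 for j in range(n))
        # Axial point: halfway between the overall centroid and the vertex.
        axial = tuple((c + v) / 2 for c, v in zip(centroid, vertex))
        # End point: component i at its minimum (0%), the others in equal
        # proportions, i.e. on the face of the simplex opposite vertex i.
        end = tuple(0.0 if j == i else 100.0 / (n - 1) for j in range(n))
        design["vertices"].append(vertex)
        design["axial"].append(axial)
        if include_end_points:
            design["end"].append(end)
    return design

design = axial_design(4)
for kind, points in design.items():
    for p in points:
        print(kind, [round(x, 1) for x in p])
```

Along each axis the proportions of the other components stay equal to each other, which is the mixture counterpart of "otherwise comparable conditions".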

For the optimization of the concentrations of several mixture components, one needs a

design that enables a highly accurate prediction for any mixture - whether it involves all

components or only a subset.

Peculiar behavior may occur when the concentration of a mixture component drops down to

zero. For instance, to prepare the base for a Dijon mayonnaise, one needs to blend Dijon

mustard, egg and vegetable oil. But what happens when the egg is removed from the

recipe? The resulting dressing will have a different appearance and texture. This illustrates

the importance of interactions (e.g. between egg and oil) in mixture applications.

Thus, an optimization design for mixtures will include a large number of blends of only two,

three, or more generally, a subset of the components to be studied. The most regular design

including those sub blends is called a simplex-centroid design. It is based on the centroids of

the simplex: balanced blends of a subset of the mixture components of interest. For

instance, to optimize the concentrations of three ingredients, each of them varying between

0 and 100%, the simplex-centroid design will consist of:

The three vertices (pure components): (100,0,0), (0,100,0) and (0,0,100);

The three edge centers (or centroids of the two-component subsimplexes defining binary mixtures): (50,50,0), (50,0,50) and (0,50,50);

The overall centroid: (33,33,33).

A 4-component simplex-centroid design

In general terms, if N mixture components vary from 0 to 100%, the blends forming the

simplex-centroid design are as follows:

The vertices are pure components;

The second order centroids (edge centers) are binary mixtures with equal proportions of the two selected components;

The third order centroids (face centers) are ternary mixtures with equal proportions of the three selected components;

More generally, the kth order centroids have equal proportions of the k selected components, any remaining components being zero.

Note: The overall centroid is a mixture where all N components have equal

proportions.
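The general rule above translates directly into a short enumeration over all non-empty subsets of the components. This is a minimal sketch with a hypothetical function name, `simplex_centroid`, assuming that all components vary from 0 to 100%.

```python
from itertools import combinations

def simplex_centroid(n_components):
    """Return the 2**n - 1 blends (in %) of a simplex-centroid design."""
    blends = []
    for order in range(1, n_components + 1):
        for subset in combinations(range(n_components), order):
            share = 100.0 / order  # equal proportions within the subset
            blends.append(tuple(share if i in subset else 0.0
                                for i in range(n_components)))
    return blends

for blend in simplex_centroid(3):
    print([round(x, 1) for x in blend])
```

For three components this gives the three vertices, the three edge centers and the overall centroid, seven blends in total.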

In addition, interior points can be included in the design. They improve the precision of the

results by “anchoring” the design with additional complete mixtures (i.e. mixtures where all

components are present), and they enable computation of cubic terms. The interior points

are located halfway between the overall centroid and each vertex, and they have the same

composition as the axial points in an axial design. When a design includes interior points, it is

said to be augmented. Note that for 3 mixture components, a centroid design augmented

with axial points equals an axial design with end points included (see e.g. fruit punch

example above).

Sometimes one may not be specifically interested in a screening or optimization design. One

may be doing exploratory experiments. For example, one may just want to investigate what

would happen if three ingredients that have never been mixed before were combined.

This is one of the cases where the main purpose is to cover the mixture region as evenly and

regularly as possible. Designs that address that purpose are called simplex-lattice designs.

They consist of a network of points located at regular intervals between the vertices of the

simplex. Depending on how thoroughly you want to investigate the mixture region, the

network will be more or less dense, including a varying number of intermediate levels of the

mixture components. As such, it is quite similar to an N-level full factorial design. The figure

below illustrates this similarity.

A fourth degree simplex-lattice design is similar to a five-level full factorial

Simplex-lattice designs serve a variety of purposes depending on their degree (number of intervals between points along the edge of the simplex). Here are a few:

Feasibility study (degree one or two): are the blends feasible at all?

Optimization: with a lattice of degree three or more, there are enough points to fit a

precise response surface model.

Search for a special behavior or property which only occurs in an unknown, limited

subregion of the simplex.

Calibration: prepare a set of blends on which several types of properties will be

measured, in order to fit a regression model to these properties. For instance, one

may wish to relate the texture of a product, as assessed by a sensory panel, to the

parameters measured by a texture analyzer. If it is known that texture is likely to

vary as a function of the composition of the blend, a simplex-lattice design is

probably the best way to generate a representative, balanced calibration data set.

D-optimal designs

A simple design subject to linear constraints

A manufacturer of prepared foods wants to investigate the impact of several processing

parameters on the sensory properties of cooked, marinated meat. The meat is to be first

immersed in a marinade, then steam-cooked, and finally deep-fried. The steaming and frying

temperatures are fixed; the marinating and cooking times are the process parameters of

interest. The process engineer wants to investigate the effect of the three process variables

within the following ranges of variation:

Ranges of the process variables for the cooked meat design

Process variable    Low   High
Marinating time     6     18
Steaming time       5     15
Frying time         5     15

A full factorial design would give the following factorial (cube) experiments:

The cooked meat full factorial design

Sample   Mar. Time   Steam. Time   Fry. Time
1        6           5             5
2        18          5             5
3        6           15            5
4        18          15            5
5        6           5             15
6        18          5             15
7        6           15            15
8        18          15            15

After carefully analyzing this table, the process engineer expresses strong doubts that

experimental design can be of any help in this situation.

“Why?” asks the statistician in charge. “Well,” replies the engineer, “if the

meat is steamed then fried for 5 minutes each it will not be cooked, and at

15 minutes each it will be overcooked and burned on the surface. In either

case, we won’t get any valid sensory ratings, because the products will be far

beyond the ranges of acceptability.”

After some discussion, the process engineer and the statistician agree that an additional

condition should be included:

“In order for the meat to be suitably cooked, the sum of the two cooking

times should remain between 16 and 24 minutes for all experiments”.

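This type of restriction is called a multilinear constraint. In the current case, it can be written in a mathematical form requiring two equations, as follows:

$$\text{Steaming time} + \text{Frying time} \geq 16$$

$$\text{Steaming time} + \text{Frying time} \leq 24$$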

The impact of these constraints on the shape of the experimental region is shown in the two

figures below:

The constrained experimental region is no longer a cube! It follows that a full factorial design

poorly explores that region.

The design that best spans the new region is given in the table below:

The cooked meat constrained design

Sample   Mar. Time   Steam. Time   Fry. Time
1        6           5             11
2        6           5             15
3        6           9             15
4        6           11            5
5        6           15            5
6        6           15            9
7        18          5             11
8        18          5             15
9        18          9             15
10       18          11            5
11       18          15            5
12       18          15            9

This design contains all “corners” of the experimental region, in the same way as the full

factorial design does when the experimental region has the shape of a cube.
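The corners of the constrained region can be found by clipping each edge of the original square of steaming and frying times with the condition that their sum stays between 16 and 24 minutes. The following sketch is one simple way to do this for the present example; it is not the method used by The Unscrambler®, and it would miss corners formed where two non-parallel constraints cross inside the region (which cannot happen here).

```python
from itertools import product

MAR = (6, 18)                 # marinating time levels
STEAM = (5, 15)               # steaming time range, minutes
FRY = (5, 15)                 # frying time range, minutes
SUM_LOW, SUM_HIGH = 16, 24    # bounds on steaming + frying time

def feasible(steam, fry):
    """True if a (steaming, frying) pair satisfies ranges and constraint."""
    return (STEAM[0] <= steam <= STEAM[1] and FRY[0] <= fry <= FRY[1]
            and SUM_LOW <= steam + fry <= SUM_HIGH)

# 1. Which runs of the full factorial design survive the constraint?
cube = list(product(MAR, STEAM, FRY))
kept = [run for run in cube if feasible(run[1], run[2])]
print(len(kept), "of", len(cube), "factorial runs are feasible")

# 2. Corners in the (steaming, frying) plane: walk along each edge of the
#    square and clip it with the two bounds on the sum.
corners = set()
for steam in STEAM:
    lo = max(FRY[0], SUM_LOW - steam)
    hi = min(FRY[1], SUM_HIGH - steam)
    if lo <= hi:
        corners.update({(steam, lo), (steam, hi)})
for fry in FRY:
    lo = max(STEAM[0], SUM_LOW - fry)
    hi = min(STEAM[1], SUM_HIGH - fry)
    if lo <= hi:
        corners.update({(lo, fry), (hi, fry)})

# 3. Combine each corner with both marinating times.
design = sorted((m, s, f) for m in MAR for (s, f) in corners)
for run in design:
    print(run)
```

The resulting list contains the same twelve runs as the constrained design table above.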

Depending on the number and complexity of multilinear constraints, the shape of the

experimental region can be more or less complex. In the worst cases, it may be almost

impossible to imagine! Therefore, building a design to screen or optimize variables linked by

multilinear constraints requires special methods. The following section will introduce a

special class of designs beneficial for these situations. More complex examples will be given in the section Advanced topics for constrained situations, which covers ways to build constrained designs.

Introduction to the D-optimal principle

Those familiar with factorial designs are most likely aware that one of their most important

characteristics is their ability to study all effects independently of each other. This property,

called orthogonality, is important for relating variations in responses to variations in the

design variables. Without orthogonality, the estimated effects may become unreliable.

As soon as multilinear constraints are introduced among the design variables, it is no longer

possible to build an orthogonal design. Considering that the effect of a variable is estimated

on the premise that all other influences are held constant, it may not come as a surprise that

associations between design variables make the interpretations more difficult. In the more

severe cases of dependencies between variables, the effects will become indistinguishable

or the numerical calculations will fail. As soon as the variations in one of the design variables

are linked to those of another design variable, orthogonality cannot be achieved.

The D-optimal principle ensures that, based on a set of candidate points, the selected design matrix $X$ has columns as close to orthogonal as possible. Mathematically, this is achieved by maximizing the determinant of the information matrix $X'X$, which is known as the D-optimality criterion (the apostrophe meaning ‘transposed’). The volume of the joint confidence region of the resulting regression coefficients is thereby minimized, i.e. the precision of the model parameter estimates is maximized. An example of a design matrix $X$ could be the cooked meat constrained design table above, including some or all of the available design points (rows) as well as any center points or replicates. Also, any interaction or higher order terms would be included as additional columns in $X$.

Because the determinant of $X'X$ tends to increase as more experimental runs are included in the design, the D-optimality criterion is not well suited for comparing designs of different sizes. The related D-efficiency is independent of the number of runs:

$$\text{D-efficiency} = 100\% \times \frac{|X'X|^{1/p}}{n}$$

Here, n is the number of experimental runs and p is the number of model terms. The D-efficiency ranges from 0 to 100%, where a factorial design without center points has a D-efficiency of 100%. While a large design will tend to have a larger value of $|X'X|$ and yield a smaller confidence region for the parameters, the average point precision as estimated by the D-efficiency will be comparable for differently sized designs.
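As an illustration, the sketch below compares the D-efficiency of the cooked meat full factorial design with that of the twelve-run constrained design. It makes several assumptions that are not stated in the text: D-efficiency is taken as 100 × |X'X|^(1/p) / n, the design variables are rescaled so that low = -1 and high = +1, and the model contains an intercept and main effects only. The helper names are hypothetical.

```python
from itertools import product

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in matrix]
    n, det = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det

def d_efficiency(runs):
    """D-efficiency (%) for a model with intercept and main effects."""
    x = [[1.0] + list(run) for run in runs]        # design matrix X
    n, p = len(x), len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(p)]
           for i in range(p)]                      # information matrix X'X
    return 100.0 * max(determinant(xtx), 0.0) ** (1.0 / p) / n

def coded(value, low, high):
    """Rescale a level so that low maps to -1 and high to +1."""
    return (2.0 * value - (low + high)) / (high - low)

factorial = list(product((-1.0, 1.0), repeat=3))
corners = ((5, 11), (5, 15), (9, 15), (11, 5), (15, 5), (15, 9))
constrained = [(coded(m, 6, 18), coded(s, 5, 15), coded(f, 5, 15))
               for m in (6, 18) for s, f in corners]

print("full factorial:", round(d_efficiency(factorial), 1))
print("constrained   :", round(d_efficiency(constrained), 1))
```

Under these assumptions the factorial design reaches 100%, while the constrained design scores lower because its steaming and frying columns are correlated.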

Candidate design points

A point exchange algorithm is used to find the D-optimal design points in The Unscrambler®.

These points may optionally be augmented with a number of space filling points to ensure

good coverage also inside the experimental region. Both these procedures require a set of

candidate points as input. These points are set up in such a manner that they span the

maximum allowed design region as well as the interior region. The candidate points are:

All extreme vertices. These are the outer corners of the design region:

The extreme vertices of a square design region

All edge centers. These are defined as the midpoint between any two vertices constituting

an outer edge of the design region:

The edge centers of a square design region

All face centers. These are defined as the center point on any outer surface of the design

region as spanned by three or more edges:

The face centers of a square design region

The overall centroid. This is the center point of the design. For a design with only two design variables, the overall centroid coincides with the single face center.

All axial check blends. These are defined as the midpoint on any axis spanned by the overall

centroid and the extreme vertices. These do not improve the coverage of the outer design

region but can be very useful space filling points for more robust models:

The axial check blends of a square design region

A D-optimal design containing a specified number of D-optimal points is found based on the Fast Fedorov Exchange Algorithm (FFEA) (Nguyen and Piepel, 2005). Partially random

starting designs are used in which a smaller subset of points is selected randomly, and then

points are added one by one to maximize the D-efficiency. When the pre-specified number

of design points have been included the design is optimized using the FFEA. The best D-

optimal design is finally selected from several such partially random starts. This ensures that

a good design is found that is less likely to result from a local maximum.
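The general idea of a point exchange with several random starts can be sketched as follows. This is a deliberately simplified illustration and not the FFEA itself: it uses fully random starting designs, accepts any single swap that increases |X'X|, and recomputes the determinant from scratch at every step. The function names, the candidate grid and the main-effects model are assumptions made for the example.

```python
import random

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in matrix]
    n, det = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det

def info_det(points):
    """|X'X| for a model with an intercept and one linear term per variable."""
    x = [[1.0] + list(pt) for pt in points]
    p = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    return determinant(xtx)

def point_exchange(candidates, n_points, n_starts=5, seed=0):
    """Keep the best of several random starts, each improved by single swaps."""
    rng = random.Random(seed)
    best_design, best_det = None, -1.0
    for _ in range(n_starts):
        chosen = rng.sample(range(len(candidates)), n_points)  # no replacement
        current = info_det([candidates[i] for i in chosen])
        improved = True
        while improved:
            improved = False
            for pos in range(n_points):
                for j in range(len(candidates)):
                    if j in chosen:
                        continue  # a candidate is never selected twice
                    trial = chosen[:pos] + [j] + chosen[pos + 1:]
                    d = info_det([candidates[i] for i in trial])
                    if d > current + 1e-9:
                        chosen, current, improved = trial, d, True
        if current > best_det:
            best_design, best_det = [candidates[i] for i in chosen], current
    return best_design, best_det

# Candidate list: a 3 x 3 grid on a square region in coded units.
grid = [(x, y) for x in (-1, 0, 1) for y in (-1, 0, 1)]
design, det_value = point_exchange(grid, n_points=4)
print(sorted(design), det_value)
```

Each start stops when no single swap improves the determinant, so the result is only guaranteed to be a local maximum; repeating from several starts makes a poor local maximum less likely.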

The points are selected from the candidate list without replacement. This means that the

algorithm itself will never return replicates of the selected points, and the maximum number

of points is bounded by the number of candidate points in each case. The number of

additional center points (overall centroids) as well as the number of replicates for the entire

design is specified separately. This enables a higher level of user control over the

replications, and it favours a better spread of points over the design region compared to

selection with replacement. On the other hand the D-efficiency of the resulting design may

be slightly lower than if replication had been allowed. For practical use we believe the

benefits of a good spread in design points far outweigh a small reduction in D-efficiency

(see next section).

Addition of space filling points

The list of D-optimal points returned from the FFEA is optionally used as a starting point for

a subsequent Kennard-Stone selection process (Kennard and Stone, 1969). During this

process, the design is augmented with a specified number of space filling points in order to

span the entire design region as evenly as possible. These points are taken from the

remaining candidate list, i.e. the selection is based on candidate points that have not already

been selected in the point exchange algorithm.

While D-optimal designs provide precise model terms and good predictions of training data,

they tend to focus on the outer regions of the design space. It has been shown that designs

with samples spread evenly across the entire design region tend to be more robust in many

cases (Naes and Isaksson, 1989). Inclusion of space filling points by Kennard-Stone enables

better modeling of the interior design region and may therefore give more accurate

response surfaces and stable predictions when applying the model on new data. Also space

filling points tend to make the design less dependent on which model terms are included.

This is beneficial because the exact model equation is usually not known in advance.
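The space filling step can be illustrated with a maximin selection in the spirit of Kennard-Stone: at each step, the remaining candidate that lies farthest from its nearest already selected point is added. The sketch below assumes Euclidean distance in coded units and uses a hypothetical function name; the actual implementation in The Unscrambler® may differ in its details.

```python
import math

def kennard_stone_augment(selected, candidates, n_extra):
    """Add n_extra candidates, each time the one farthest from the design."""
    design = list(selected)
    remaining = [c for c in candidates if c not in design]
    for _ in range(n_extra):
        if not remaining:
            break
        # A candidate's distance to the design is the distance to its
        # nearest design point; pick the candidate maximizing that distance.
        best = max(remaining,
                   key=lambda c: min(math.dist(c, d) for d in design))
        design.append(best)
        remaining.remove(best)
    return design

# Start from four corner points and add one space filling point
# from a 3 x 3 candidate grid.
corners = [(-1, -1), (1, -1), (-1, 1), (1, 1)]
grid = [(x, y) for x in (-1, 0, 1) for y in (-1, 0, 1)]
augmented = kennard_stone_augment(corners, grid, 1)
print(augmented)
```

Starting from the four corners of a square, the first point added is the center, which is the candidate least well covered by the corner points.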

The condition number (C.N.)

In order to minimize the negative consequences of a deviation from the ideal orthogonal

case, one needs a measure of the “lack of orthogonality” of a design. This measure is

provided by the condition number (C.N.) (Golub, 1996):

C.N. = largest eigenvalue / smallest eigenvalue of the matrix $X'X$

It indicates the degree of multicollinearity in the design matrix as follows:

C.N. = 1: no multicollinearity, i.e. orthogonal

C.N. < 100: multicollinearity not a serious problem

100 < C.N. < 1000: moderate to severe multicollinearity

C.N. > 1000: severe multicollinearity

It is also linked to the elongation or degree of “non-sphericity” of the region actually

explored by the design. The smaller the condition number, the more spherical the region,

and the closer a design is to being orthogonal.
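For two design variables the condition number can be computed by hand, because the eigenvalues of a symmetric 2 × 2 matrix have a closed form. The sketch below assumes that the eigenvalues are those of X'X, that X contains only the two design variables in coded units (low = -1, high = +1) with no intercept column, and that the design is not degenerate (smallest eigenvalue above zero). The function name is hypothetical.

```python
import math

def condition_number_2d(runs):
    """Largest / smallest eigenvalue of X'X for two coded design variables."""
    a = sum(x * x for x, _ in runs)   # X'X = [[a, b],
    b = sum(x * y for x, y in runs)   #        [b, c]]
    c = sum(y * y for _, y in runs)
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    return (mean + radius) / (mean - radius)

# Full factorial design in two variables.
factorial = [(-1, -1), (1, -1), (-1, 1), (1, 1)]

# The six corners of the constrained steaming/frying region in coded units,
# e.g. 11 minutes on a 5 to 15 minute range becomes 0.2.
constrained = [(-1, 0.2), (-1, 1), (-0.2, 1), (0.2, -1), (1, -1), (1, -0.2)]

print(condition_number_2d(factorial))              # 1.0
print(round(condition_number_2d(constrained), 1))  # about 5.4
```

The constrained design is clearly non-orthogonal, but its condition number stays far below 100, so by the scale above multicollinearity is not a serious problem.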

Another important property of an experimental design is its ability to explore the whole

region spanned by the design variables. It can be shown that once the shape of the

experimental region has been determined by the constraints, the design with the smallest

condition number is the one that encloses maximal volume. It follows that if all extreme

vertices are included in the design, it has the smallest attainable condition number. If that

solution is too expensive, however, one needs to select a smaller number of points. The

consequence is that the condition number will increase and the enclosed volume will

decrease.

How good is the calculated design?

The condition number of an orthogonal design such as a non-modified factorial design is

exactly 1. Such a design has optimal properties in terms of interpretation, mathematical

robustness and economical considerations. The condition number of a non-orthogonal

(constrained) design will always be larger than one, and the larger the deviation, the less

favorable is the design. In general, caution should be exercised when analyzing a non-orthogonal design using classical DoE Analysis (ANOVA/MLR). The Unscrambler® suggests analysis by Partial Least Squares Regression for D-optimal designs, as correlated effects are handled much better by this method and misinterpretations will be rare.

If the design has a condition number much larger than, say, 100, this is an indication that the experimental region is heavily constrained. In such a case any of several design factors may influence the response, but it is impossible to find out which one (ANOVA might suggest one of them arbitrarily, whereas PLSR will correctly reveal that they are all correlated with the response). This may occur when there is insufficient individual variation in the design levels compared to the noise level of the experiment. To ensure sufficient orthogonal variation for each effect, it is recommended that all of the design variables and constraints be critically re-examined. One should search for ways to simplify the problem (see the section on Advanced Topics for Constrained Situations); otherwise there is the risk of starting an expensive series of experiments which will not give any useful information.

We will use the marinated meat example above to illustrate a design with multilinear

constraints. For simplification, we can focus on the “Steaming time” and “Frying time” and

take into account only one constraint:

The figure below shows the impact of the constraint on the variations of the two design

variables.

The constraint cuts off one corner of the “cube”

A full factorial design applied to this situation would result in a sub-optimal solution that left

one half of the experimental region unexplored (i.e. the triangle spanned by the remaining 3

points). So where should we place the 4th point in order to span the experimental region as

well as possible?

We could imagine two candidate points where the dashed line of the linear constraint

crosses the factorial design region in the above figure. Two alternative solutions for selecting

4 design points are illustrated below.

Designs with four points leaving out a portion of the experimental region

Design II in the figure seems to be a better option than design I, because the excluded region

is smaller. A design using points (1, 3, 4, 5) would be equivalent to (I), and a design using

points (1, 2, 4, 5) would be equivalent to (II). The worst solution of all would be a design with

points (2, 3, 4, 5): this would leave out the whole corner defined by points 1, 2 and 5.

It follows that if the whole experimental region was to be explored, more than four points

would be needed. The above example shows that a minimum of five points (1, 2, 3, 4, 5) are

necessary. These five crucial points are the extreme vertices of the constrained experimental

region. They have the following property: if a sheet of paper was wrapped around those

points, the shape of the experimental region would appear, revealed by the wrapping.

If there are more than two design variables or multiple constraints, it might not be straightforward to find the best set of design points. The D-optimal criterion is commonly used to

find the best design in these situations.

D-optimal designs may also be used for analyzing mixtures. This is useful if there are upper

constraints on some of the mixture components such that the design region is non-simplex

(refer to the section, Is the Mixture Region a Simplex?). While the regular mixture designs

cannot handle these cases, a D-optimal design can be used by including a constraint that all

mixture components should sum to 100%. Additional upper or lower levels on any of the

mixture components will then have to be added as separate multilinear constraints.

Note: Classical mixture designs have much better properties than D-optimal

designs. Remember this before establishing additional constraints on mixture

components.

Process/mixture designs

Sometimes the product properties of interest depend on a combination of a mixture recipe

with specific process settings. In such cases, it is useful to investigate mixture and process

variables together. The process variables and the mixture variables are then combined using

the pattern of subfactorial designs and a D-optimal design can be generated.

This section presents an overview of the various types of samples to be found in

experimental designs, along with their properties.

Factorial (cube) samples

Factorial samples can be found in factorial designs and their extensions. They are a

combination of high and low levels of the design variables in experimental plans based on

two levels of each variable. This forms a square for 2 variables or a (multidimensional) cube

for 3 (or more) variables. These samples are therefore sometimes referred to as cube

samples.

The same factorial design points are also found among other samples in central composite

designs. In Box-Behnken designs, all samples found on the factorial cube are also called

factorial samples (even though these design points are positioned on the edges rather than

the vertices of the cube).

All combinations of levels of the design variables in N-level full factorials are also called

factorial samples.

Center samples

Center samples are samples for which each design variable is set at its mid-level. When all

variables are continuous, the center points are located at the exact center of the

experimental region.

Center samples are not defined for categorical factors. When there is a combination of

continuous and category variables in the design, center points corresponding to the mid-

level of all continuous factors can be added for each unique combination of levels for up to 4

category variables.

For instance, if the number of two-level category variables in the design is (1, 2, 3, 4), this

results in (2, 4, 8, 16) single replicate center points, respectively. If two replicates of center

points are required, this doubles the total number of center points in the design. If we have

a three variable full factorial design with two two-level categorical variables, there are four

unique center points corresponding to the different level combinations of the categorical

factors. If 2 replicates of the center points are required, this results in 8 center points in

total.
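The count of center points can be verified by enumerating the level combinations of the category variables. In the sketch below, the function name, the level labels and the mid-level value 10 of the continuous variable are all hypothetical and serve only to illustrate the arithmetic.

```python
from itertools import product

def center_points(category_levels, center_values, replicates=1):
    """One center point per combination of category levels, times replicates."""
    points = []
    for combo in product(*category_levels):
        points.extend([tuple(center_values) + combo] * replicates)
    return points

# One continuous variable (mid-level 10) and two two-level category variables.
levels = [("A", "B"), ("X", "Y")]
points = center_points(levels, center_values=(10,), replicates=2)
for p in points:
    print(p)
print(len(points), "center points in total")

# Single replicate center points for 1 to 4 two-level category variables.
counts = [len(center_points([("lo", "hi")] * k, (10,))) for k in (1, 2, 3, 4)]
print(counts)
```

The total is the product of the numbers of levels of all category variables, multiplied by the number of replicates.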

The more levels the categorical variables have and the more replication is required, the faster the number of center points grows. When either the number of categorical variables or their number of levels becomes larger than 2, design replication may be a better option.

Center samples in screening designs. In screening designs, center samples are used for

curvature checking: Since the underlying model in such a design assumes that all main

effects are linear, it is useful to have at least one design point with an intermediate level for

all factors. Thus, when all experiments have been performed, one can check whether the

intermediate value of the response fits with the global linear pattern, or whether there are

signs of deviation from the straight line fit.

In the case of high curvature, one will have to build a new design which accepts a quadratic

model. The Unscrambler® provides an option to calculate curvature in a design when all

variables are continuous and at least one center point is present.

If at least 2 center samples are present (preferably 3), the model will also be tested for lack

of fit (LOF). This is a test comparing the variation of the measured responses within center

samples with the overall variation between measured and fitted (i.e. predicted) response

values. A significant LOF indicates that the model might benefit from additional terms.

In screening designs, center samples are optional; however, it is recommended that at least

three are included if possible. See the section on replicates for more details.

Center samples in optimization designs. In optimization designs, center samples are

important also for fitting higher order models. It is therefore recommended that 5 or more

are included in the design. In particular for Box-Behnken designs, ample center samples are

needed to fit a precise response surface.

Axial samples are used in Central Composite designs. Their coordinates often exceed the

low or high levels defined for the variable in question, while all other variables are at the

mid-level. The additional levels are beneficial for fitting a quadratic or cubic model to the

data.

Axial samples in a Central Composite design with two design variables

Axial samples can lie on centers of cube faces or they can lie outside the cube, at a given

distance from the center of the cube. This distance can be tuned, but it is recommended to

use the default distance (for the given design) whenever possible.

Three cases can be considered:

The default axial to center point distance ensures that all design samples have

exactly the same leverage, i.e. the same influence on the model. Such a design is

said to be “rotatable”. If the number of design variables is two or four, this distance

also ensures that all factorial and axial points lie at the same distance from the

center, giving a “spherical” design region. For other numbers of factors, rotatability

almost, but not quite, corresponds with a spherical design;

The axial to center point distance can be tuned down to 1. In that case, the star

samples will be located at the centers of the faces of the cube. This ensures that a

Central Composite design can be built even if levels lower than “low cube” or higher

than “high cube” are impossible. However, the design is no longer rotatable;

Any intermediate value for the star distance to center is also possible. The design

will not be rotatable.
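The relationship between rotatability and a spherical design region can be checked numerically. The sketch below assumes the common textbook formulas for a Central Composite design built on a full factorial cube: a rotatable axial distance of (2^k)^(1/4) and a spherical axial distance of sqrt(k), both in coded units. Whether these match the software defaults exactly is an assumption, and the function names are hypothetical.

```python
from math import sqrt, isclose

def rotatable_alpha(k):
    """Axial distance giving a rotatable CCD with a full 2^k factorial cube."""
    return (2 ** k) ** 0.25

def spherical_alpha(k):
    """Axial distance placing star samples at the same distance as the cube corners."""
    return sqrt(k)

# Print k, both distances, and whether they coincide
for k in range(2, 7):
    a_rot, a_sph = rotatable_alpha(k), spherical_alpha(k)
    print(k, round(a_rot, 3), round(a_sph, 3), isclose(a_rot, a_sph))
```

Under these assumptions the two distances coincide only for two and four design variables, in line with the statement above.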

Sample types in mixture designs

An overview of the various sample types used in mixture designs is provided below:

Axial design: vertex and axial samples, optionally end points and overall centroids;

Simplex-centroid design: vertex samples, centroids of various orders, optional

interior (axial) points;

Simplex-lattice designs: samples positioned in a regular grid (similar to multi-level

factorial samples), overall centroid.

Axial point

In a simplex design, an Axial point is positioned on the axis of one of the mixture

variables, half-way between the overall centroid and the vertex for that component.

Used in Axial designs and augmented Simplex-Centroid designs.

Centroid point

A Centroid point is calculated as the mean of the extreme vertices on a given
surface. Edge centers, Face centers and Overall Centroids are all examples of
centroid points. The number of mixture components involved in a centroid is
called the centroid order. For instance, in a four-component mixture, the overall
centroid is the fourth order centroid. Edge centers, or second order centroids, are
positioned in the center of the edges of the simplex. In The Unscrambler® the overall
centroid is denoted ‘Centroid’ while lower order centroids are referred to as ‘Blend’
points in Simplex-Centroid designs.

End point

In an axial design, ‘End’ points are optionally positioned at the bottom of the axis of

one of the mixture variables, and are thus on the opposite side to the axial point.

These are second order centroids and are referred to as Blend points in Simplex-

Centroid designs.

Face center

The face centers are positioned in the center of the faces of a simplex. They are also

referred to as third order centroids.

Interior point

An interior point is not located on the surface of a design, but inside the

experimental region. For example, an axial point is a particular kind of interior point.

Overall centroid

The overall centroid is calculated as the mean of all extreme vertices. It is the

mixture equivalent of a center sample.

Vertex sample

A vertex is a point where two lines meet to form an angle. Vertex samples are the

“corners” of the simplex corresponding to pure components.
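The sample types defined above can be illustrated by generating their coordinates for an unconstrained simplex, where each vertex is a pure component. The following Python sketch is for illustration only; the function names are hypothetical and the constrained-mixture case is not covered.

```python
from itertools import combinations

def centroids(q, order):
    """All centroids of a given order in a q-component simplex
    (equal proportions of `order` components, zero for the rest)."""
    points = []
    for subset in combinations(range(q), order):
        points.append(tuple(1.0 / order if i in subset else 0.0 for i in range(q)))
    return points

def axial_point(q, component):
    """Point half-way between the overall centroid and the vertex of one component."""
    centroid = [1.0 / q] * q
    vertex = [1.0 if i == component else 0.0 for i in range(q)]
    return tuple((c + v) / 2 for c, v in zip(centroid, vertex))

q = 4
print(len(centroids(q, 1)))  # 4 vertex samples
print(len(centroids(q, 2)))  # 6 edge centers (second order centroids)
print(len(centroids(q, 3)))  # 4 face centers (third order centroids)
print(centroids(q, 4))       # [(0.25, 0.25, 0.25, 0.25)], the overall centroid
print(axial_point(q, 0))     # (0.625, 0.125, 0.125, 0.125)
```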

Reference samples

Reference samples do not belong to a standard design, but are included for various

purposes.

Here are a few classical cases where reference samples are often used:

When trying to improve an existing product or process, the current recipe or process

settings may be used as a reference.

When trying to copy an existing product, for which the recipe is not known, one

might still include that product as reference and measure the responses on that

sample as well as on the others, in order to know how close the experimental

samples have come to that product.

To check curvature in the case where some of the design variables are category

variables, one can include one reference sample with center levels of all continuous

variables for each level (or combination of levels) of the category variable(s).

Note: For reference samples, only response values can be taken automatically into
account in the Analysis of Effects and Response Surface analyses. Values of the
design variables may, however, be entered manually after converting to a non-designed
data table; a PLS analysis can then be run on the resulting table.

Replicates

Replicates are experiments performed several times under reproduced conditions. They

should not be confused with repeated measurements, where the samples are only prepared

once but the measurements are performed several times on each.

Why include replicates?

Replicates are included in a design in order to estimate the experimental error associated

with the system. This is doubly useful as it:

Enables a comparison of the response variation due to controlled causes (i.e. due to

variation in the design variables) with uncontrolled response variation. If the

“explainable” variation in a response is no larger than its random variation, the

variations of this response cannot be related to the investigated design variables.

The usual strategy is to specify several replicates of the center sample. This has the

advantage of both being rather economical, and providing an estimation of the experimental

error under “average” conditions.

When no center sample can be defined (because the design includes category variables only

or variables with more than two levels), one may repeat the entire set of experimental

points instead. This also provides a better estimation of the experimental error across the

design region. If it is known that there is a lot of uncontrolled or unexplained variability in

the experiments, it might be wise to replicate the whole design.

The purpose of experimental design usually is to find out how variations in design variables

influence response variations. However, no matter how well the conditions of an

experimental setup are controlled, random variations still occur. The next sections describe

what can be done to limit the effect of random variations on the interpretation of the final

results.

Randomization

Randomization means that the experiments are performed in random order, as opposed to

the standard order which is sorted according to the levels of the design variables.

Most often, the experimental conditions are likely to “drift” during the course of the

investigation, such as when temperature and humidity vary according to external

meteorological conditions, or when the experiments are carried out by a new employee who

is better trained at the end of the investigation than at the beginning. It is crucial not to risk

confusing the effect of a change over time with the effect of one of the investigated

variables. To avoid such misinterpretation, the order in which the experimental runs are to

be performed is usually randomized.

Incomplete randomization

There may be circumstances which prevent the use of full randomization. For instance, one

of the design variables may be a parameter that is particularly difficult to tune, so that the

experiments will be performed much more efficiently if that parameter only needs to be

tuned a few times. Another case for incomplete randomization is blocking.

The Unscrambler® enables one to leave some variables out of the randomization. As a result,

the experimental runs will be sorted according to the non-randomized variable(s). This will

generate groups of samples with a constant value for those variables. Within these groups,

the samples will be randomized according to the remaining variables.
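The effect of leaving a variable out of the randomization can be sketched as follows: runs are first sorted by the non-randomized variable(s), then shuffled within each resulting group. This is an illustrative Python sketch only; the function name, the variable names and the design are hypothetical, and it does not reproduce the software's own run-order generator.

```python
import random
from itertools import groupby, product

def partially_randomize(runs, fixed_keys, seed=None):
    """Sort runs by the non-randomized variables, then shuffle within each group."""
    rng = random.Random(seed)
    key = lambda run: tuple(run[k] for k in fixed_keys)
    ordered = []
    for _, group in groupby(sorted(runs, key=key), key=key):
        group = list(group)
        rng.shuffle(group)
        ordered.extend(group)
    return ordered

# Hypothetical 2^3 design; "Temp" is hard to tune, so it is left out of the randomization
runs = [dict(zip(("Temp", "pH", "Time"), levels)) for levels in product((-1, 1), repeat=3)]
for run in partially_randomize(runs, fixed_keys=["Temp"], seed=1):
    print(run)
```

All runs with the low level of "Temp" are performed first, in random order with respect to the other two variables, followed by all runs with the high level.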

8.2.10 Blocking

In some situations it may not be possible to run all experiments under the exact same

conditions, or there may be other reasons to split the full set of runs into blocks that are

performed independently from the others in some sense. A common scenario is that raw

material comes from different batches, because there is not enough material in a single batch

to accommodate the full set of experiments. Often screening designs are extended into

factor influence studies, or factor influence studies are extended into optimization studies. If

this is performed in a planned manner, it will often be possible to re-use previous

measurements and supplement them with new ones. For instance, a low resolution

fractional factorial can be extended into a high resolution or full factorial design, which again

can be extended into a circumscribed or faced central composite design (see section

Extending a design below). Because these blocks of experiments are necessarily performed

at different points in time, there is a higher risk that non-controllable or unknown factors

differ between blocks. Whether such variation has an unwanted effect on the response

should always be investigated.

Any blocked experiment should be tested for unequal block means. For experiments where

measurements are divided into two distinct blocks, the response(s) can be tested using a

Student’s t-test for equality of means. A low p-value, or equivalently a large difference

between the plotted quantiles, indicates that there is a significant blocking effect. Any effect

confounded with blocks cannot be trusted if this is the case. Careful planning of the

experiment is required to avoid that effects of interest are confounded with, or non-

distinguishable from, blocks.
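For two blocks, the test statistic behind the comparison of block means can be computed by hand. The sketch below uses the pooled (equal variance) form of Student's t statistic with hypothetical response values; the p-value itself requires the t distribution, which is available in most statistics packages and is not computed here.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(block1, block2):
    """Two-sample Student's t statistic (equal variances assumed) and its degrees of freedom."""
    n1, n2 = len(block1), len(block2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(block1) + (n2 - 1) * variance(block2)) / df
    t = (mean(block1) - mean(block2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

# Hypothetical responses measured in two blocks
block_a = [10.0, 12.0, 11.0, 13.0]
block_b = [14.0, 16.0, 15.0, 17.0]
t, df = pooled_t_statistic(block_a, block_b)
print(round(t, 3), df)  # -4.382 6
```

With 6 degrees of freedom the two-sided 5% critical value is about 2.45, so a statistic of this size would point to a significant blocking effect in this made-up example.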

For any number of blocks the responses can be plotted in a quantiles plot, where the block

means and variances can be compared using the sample grouping option. If the distributions

of response values are similar across blocks, there is no evidence that block effects have had

an influence on the response.

Incomplete blocking of full factorial designs

If the full experiment is replicated, one should strive to include the full set of unique design

points in each block. This will ensure that any blocking effect is confounded with replicates

only, and all effects will be free of confounding with blocks. When all the treatment

combinations are included in each block, the design is referred to as a complete block design

and block effects should be tested as described above.

If this is not possible some effects will always be confounded with blocks, and the estimated

effects in question will include the block contribution as well. This is referred to as an

incomplete block design, and the efficiency of such a design depends on which effects are

confounded with blocks. Of course one would not want to create a design where any of the

main effects were confounded with blocks, as these main effects would be indistinguishable

from the block effects. Preferably the blocks should be set up such that they are confounded

with high order interactions only.

The Unscrambler® supports blocking of most full factorial experiments into 2^p blocks, p being

smaller than the number of design variables. A full factorial design with three 2-level factors

may be divided into two or four blocks. A full factorial design with 3-7 2-level factors may be

split into two, four or eight blocks. The blocking generators are selected to ensure that as

many low-order interactions as possible can be estimated without confounding with blocks.

For instance, in a six-variable design divided into two blocks, the blocking effect will be

confounded with the six-variable interaction only.
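The six-variable example can be sketched by assigning each run to a block according to the sign of the highest order interaction. The Python code below is a minimal illustration under that assumption; the function name is hypothetical and the blocking generators actually used by the software for other designs are not shown.

```python
from itertools import product
from math import prod

def block_by_interaction(n_factors, generator):
    """Assign each run of a 2^n full factorial to a block by the sign of one interaction."""
    blocks = {-1: [], 1: []}
    for run in product((-1, 1), repeat=n_factors):
        sign = prod(run[i] for i in generator)
        blocks[sign].append(run)
    return blocks

# Six two-level variables, blocked on the six-variable interaction ABCDEF
blocks = block_by_interaction(6, generator=range(6))
print(len(blocks[-1]), len(blocks[1]))  # 32 32

# Each main effect remains balanced (equal numbers of low and high levels) within each block
print(all(sum(run[i] for run in b) == 0 for b in blocks.values() for i in range(6)))  # True
```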

In the ANOVA, all interactions confounded with blocks will be summarized in a separate

sums of squares for blocks. These individual interaction effects will not be given or tested in

the ANOVA, as they are indistinguishable from the blocking effects.

After a series of designed experiments has been performed, the results analyzed and
conclusions drawn from them, two situations may occur:

The experiments have provided all the information needed, which means that the

project is completed.

The experiments have given valuable information which can be used to build a new

series of experiments that will lead closer to the experimental objective.

In the latter case, the new series of experiments can sometimes be designed as a

complement to, or an extension of, the previous design. This allows one to minimize the

number of new experimental runs, and the whole set of results from the two series of runs

can be analyzed together.

Why extend a design?

In principle, one should make use of the extension feature whenever possible, because it

enables progression to the next stage of an investigation using a minimum of additional

experimental runs.

Extending an existing design is also a convenient way of building a new, similar design that

can be analyzed together with the original one. For example, if a chemical reaction has been

investigated using a specific type of catalyst, one might want to investigate another type of

catalyst under the same conditions as the first reaction, in order to compare their

performances. This can be achieved by adding a new design variable, namely type of

catalyst, to the existing design.

Design extensions can also be used as a basis for an efficient sequential experimental

strategy. That strategy consists in breaking the initial problem into a series of smaller,

intermediate problems and investing in a small number of experiments to achieve each of

the intermediate objectives. Thus, if something goes wrong at one stage, the losses are cut;

and if all goes well, one may end up solving the initial problem at a lower cost than if a huge

design had been used initially.

When and how to extend a design

The following text briefly describes the most common extension cases:

Add levels: Used whenever one is interested in investigating more levels of already

included design variables, especially for category variables.

Add a design variable: Used whenever a parameter that has been kept constant is

suspected to have a potential influence on the responses, as well as when one

wishes to duplicate an existing design in order to apply it to new conditions that

differ by the values of one specific variable (continuous or category), and analyze the

results together. For instance, if a chemical reaction using a specific catalyst has

been investigated, and now another similar catalyst for the same reaction will be

studied to compare its performances to the other one’s, the first design can be

extended by adding a new variable: type of catalyst.

Delete a design variable: If the analysis of effects has established one or a few of the

variables in the original design to be clearly insignificant, the power of the

conclusions can be increased by deleting the variable(s) in question and reanalyzing the

design. Deleting a design variable can also be a first step before extending a

screening design into an optimization design. This option should be exercised with

caution if the effect of the removed variable is close to significance. Also be sure

that the variable to be removed does not participate in any significant interactions.

Add more replicates: If the first series of experiments shows that the experimental

error is unexpectedly high, replicating all experiments might make the results

clearer.

Add more center samples: In order to get a better estimation of the experimental

error, adding a few center samples is a good and inexpensive solution.

Add more reference samples whenever new references are of interest. More

replicates of existing reference samples may be used in order to get a better

estimation of the experimental error.

Extend to higher resolution: Use this option for fractional factorial designs where

some of the effects of interest are confounded with each other. This option can be

used whenever some of the confounded interactions are significant and one needs

to find out exactly which ones. This is only possible if there is a higher resolution

fractional factorial design available. Otherwise, one can extend to a full factorial

design instead.

Extend to full factorial: This applies to fractional factorial designs where some of the

effects of interest are confounded with each other and no higher resolution

fractional factorial designs are available.

Extend to central composite: This option completes a full factorial design by adding

star samples and (optionally) a few more center samples. Fractional factorial designs

can also be completed this way, by adding the necessary cube samples as well. This

should be used only when the number of design variables is small; an intermediate

step may be to delete a few variables first.

Conditions not represented in the design variables must be the same for the new

experimental runs as for the previous runs.

How to ensure representative new samples

As the new experiments will be exploring a new area of the design space, it is important to

be sure that there has been no drift since the first experiments have been performed.

To do so, try to use at least two or three new center samples. Once the experiments are
performed, run a t-test to compare the average of the first series of center samples with
that of the second. See the section on T-test (Introduction to statistical tests) or blocking
for more details.

How should experimental design be used in practice? Is it more efficient to build one global

design that tries to achieve the main goal, or would it be better to break it down into a

sequence of more modest objectives, each with its own design?

Even if the initial number of design variables to be investigated is rather small, it is
strongly advised to use the latter, sequential approach. This has at least four advantages:

Each step of the strategy consists of a design involving a reasonably small number of

experiments. Thus, the mere size of each subproject is more easily manageable.

A smaller number of experiments also means that the underlying conditions can

more easily be kept constant for the whole design, which will make the effects of

the design variables appear more clearly.

If something goes wrong at a given step, the damage is restricted to that particular

step.

If all goes well, the global cost is usually smaller than with one huge design, and the

final objective is achieved all the same.

The following example illustrates a sequential experimental strategy. The objective is to

optimize a process that relies on six parameters: A, B, C, D, E, F. As it is not known which of

these parameters are influential, one must start at the screening stage.

The most straightforward approach would be to try an optimization at once, by building a
CCD with six design variables. This is possible, but costly (at least 77 samples are required)
and risky: if a wrong initial assumption was made, such as a wrong choice of ranges of
variation, all experiments may be lost.

An alternative approach is described below:

First, build a fractional factorial design 2^(6-2) (resolution IV), with two center samples,

and perform the corresponding 18 experiments.

After analyzing the results, it turns out that only variables A, B, C and E have

significant main effects and/or interactions. But those interactions are confounded,

so the design needs to be extended in order to know which are really significant.

The first design is extended by deleting variables D and F and extending the

remaining part (which is now a 2^(4-1), resolution IV design) to a full factorial design

with one more center sample. Additional cost: nine experiments.

After analyzing the new design, the significant interactions which are not

confounded only involve A, B and C. The effect of E is clear and goes in the same

direction for all responses. But since the center samples show some curvature, one

must proceed to the optimization stage for the remaining variables.

Thus, variable E is kept constant at its most interesting level, and after deleting that

variable from the design, the remaining 2³ full factorial design is extended to a CCD

with six center samples. Additional cost: nine experiments.

Analysis of the final results yielded a desired optimum point. Final cost: 18+9+9=36

experiments, which is less than half of the initial estimate.
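The run counts in this example can be verified with a few lines of arithmetic. The sketch below assumes that a CCD on a full factorial cube needs 2^k cube samples, 2k star samples and at least one center sample, and that the three center samples from the first two steps are reused in the final CCD so that only three more are needed to reach six. The function name is hypothetical.

```python
def ccd_runs(k, center=1):
    """Runs in a CCD with a full 2^k cube: cube + 2k star samples + center samples."""
    return 2 ** k + 2 * k + center

one_shot = ccd_runs(6)                 # 64 + 12 + 1 = 77

step1 = 2 ** (6 - 2) + 2               # 2^(6-2) fractional factorial + 2 center samples
step2 = (2 ** 4 - 2 ** (4 - 1)) + 1    # complete the 2^4 full factorial + 1 center sample
step3 = 2 * 3 + (6 - 3)                # 6 star samples + 3 new center samples (assumed reuse)
total = step1 + step2 + step3

print(one_shot, step1, step2, step3, total)  # 77 18 9 9 36
```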

Simple data checks and graphical analysis

Any data analysis should start with simple data checks: use descriptive statistics, check

variable distributions, detect out-of-range values, etc.

For designed data, this is particularly important: one would not want to base a test of the

significance of the effects on erroneous data!

The good news is that data checks are even easier to perform when experimental design has

been used to generate the data. The reason for this is twofold:

If the design variables have any effect at all, the experimental design structure

should be reflected in some way or other in the response data; graphical analysis

and PCA will visualize this structure and help one detect abnormal features.

The Unscrambler® includes automatic features that take advantage of the design

structure (grouping according to levels of design variables when computing

descriptive statistics or viewing a PCA scores plot). When the structure of the design

shows in the plots (e.g. as subgroups in a box-plot, or with different colors on a

scores plot), it is easy to spot any sample or variable with an illogical behavior.

The ANOVA table is a powerful tool to assess how well the model fits individual responses. It

has a Summary section that provides information about the overall significance of the

model. This is followed by a Variables section providing information about the importance of

the different design variables and their interactions. A Model Check section divides the total

variance into variability explained by terms of different order. For factorial and lower order

CCD models, all effects are orthogonal, meaning that e.g. the effect of linear terms equals

the sum of individual contributions.

Mixture designs are not orthogonal, and variances are therefore no longer additive. For

these designs, the Variables section provides the so-called marginal (Type III) sums of

squares (SS), reflecting the difference in SS between the full model and a model with the

effect in question left out. In contrast, the model check section provides the sequential

(Type I) SS, reflecting the increase in model SS when higher order terms are added to the

model. The model check section can be used to decide the optimal complexity of the

mixture model. Higher order terms should not be included unless they contribute

significantly to the model fit.

There is a Lack of Fit section that compares the experimental uncertainty (pure error) with

the residual variability due to inadequate modeling of the data (lack of fit). The pure error is

estimated based on replicated measurements of center samples. A significant lack of fit is an

indication that additional terms may improve the model. At the bottom of the ANOVA table,

there is a section with different model quality estimates such as calibration and prediction

R², prediction error sums of squares (PRESS), etc. The PRESS value reflects the error variance

when each observation is left out from the calibration model once and subsequently

predicted. It reflects the predictive ability of the model and is therefore a conservative

estimate of how good the model is. A PRESS value close to (or higher than) the corrected

total SS means very low predictive ability and will give an ‘R-square prediction’ value close to

zero (or negative). R-square prediction closer to 1.0 means that the predictive ability is good

and the PRESS value is correspondingly small.
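The relationship between PRESS and ‘R-square prediction’ can be illustrated with a leave-one-out calculation. The sketch below uses a single-predictor linear model and hypothetical data for brevity, and assumes that R-square prediction is computed as 1 minus PRESS divided by the corrected total SS, which is consistent with the description above but not confirmed as the exact formula used by the software.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx

def press_and_r2_pred(x, y):
    """Leave-one-out PRESS and the corresponding R-square prediction."""
    press = 0.0
    for i in range(len(x)):
        # Calibrate on all observations except number i, then predict observation i
        x_cal, y_cal = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_cal, y_cal)
        press += (y[i] - (intercept + slope * x[i])) ** 2
    ss_total = sum((yi - mean(y)) ** 2 for yi in y)  # corrected total SS
    return press, 1.0 - press / ss_total

# Hypothetical data with a nearly linear response
x = [-1.0, -1.0, 0.0, 1.0, 1.0]
y = [2.1, 1.9, 3.0, 4.2, 3.8]
press, r2_pred = press_and_r2_pred(x, y)
print(round(press, 3), round(r2_pred, 3))  # 0.331 0.919
```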

The analysis sequence is then to first look at the model p-value and R². A p-value below 5%
indicates that the model is statistically significant, and an R² close to 1 indicates a good
correlation between the
predicted response value and the actual response value. Consideration must then be given

to the value of the individual effects or model terms and their sign. Consideration should

also be given to the corresponding p-values. Each effect with a p-value < 5% is considered

significant; if the p-value is < 1% it is highly significant. A p-value between 5 and 10%

indicates a marginally significant effect. A p-value > 10% indicates that an effect is not

considered to be significant.
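The significance thresholds above can be written as a small helper. This is an illustrative Python sketch; the function name, the effect names and the p-values are hypothetical.

```python
def classify_effect(p_value):
    """Label an effect using the thresholds given in the text."""
    if p_value < 0.01:
        return "highly significant"
    if p_value < 0.05:
        return "significant"
    if p_value <= 0.10:
        return "marginally significant"
    return "not significant"

# Hypothetical p-values for three effects
for name, p in {"A": 0.003, "B": 0.04, "AB": 0.20}.items():
    print(name, classify_effect(p))
```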

ANOVA table

            Sum of Squares (SS)   Degrees of Freedom (DF)   Mean Square   F-ratio   p-value
Summary
  Error     12                    4                         3
Variables

In this example the model is valid (p-value=0.0001) and all effects are significant (p-values <

0.05). The most significant effect is B as it has the smallest p-value.

Note: A saturated design is a design in which the number of experimental runs
equals the number of model terms (including offset if necessary). This type of
design uses all the degrees of freedom to calculate the model terms, so the error SS is
zero and p-values will not be available.

Some assumptions underlying the ANOVA need to be verified before the test results can be

fully trusted. The first assumption is that the observations are adequately described by the

model. The model is defined by the included effects, and the best way to validate the model

is to apply it on left-out observations and see how well the predicted and measured

responses correspond with each other. A low PRESS value, or correspondingly an ‘R-square

prediction’ close to one, is an indication that the first assumption holds.

Also, the errors should be normally and independently distributed with mean zero and

constant but unknown variance. An important step of the analysis is therefore to plot the

residuals in different representations. In short, no obvious structures or patterns should be

found in the residuals when these assumptions are met.

The normality assumption is checked by looking at the residual histogram or normal

probability plot. The first should ideally look like the bell-shaped probability density of the

normal distribution centered at zero. Samples displaying strong deviation from the normal

distribution will be detected as deviating from a straight line in the normal probability plot of

residuals. This plot can therefore also be used as an outlier detection tool. Note that if the

number of observations is small, even perfectly random residuals will deviate somewhat

from the ideal bell-shaped density function. Luckily, the significance tests are robust to

moderate departures from normality.

The independence assumption can be verified by plotting the Y-residuals in experimental

order. The reason for randomizing the experimental order of runs is to prevent time-
dependent variations from influencing the estimation of effects. Correlation between

residuals, however, indicates that the runs have not been independently measured, which

may seriously affect the validity of the results. Also the Y-residuals vs. Y-predicted plot

should be studied to see whether any obvious patterns are found. Independent residuals will

appear as random variations in these plots.

Both the Y-residuals in experimental order and the Y-residuals vs. Y-predicted plots can also

be studied to check the constant variance assumption. Use these plots to see whether the

spread of observations is larger in one end compared to the other. A funnel or cone shape of

the experimental points indicates that some measurements are more precise than others, or

equivalently that some measurements have a larger influence on the model than others. If

the variance is strongly associated with the magnitude of the response, a variance-stabilizing

transform such as log(Y), Y^(1/2), or 1/Y might be considered (Tip: Histograms can be used to

test the influence on the response of different transforms). If the precision of runs improves

somewhat in the course of the experiment, a model based on randomized runs will most

likely be robust to these changes.
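The effect of a variance-stabilizing transform can be seen on a small made-up example in which the spread of replicate responses is proportional to the magnitude of the response. The Python sketch below uses hypothetical values and the log(Y) transform only.

```python
from math import log
from statistics import stdev

# Hypothetical replicate responses at a low and a high response level;
# the spread is proportional to the magnitude of the response
low = [9.0, 10.0, 11.0]
high = [90.0, 100.0, 110.0]

print(stdev(low), stdev(high))  # 1.0 10.0

# After a log transform the two groups have the same spread
log_low = [log(y) for y in low]
log_high = [log(y) for y in high]
print(round(stdev(log_low), 4), round(stdev(log_high), 4))
```

In a plot of Y-residuals vs. Y-predicted, the untransformed data would show a funnel shape, while the transformed data would not.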

Note that if there are very few residual degrees of freedom left after estimating all the

effects in the model, artificial structure in the residuals can be expected simply due to lack of

information in the data. In the extreme case that the residual degrees of freedom is zero, all

the residuals will be zero as well. If a little more than the minimum number of experiments

can be afforded, this will benefit the interpretation of results.

Analysis of effects using classical methods

An analysis of the effects is usually performed for screening and factor influence designs:

Plackett-Burman, Fractional Factorial, Full-Factorial designs. These designs allow estimation

of main effects and some of them also 2-3 variable interactions.

The classical DoE analysis method for studying effects is based on the ANOVA-table. Main

effects or interactions found to be important in the ANOVA table can be investigated further

in an effects visualization plot. This will reveal the direction and magnitude of the individual

effects. It is important to note that even if a main effect seems to be irrelevant, the factor

can still have a large impact on the model if it takes part in a significant interaction effect.

Other checks that can be applied after analyzing the ANOVA table include the detection of

curvature effects. These can be found by inspecting the main effects plot. If a nonlinear trend

is detected when checking the position of the center sample, one may consider a possible

curvature effect and include the square term of the effect in the model.

Main effect plot with curvature

When a variable is categorical, it is necessary to check which effects are significant and also
whether they are significantly different from each other. The multiple comparison test
provides this type of information. It is based on a comparison of the averages of the
response variable at the different levels. If the difference between two averages is greater
than the critical limit, the two levels are significantly different; if not, they have a similar
effect. If no level has an effect, all levels will have a statistically similar effect, and the
averages of the response variable at the different levels will not differ significantly.

In The Unscrambler®, there are three specific outputs for the multiple comparison test:

A table of distances, that gives the two-by-two distance between the levels.

A group table, that indicates the different grouping between the levels.

A plot displaying the levels in their group.

Response surface analysis using classical methods

A response surface analysis is very useful when the experimental objective is optimization.

This is often the case for Central Composite and Box-Behnken designs as well as Mixture

designs.

The classical DoE method of analysis for studying a response surface is to fit a quadratic (or

even a cubic) model by MLR. For mixture designs, a special type of MLR models called

Scheffé models are used, which do not include an offset parameter.
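
The missing offset follows from the mixture constraint. As a sketch, with notation chosen for this illustration only (q mixture components with proportions that sum to 1), an offset can always be absorbed into the linear terms:

```latex
% Assumed notation: q mixture components x_1, ..., x_q with \sum_i x_i = 1.
y = \beta_0 + \sum_{i=1}^{q} \beta_i x_i
  = \beta_0 \sum_{i=1}^{q} x_i + \sum_{i=1}^{q} \beta_i x_i
  = \sum_{i=1}^{q} \beta_i^{*} x_i ,
\qquad \beta_i^{*} = \beta_0 + \beta_i .
```

A separate offset would therefore be perfectly collinear with the linear mixture terms and could not be estimated on its own.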

The ANOVA table is still the main tool to assess the significance of effects. The significance of

individual effects as well as two-variable and three-variable interactions, square and cubic

terms must be assessed, depending on the terms included in the analysis.

The available models for BB designs are:

Main effects

Main effects + interactions (2-variable)

Main effects + interactions (2-variable) + quadratic terms

Main effects + interactions (2-variable) + quadratic + cubic terms

Main effects + interactions (2- and 3-variable) + quadratic terms

Main effects + interactions (2- and 3-variable) + quadratic + cubic terms

The available models for mixture designs include:

Second order (quadratic)

Special cubic. This is similar to main effects + interactions (2- and 3-variable).

However, as the model has a closure constraint, quadratic terms are partially

included.

Full cubic. This is similar to main effects + interactions (2- and 3-variable) + quadratic

terms.

The above lists correspond with pre-defined alternatives, and it is possible to remove terms

from any of these models in a hierarchical manner (except linear mixture terms, which

cannot be removed).
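
To make the pre-defined alternatives for non-mixture designs concrete, the sketch below lists the terms each model would contain for a given set of design variables. The function, its parameters and the term labels such as "A*B" are assumptions for illustration and do not reflect how The Unscrambler® names or stores model terms.

```python
from itertools import combinations

def model_terms(variables, max_interaction=2, quadratic=False, cubic=False):
    """List the terms of a polynomial model built from the given variables.

    max_interaction=1 gives main effects only; 2 adds two-variable
    interactions; 3 also adds three-variable interactions.
    """
    terms = list(variables)  # main effects
    for order in range(2, max_interaction + 1):
        terms += ["*".join(combo) for combo in combinations(variables, order)]
    if quadratic:
        terms += [f"{v}^2" for v in variables]
    if cubic:
        terms += [f"{v}^3" for v in variables]
    return terms

# Main effects + interactions (2-variable) + quadratic terms for three variables.
quadratic_model = model_terms(["A", "B", "C"], max_interaction=2, quadratic=True)
print(quadratic_model)
```

Counting the terms in such a list is a quick way to see how many parameters a model will estimate, and hence how many residual degrees of freedom a design of a given size leaves.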

The response surface can be used to find optimal design settings. For CCD and BB designs,

one fitted response is plotted over the entire area spanned by two design variables, with any

remaining variables held constant at their minimum level. Maxima, minima, saddle points or

stable regions can be detected by changing which variables to plot while varying the levels of

the remaining variables. For mixture designs, the plotted design region consists of three

mixture components forming a simplex/triangle.

More information on how to vary the condition can be found in the RS table section in the

plot interpretation page.

Response surface

Limitations of ANOVA

Analyses based on MLR/ANOVA are very useful for orthogonal designs or mixture designs

where one or two (non-related) responses have been measured accurately following the

experimental conditions. ANOVA has some important shortcomings, however:

The underlying MLR is based on the assumption that all variables can be measured

independently of all other variables in the model. This is always the case for

orthogonal designs such as the factorial designs. For some designs, such as

optimization designs including quadratic terms, mixture designs, D-optimal designs

or for any design where some experimental measurements are missing, some of the

model terms (effects) will become more or less correlated. If two correlated terms

both have an influence on the response, one of these will often (arbitrarily) come

out as significant at the expense of the other. While the ANOVA will automatically

handle standard designs such as mixture designs of simplex shape, a bilinear method

such as PLSR can take into account any number of correlated variables.

If several responses are modeled, the MLR will fit a model to each response

independently. If all responses are orthogonal, one can then assess the ANOVA table

for each response without taking the remaining responses into account. The

problem is that real data are seldom or never orthogonal. For any two sufficiently

correlated responses, it is sub-optimal to try to assess the effects on one

independently from the other, and trying to find the main conclusions from several

ANOVA tables together is difficult in itself. A bilinear method such as PLSR can take

into account any number of correlated responses, and any relationships between

responses and descriptors will be easily detected.

The reliability of the p-value estimates in the ANOVA table highly depends on the

residual degrees of freedom (DF) in the data after estimating all the parameters of

the model. If the error DF is low, the reliability of the estimated p-values is low as

well. This also limits the ability to check the assumptions of the model. When

several, correlated effects are estimated, the MLR consumes more DF than the true

number of underlying, independent effects. In contrast, with the bilinear methods

such as PLSR, the user estimates the optimal model rank based on the predictive

ability of the model.

In the ANOVA table, the predictive ability of the model is given by the ‘PRESS’ and

‘R-square prediction’ values. These are based on leverage-corrected residuals, which

in the case of MLR are identical to the residuals obtained from a leave-one-out (LOO)

cross-validation. This reflects the ability of the model to predict each measurement

based on models fitted using all samples except the one in question. If some

samples are replicated, the LOO procedure will be overly optimistic. If there are for

instance 3 center samples in total, these will be predicted based on models where

the 2 remaining center samples have been accounted for. The prediction error will

therefore be smaller than if all center samples were kept out in the same step. In

general, all replicated measurements of any experimental point should be kept out

in a single cross-validation segment to ensure conservative error estimates.

Non-controllable variables, i.e. variables that are believed to have an effect on the

responses but that are difficult to control at the required level of precision, are

currently not included in the ANOVA. In general, an attempt to include many of

these variables in an MLR model will have a high expense in terms of residual DF,

and the above considerations about correlation between terms would also have to

be taken into account. In PLSR any number of non-controllable variables can be

included, and they can optionally be downweighted in order to discover their

influence on the data without actually allowing them to influence the model. If e.g.

the run order was mixed up in the experiment, a passive descriptor giving the run

order or time-points of the individual measurements will reveal if any effects are

aliased with a time effect.
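
The statement above that leverage-corrected residuals equal leave-one-out residuals for MLR can be checked numerically. The sketch below uses a one-variable least squares line as a small stand-in for a full MLR model; the data and all function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def leverage_corrected_residuals(x, y):
    """Residuals e_i / (1 - h_i), where h_i is the leverage of sample i."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    corrected = []
    for xi, yi in zip(x, y):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx
        corrected.append((yi - (b0 + b1 * xi)) / (1.0 - h))
    return corrected

def loo_residuals(x, y):
    """Residuals obtained by refitting the line with each sample left out."""
    residuals = []
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        residuals.append(y[i] - (b0 + b1 * x[i]))
    return residuals

# Hypothetical data: one design variable at coded levels, one response.
x = [-1.0, -1.0, 0.0, 0.0, 1.0, 1.0]
y = [2.1, 1.9, 3.2, 3.0, 3.9, 4.2]

corrected = leverage_corrected_residuals(x, y)
left_out = loo_residuals(x, y)
press = sum(r ** 2 for r in left_out)
print(f"PRESS = {press:.4f}")
```

Both routes give the same residuals, so PRESS can be computed from a single fit. Note that this data set contains replicated points, which is exactly the situation in which a leave-one-out estimate tends to be optimistic.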

Analysis with PLS Regression

If some or all of the considerations above make analysis by ANOVA difficult, PLSR can always

be used as a powerful alternative. For a refresher on the theory of PLSR, see the chapter on Partial Least Squares.

Include all design variables, including any interactions and quadratic or cubic effects of interest, in

the descriptor (X) matrix. Any additional non-controllable variable, background

information about the samples, experimental details such as time of measurement, batch, or

change of instruments can be included here as well. Include all response variables. Weight

all variables with 1/SDev, or optionally downweight some of the descriptors.
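
A minimal sketch of the weighting step is given below. Each variable is multiplied by 1/SDev, and variables chosen for downweighting receive an additional small factor so that they barely influence the model. The function name, the column names and the size of the downweighting factor are assumptions for this example, not values taken from The Unscrambler®.

```python
from statistics import stdev

def weight_by_inverse_sdev(columns, downweight=None, factor=1e-4):
    """Scale each column by 1/SDev (sample standard deviation).

    Columns listed in `downweight` are multiplied by an extra small factor,
    which is an assumed value chosen only for illustration.
    """
    downweight = set(downweight or [])
    weighted = {}
    for name, values in columns.items():
        weight = 1.0 / stdev(values)
        if name in downweight:
            weight *= factor
        weighted[name] = [value * weight for value in values]
    return weighted

# Hypothetical descriptors: one design variable and the run order.
columns = {
    "Temperature": [20.0, 30.0, 40.0, 50.0],
    "RunOrder": [1.0, 2.0, 3.0, 4.0],
}
weighted = weight_by_inverse_sdev(columns, downweight=["RunOrder"])
print({name: round(stdev(values), 6) for name, values in weighted.items()})
```

After weighting, every active variable has unit standard deviation, so no variable dominates the model merely because of its measurement scale.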

Validate with cross-validation. The level of validation depends on the cross-validation

segments. If e.g. all experimental runs are replicated once, the replication error can be

assessed by leaving out a full set of experimental runs in two cross-validation segments.

Note that this will not tell you how well the model will predict new samples but rather it will

reflect the experimental error in the experiment. In order to estimate how well the model

predicts new measurements (when level combinations are allowed to vary within the design

region), keep out all replicates of each point once. This will be a more conservative and

correct estimate for the predictive power of the model.
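
The more conservative scheme, in which all replicates of a design point are kept out together, can be sketched as a simple grouping of run indices. The function name and the design below are hypothetical.

```python
from collections import defaultdict

def replicate_segments(design_rows):
    """Group run indices so that all replicates of a design point share one segment."""
    segments = defaultdict(list)
    for index, levels in enumerate(design_rows):
        segments[tuple(levels)].append(index)
    return list(segments.values())

# Hypothetical 2^2 design, every run replicated once, plus 3 center samples.
runs = [
    (-1, -1), (1, -1), (-1, 1), (1, 1),
    (-1, -1), (1, -1), (-1, 1), (1, 1),
    (0, 0), (0, 0), (0, 0),
]
segments = replicate_segments(runs)
print(segments)
```

Each inner list is one cross-validation segment, so the three center samples are always predicted by a model that has seen none of them.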

Include the uncertainty test to get an estimate of the significance of the effects. The

following are important tools to interpret the model and make conclusions:

Weighted Beta coefficients with their uncertainty limit

The weighted B-coefficients are used to determine which effects are the most

important and their direction of influence. Effects with high positive or negative

regression coefficients have a larger influence on the response in question.

The uncertainty test shows which effects are significantly non-zero, averaged over

responses. Coefficients with high absolute values and little variation across cross-

validation segments will point to significant effects.

Estimated p-values

The uncertainty test will estimate p-values for all effects and interactions included in

the PLSR model. These are based on the size and stability of the PLSR regression

coefficients in the cross-validation.

Explained variance

This plot will reveal the optimal number of components in the model, its fit (blue

line) and predictive ability (red line). The optimal number of components

corresponds with the number of independent phenomena in the data that exceeds

the noise level of the measurements.

Correlation loadings

The loadings or loading weights will reveal the main dependencies between

descriptors and responses in two dimensions. Often these dimensions will capture

the majority of the co-variation between descriptors and responses.

The correlation between the factors and each original variable is captured by the

distance from the origin in the correlation loadings plot. Even downweighted

variables are easily mapped in these plots.

Outlier detection

The sample outlier or influence plots can reveal erroneous measurements or typos

that should be mended or removed.

Predicted vs. Reference

Used to assess the model’s goodness of fit (blue points) and predictive ability (red

points) for each response variable, look for deviating runs and assess prediction

statistics.

When data are missing or experimental conditions have not been reached

In a real life situation it is not always possible to reach the target for the experimental

conditions or an experiment may not go as planned. In such cases one cannot apply the

classical DOE analysis methods. In these situations one can use a PLS fitting method. The

validation procedure of the PLS by jack-knifing will provide approximate p-values for the

B-coefficients; see the section Analysis with PLS Regression above.

More information on PLS regression can be found in the chapter on Partial Least Squares.

In the following section, a few tips that might come in handy when building a design or

analyzing designed data are presented.

Choosing which variables to investigate is the first step in designing experiments. That

problem is best tackled during a brainstorming session in which all people involved in the

project should participate, reducing the likelihood of overlooking an important aspect of the

investigation.

For a more extensive screening, variables that are known not to interact with other variables

can be left out. If those variables have a negligible linear effect, one can choose a constant

level for them (e.g. the least expensive). If those variables have a significant linear effect,

they should be fixed at the level most likely to give the desired effect on the response.

The previous rule also applies to optimization designs, if it is known that the variables in

question have no quadratic effect. If it is suspected that a variable can have a nonlinear

effect, it should be included in the optimization stage.

Once the variables to be investigated have been defined, appropriate ranges of variation

remain to be established.

For screening designs, one is generally interested in covering the largest possible region. On

the other hand, no information is available in the regions between the levels of the

experimental factors unless it is assumed that the response behaves smoothly enough as a

function of the design variables. Selecting the adequate levels is a trade-off between these

two aspects.

Thus a rule of thumb can be applied: Make the range large enough to give an effect and

small enough to be realistic. If it is suspected that two of the designed experimental runs will

give extreme, opposite results, perform those first. If the two results are indeed different

from each other, this means that enough variation has been generated. If they are too far

apart, and too much variation has been generated, the ranges should be decreased somewhat. If

they are too close, try a center sample, as there might just be a very strong curvature.

Since optimization designs are usually built after some kind of screening, one should already

know roughly in what area the optimum lies. So unless a CCD is being built as an extension

of a previous factorial design, one should try to select a smaller range of variation. This way

a quadratic model will be more likely to approximate the true response surface correctly.

Analysis of effects and response surface modeling are specially tailored for

orthogonally designed data sets and are ideally run when response values are available for all

the designed samples. The reason is that those methods need balanced data to be fully

applicable. As a consequence, one should exercise great care when collecting response

values for all experiments. If a measurement is lost, for instance due to some instrument

failure, it might be advisable to redo the experiment later to collect the missing values.

If, for some reason, some response values simply cannot be measured, one can still use

the standard multivariate methods available in The Unscrambler®: PCA on the responses,

and PCR or PLSR to relate response variation to the design variables.

This section focuses on more technical or “tricky” issues related to the computation of

constrained designs.

In a mixture situation where all concentrations vary from 0 to 100%, it was shown in the

mixture design section that the experimental region has the shape of a simplex. This shape

reflects the mixture constraint (sum of all concentrations = 100%).

Note: If some of the ingredients do not vary in concentration, these are left out

from the mixture equation such that the ‘total amount’ refers to the sum of the

remaining mixture components. For instance if one wishes to prepare a fruit punch

by blending varying amounts of watermelon, pineapple and orange juice, with a

fixed 10% of sugar, the mixture components sum to 90% of the juice blend but to

100% of the ‘total amount’ (mixture sum). This ensures that the three mixture

components will span a 2-dimensional simplex that can be modeled by a regular

mixture design.

Whenever the mixture components are further constrained, like in the example shown

below, the mixture region is usually not a simplex.

With a multilinear constraint, the mixture region is not a simplex

In the absence of multilinear constraints, the shape of the mixture region depends on the

relationship between the lower and upper bounds of the mixture components. It is a simplex

if for each mixture component, the upper bound + the sum of lower bounds for the

remaining components equals 100% (the total amount).

The figure below illustrates one case where the mixture region is a simplex and one case

where it is not.

Changing the upper bound of watermelon affects the shape of the mixture region

In the leftmost figure, the upper bound of watermelon is 100% - (17% + 17%) = 66%, and the

mixture region is a simplex. If the upper bound of watermelon is shifted to 55% as in the figure

to the right, this value will be smaller than 100% - (17% + 17%) and the mixture region is no

longer a simplex.

Note: When the mixture components only have lower bounds, the mixture region is

always a simplex.
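
The rule for deciding whether a bounded mixture region is a simplex can be sketched as below. For each component, the code computes the value it would reach if all other components sat at their lower bounds, and checks whether a stated upper bound lies below that value. Treating an upper bound at or above this implied value as harmless is an interpretation made here so that the rule agrees with the note on lower bounds; the function name and the 17% lower bound on watermelon are likewise assumptions.

```python
def is_simplex(lower, upper, total=100.0, tol=1e-9):
    """Check whether the bounded mixture region is a simplex.

    lower: dict mapping every component name to its lower bound.
    upper: dict mapping component names to upper bounds; a missing entry
           means that no upper bound was set for that component.
    """
    sum_lower = sum(lower.values())
    for name, low in lower.items():
        # Value reached when all other components sit at their lower bounds.
        implied_upper = total - (sum_lower - low)
        bound = upper.get(name)
        if bound is not None and bound < implied_upper - tol:
            return False  # the upper bound cuts a corner off the simplex
    return True

# Lower bounds of 17% on pineapple and orange as in the text above;
# the lower bound on watermelon is a hypothetical value.
lower = {"watermelon": 17.0, "pineapple": 17.0, "orange": 17.0}

print(is_simplex(lower, {"watermelon": 66.0}))
print(is_simplex(lower, {"watermelon": 55.0}))
print(is_simplex(lower, {}))
```

The three calls correspond to the left figure, the right figure, and the case with lower bounds only.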

In a mixture situation, it is important to notice that variations in the major constituents are

only marginally influenced by changes in the minor constituents. For instance, an ingredient

varying between 0.02% and 0.05% will not noticeably disturb the mixture total; thus it can

be considered to vary independently from the other constituents of the blend.

This means that ingredients that are represented in the mixture with a very small proportion

can in a way “escape” from the mixture constraint.

So whenever one of the minor constituents of a mixture plays an important role in the

product properties, one can investigate its effects by treating it as a process variable.

A special case occurs when all the ingredients of interest have small proportions. Consider

the following example: a water-based soft drink consists of about 98% of water, an artificial

sweetener, coloring agent, and plant extracts. Even if the sum of the “non-water”

ingredients varies from 0 to 3%, the impact on the proportion of water will be negligible.

It does not make any sense to treat such a situation as a true mixture; it is better addressed

by building a classical orthogonal design (full or fractional factorial, central composite, Box-

Behnken, depending on the design objectives).

There are various types of constraints on the levels of design variables. At least three

different situations can be considered.

Some combinations of variable levels are physically impossible. For example: a

mixture with a total of 110%, or a negative concentration.

Although the combinations are feasible, they are not relevant, or they will result in

difficult situations. Examples: some of the product properties cannot be measured,

or there may be discontinuities in the product properties.

Some of the combinations that are physically possible and would not lead to any

complications are not desired, for example the cost of the ingredients may be

prohibitive.

During the define stage of a new design, give careful attention to any constraint that may be

introduced. An unnecessary constraint will not help solve the problem faster; on the

contrary, it will make the design more complex, and may lead to more experiments or

poorer results.

Design constraints

The first two cases mentioned above can be referred to as design constraints

because they should be included in the design itself. They cannot be disregarded

because if they are, one will end up with missing values in some of the experiments,

or uninterpretable results.

Optimization constraints

The third case can be referred to as an optimization constraint. Whenever

considering introducing such a constraint, examine the impact it will have on the

form of the design. If it turns a perfectly symmetrical situation, which can be solved

with a classical design (factorial or classical mixture), into a complex problem

requiring a D-optimal algorithm, it may be better to disregard the constraint.

For the third situation, build a standard (orthogonal or mixture) design and take the

optimization constraint into account afterwards, at the result interpretation stage. For

instance, a constraint on one or multiple design or response variables can be added to a

response surface plot, and the optimum solution selected within the constrained region.

This also applies to upper bounds in mixture components. As mentioned in the section on Is

the Mixture Region a Simplex?, if all mixture components have only lower bounds, the

mixture region will automatically be a simplex. It is important to keep this in mind so as to

avoid imposing an upper bound on a constituent playing a similar role to the others. Expense

of material (thereby limiting its usage to a minimum) should not be considered an option for

an important study. This can be done at the interpretation stage, where the mixture that

gives the desired properties with the smallest amount of that constituent is chosen.

A new design is created by using the menu Insert – Create design…, which will open the

Design Experiment Wizard. This dialog contains a sequence of tabs, where the next tab

content often depends on the input in the previous tab.

General buttons

Start

Define Variables

Choose the Design

Design Details

Plackett-Burman designs

Fractional factorial designs

Full factorial designs

Full factorial designs without blocking

Full factorial designs with incomplete blocking

D-optimal designs

D-optimal designs including mixture constraints

Central Composite and Box-Behnken designs

Mixture designs

Simplex mixture designs

Non-simplex mixture designs and process+mixture designs

Additional Experiments

Randomization

Summary

Design Table

Cancel

At any time it is possible to exit the Design Experiment Wizard and go back to The

Unscrambler® main interface by pressing the Cancel button.

Finish

At the bottom of each tab, the Finish button is located. Initially this is disabled:

When sufficient information has been entered into the tab, the Finish button is made active:

By pressing this button all tasks in the design wizard are ended and the design is created in

The Unscrambler® navigator.

8.3.2 Start

The first tab in the sequence is divided into four sections:

Name

Goal

Description

History

Start tab

Name

By default the design will be named “MyDesign”. You may change this to the name you

would like the design to have in the project navigator later.

Goal

Select the most appropriate goal of the experiment. Based on this selection and the

number/type of design variables, the wizard will propose a suitable design.

Screening

In a screening experiment the goal is to isolate design variables that have a

significant main effect on the response variable(s).

When selecting this goal, the Design Experiment Wizard will favour either a Plackett-

Burman design or a low resolution Fractional Factorial design, provided the design

variables are not under any constraints. For mixtures an Axial design will be

suggested, and a low number of samples will be suggested if a D-optimal design is

selected.

Screening with interaction

In a screening with interaction experiment (often referred to as a factor influence

study) the goal is to assess both the main effects and the interactions of the design

variables on the response variable(s).

When selecting this goal, the Design Experiment Wizard will favour either a higher

resolution (IV or V) Fractional Factorial or a Full Factorial design, provided the

designed variables are not under any constraints. For mixtures a Simplex Lattice

design will be suggested, and the default terms and number of samples for a D-

optimal design will be adjusted accordingly.

Optimization

When choosing optimization as the goal, the design investigates main effects,

interactions and square terms on the response variable(s).

By choosing optimization as the goal, the Design Experiment Wizard will favour

either a Central Composite or Box-Behnken design, provided the designed variables

are not under any constraints. The suggested mixture design will be a Simplex

Centroid design, and the number of terms and samples for a D-optimal design will

be higher.

If there are category variables to be investigated, it is necessary to break down the design strategy into

two stages:

Find the optimum levels for category variables (include the possible non-

category variable that can interact with them).

Find the optimum for the non-category variables using the optimized values

for the category variables.

Description

Edit the blank section to store information on the design and specific details about the

experiments.

History

This part contains information on the history of the design such as the creator, the date of

creation and possible revisions. It is auto-generated by the Design Experiment Wizard.

In this tab, define the design space as well as other variables such as the response variables

and the non-controllable variables.

It is divided into two sections:

Variable table, which displays the defined variables.

Variable editor, which allows the addition of new variables or the deletion/editing of

previously defined variables.

Define variables tab

Variable table

This table contains information on all the variables to be included in the experiment. The

variables are ordered as follows:

Response variables

Non-controllable variables

The variables can be re-ordered within their category by using Ctrl+arrow up or down.

To edit a variable, highlight the corresponding row, modify the information in the variable

editor, and click OK.

To delete a variable, highlight the corresponding row and click the Delete button.

Variable editor

Click the Add button to add a new variable.

Specify the characteristics of the new variable as follows:

ID

The identity of the variable will be auto-generated. Design variables will have upper

case IDs (A-Z, except reserved letter I), response variables will have integer IDs, and

non-controllable variables will have lower case IDs (a-z, except i). Design variables

no. 26 and onwards are denoted A1, B1, etc.

Name

Enter a descriptive name in the Name field. If nothing is added here, the ID will be

used as name.

Type

Select the variable type from the following list using the radio buttons:

Response: Measured variables assumed to depend on the levels of the

design variables.

an effect on the design. They can be measured for the purpose of including

them in a regression model.

Constraints

Select the appropriate constraint setting for the variable (by default no constraints):

example , they should be defined as having linear

constraints.

Mixture: If at least three variables are part of a mixture, they may be defined

as having a mixture constraint. This implies that the sum of all mixture

components equals the Mixture Sum (100%).

Type of levels

The levels are either continuous or category:

Continuous means that it is possible and that it makes sense to rank the levels with

respect to each other. For example high level is larger than low level and

values in between the upper and lower level exist. Only two levels are

specified for continuous variables.

Use Category if the variable can change between 2 or more distinct levels or

groups, but where one group/level cannot be ranked on a numerical scale in

relation to the others. For instance the level ‘apple’ cannot be ranked as

higher/lower/better/worse than level ‘pear’. Similarly it is not possible to

calculate an average level between category groups. Two or more levels can

be defined for category variables (max. 20). If category variables of more

than two levels are included, the only available design will be the Full

Factorial (without blocking).

Note: Never define a numeric variable as category in order to enable more levels in

the design. These are interpreted differently and the analysis will be wrong. For

optimization designs that require more than two levels to fit a response surface,

additional levels will be added later based on the defined high and low levels.

For continuous variables: place the bounds of the design space with the low

and the high values in the Level range field. By default the levels are -1 and

1 (or 0 and 100 for mixture variables).

For category variables: the Levels section makes it possible to edit the

number and names of the levels. The default values are “Level 1” and “Level

2”.

Units

Specify any unit for the variable in question. For mixture variables the default unit is

’%’.

Mixture Sum

(Available for mixture variables only.) This is the sum of all mixture components in

the blend. The default value is 100 (%), but any positive value is allowed.

Different designs can be created depending on the:

Number of variables

Constraints on the variables

Goal of the experiment.

The Unscrambler® suggests the most appropriate design following some rules. Use the

radio-buttons to select a different design than the suggested one. Note that there are

limitations on which designs can be selected based on the number and type of design

variables; however, the goal of the experiment can be overridden by the user. The suggested

design remains displayed in bold.

When a full factorial design is selected, a check-box is used to enable (incomplete) blocking.

Select blocking in cases where groups of experimental runs have to be performed under

different settings. For instance if one batch of raw material is insufficient for the full

experiment, different batches will have to be used for different runs. Blocking ensures that

any potential batch effect will not be confounded with other important effects such as main

effects.

In Beginner mode, the design description is intuitive for those not experienced with DoE. In

Expert mode, select the design by choosing the actual design name.

It is possible to change the view by using the Beginner/Expert cursor.

Choose the design tab in Beginner mode

Information

The information box provides information on the selected design.

The Design Experiment Wizard will always suggest a design taking into account 3 pre-defined

criteria:

Goal

Number of variables

Constraints on the variables

In situations where no constraints are applied:

If the goal is Screening and # of variables ≥ 11, then a Plackett-Burman design is

selected.

If the goal is Screening and # of variables > 2 and < 7, then a fractional factorial

design of resolution III is selected.

If the goal is Screening with interaction and # of variables > 4, then a fractional

factorial design is selected. Make sure to select a resolution IV design or higher.

If the goal is Screening with interaction and # of variables ≤ 4, then a full factorial

design is selected.

If the goal is Optimization and # of variables ≤ 6, then a Central composite design is

selected.

If the goal is Optimization and # of variables > 6, this is not possible as too many

experiments are required to be practically feasible. The optimization should be

performed in steps.
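
The rules listed above for unconstrained variables can be summarized as a small lookup function. This is only a sketch of the stated rules: the function name and return strings are hypothetical, and combinations that the list does not mention (for example Screening with 7 to 10 variables) return None rather than a guessed design.

```python
def suggest_design(goal, n_variables):
    """Return the design suggested by the unconstrained rules, or None if no rule applies."""
    if goal == "Screening":
        if n_variables >= 11:
            return "Plackett-Burman"
        if 2 < n_variables < 7:
            return "Fractional factorial (resolution III)"
    elif goal == "Screening with interaction":
        if n_variables > 4:
            return "Fractional factorial (resolution IV or higher)"
        return "Full factorial"
    elif goal == "Optimization":
        if n_variables <= 6:
            return "Central composite"
        return "Not feasible in one step; optimize in stages"
    return None

for goal, n in [("Screening", 12), ("Screening", 5),
                ("Screening with interaction", 3), ("Optimization", 4)]:
    print(f"{goal}, {n} variables -> {suggest_design(goal, n)}")
```

The wizard only proposes a default; as described earlier, the user can still select another permitted design.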

In the situation where Mixture constraints are applied:

At least 3 mixture variables have to be defined. If the experiment contains mixture

variables only, a mixture design will be suggested by default. Depending on the

defined goal: Screening selects an axial design, Screening with interaction selects a

Simplex-Lattice design and Optimization selects a Simplex-centroid design.

If additional constraints on the mixture components are imposed, the design region

might be non-simplex. Also, if process (i.e. non-mixture) variables are included

together with the mixture components, regular mixture designs cannot be used. The

appropriate choice for these setups is a D-optimal design.

In the situation where linear constraints are applied, for non-simplex mixture

designs, or for designs containing both process and mixture variables:

The appropriate choice is a D-optimal design. Designs with less than two process

variables or at least three mixture variables are not allowed.

This tab allows the user to define the details of the various designs.

Plackett-Burman designs

When a Plackett-Burman design is selected, the Design Details tab displays a list of design

variables and a summary of the size of the design.

Design Details: Plackett-Burman

For a fractional factorial design there may be several possible resolutions corresponding to the

available confounding patterns.

To change the resolution and the confounding pattern, there are two options:

Use the drop-down box to select among the available number of design points

Change the resolution with the radio buttons.

The confounding patterns for the selected design are displayed in a separate box. They can be

visualized using the variable IDs in the form A + BC, or using the names of the variables. To

see the variable names, tick the box Show names.

After finishing a fractional factorial design, the resolution and confounding patterns will be

given in the Info box below the project navigator.
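
What a confounding pattern such as A + BC means can be illustrated with a small half-fraction design. In the sketch below, three variables are studied in four runs by generating the third column as the product of the first two. The generator C = AB is a common textbook choice and is assumed here; it is not necessarily the one used by The Unscrambler®.

```python
from itertools import product

# Half fraction of a 2^3 design: A and B form a full factorial, C is generated as C = A*B.
design = [(a, b, a * b) for a, b in product((-1, 1), repeat=2)]

def column(term):
    """Contrast column for a term such as 'A' or 'BC' (product of its factor columns)."""
    index = {"A": 0, "B": 1, "C": 2}
    values = []
    for run in design:
        value = 1
        for factor in term:
            value *= run[index[factor]]
        values.append(value)
    return values

# With the generator C = AB, each main effect shares its column with a two-variable interaction.
for main, interaction in [("A", "BC"), ("B", "AC"), ("C", "AB")]:
    same = column(main) == column(interaction)
    print(f"{main} + {interaction}: identical columns = {same}")
```

Because the columns are identical, the data cannot separate the main effect from the interaction it is confounded with; only their sum is estimated.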

Full factorial designs

The Design Details tab looks different depending on whether blocking was selected in the

previous tab.

Details about the design variables and number of experiments are shown.

Design Details: Full factorial without blocking

When blocking is selected, the available number of blocks (per design replicate) is selected

in the Number of blocks drop-down box.

Depending on the number of blocks, the Block Generators are displayed in a separate

frame. These are given capital letter IDs similar to the design variables, but they are dummy

variables used for blocking only. They are named Generator_1, Generator_2, etc.

Design Details: Full factorial with blocking

The blocking generators, as well as all their confounding interactions, will be treated

separately from the remaining effects in the subsequent ANOVA. This means that no results

will be returned for any effects confounded with blocks. The Patterns frame allows

identification of the effects confounded with blocks.

After finishing a full factorial design with incomplete blocking, the block confounding

patterns will be given in the Info box below the project navigator.

D-optimal designs

This design type corresponds to variables with constraints applied, such as:

Mixture variables with upper bounds that result in a non-simplex design region

A combination of mixture and process variables.

Set interactions and squares

Edit the design settings

Generate the design

Note:

Defining both mixture and process variables automatically leads to a D-optimal design.

Multilinear constraints cannot include category variables.

The Multilinear constraints frame includes a window where all the design constraints are

displayed as well as an Edit button. Clicking this button will open a dialog where multilinear

constraints can be added, edited or removed.

Editing multilinear constraints

To add a new constraint, use the button Click to add new constraint. A list of all design

variables that are defined to have either Linear or Mixture constraints will be available for

editing. Select a multiple of each constrained variable, or set a variable to 0 if it is not part of

the current constraint.

The operator to be used in the multilinear constraint is selected from the drop-down list:

The ‘<’ and ‘>’ operators are convenience functions only. When the candidate points are set up,

‘<=’ and ‘>=’ will be used instead, with the target value shifted down or up by 0.01

relative to the specified target. After specifying the target value, the new constraint will be

added to the Current constraints box.
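The handling of the strict operators can be illustrated with a minimal Python sketch. It assumes that '<' shifts the target down and '>' shifts it up, which is the natural reading of the text above; the function name is hypothetical and not part of the software.

```python
EPSILON = 0.01  # offset stated in the manual


def relax_strict_operator(operator, target, epsilon=EPSILON):
    """Map '<' and '>' to non-strict operators with a shifted target.

    Assumption: '<' lowers the target and '>' raises it by epsilon.
    Any other operator is returned unchanged.
    """
    if operator == "<":
        return "<=", target - epsilon
    if operator == ">":
        return ">=", target + epsilon
    return operator, target


# Example: the constraint A + B < 50 is evaluated as A + B <= 49.99
print(relax_strict_operator("<", 50.0))
```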

Repeat the above procedure for adding additional constraints, or edit an existing constraint

by clicking on the relevant box in Current constraints.

If mixture variables are included in the design, a constraint that they sum to 100% (as given

by the Mixture sum), is added automatically. This constraint cannot be edited or removed.

To delete a constraint select it in the Current constraints table and click on the Delete

button.

Click OK when all of the desired constraints have been added. The constraints will then be

tested to check that they are both active and consistent.

An inactive constraint is one that is superfluous because it does not restrict the design

region as specified by the variable levels. If for instance the ranges of A and B are both [0,

10], a constraint that A+B>=0 will be inactive.

Inactive constraint warning

An inconsistent constraint is one that cannot be satisfied within the specified variable

levels. A constraint that A+B>=30 for the above design will be inconsistent, because the sum

of A and B at their maximum levels is 20.

Inconsistent constraint warning

When all constraints are valid, click OK again to close the dialog. All specified constraints will then

be listed in the main dialog window.
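The two checks can be sketched in Python for a single linear constraint evaluated against the variable ranges only. This is an illustration under that assumption, not the software's own test: it ignores interactions with other constraints such as the mixture sum, and all names are hypothetical.

```python
def linear_range(coeffs, bounds):
    """Smallest and largest value of sum(c_i * x_i) over the variable ranges."""
    lo = sum(c * (b[0] if c >= 0 else b[1]) for c, b in zip(coeffs, bounds))
    hi = sum(c * (b[1] if c >= 0 else b[0]) for c, b in zip(coeffs, bounds))
    return lo, hi


def classify_constraint(coeffs, operator, target, bounds):
    """Return 'inactive', 'inconsistent' or 'active' for one constraint."""
    lo, hi = linear_range(coeffs, bounds)
    if operator == ">=":
        if target <= lo:
            return "inactive"      # always satisfied
        if target > hi:
            return "inconsistent"  # never satisfied
    elif operator == "<=":
        if target >= hi:
            return "inactive"
        if target < lo:
            return "inconsistent"
    else:
        raise ValueError("only '<=' and '>=' are handled in this sketch")
    return "active"


# A and B both range over [0, 10], as in the examples above
ranges = [(0, 10), (0, 10)]
print(classify_constraint([1, 1], ">=", 0, ranges))
print(classify_constraint([1, 1], ">=", 30, ranges))
```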

Any D-optimal design will include the main effects of all design variables as a minimum. In

addition some types of interaction and square terms are available depending on the type of

design variables included. These are:

Second order mixture: These are all 2-variable interaction terms between the

mixture components;

Process interactions: These are all 2-variable interaction terms between the process

variables;

Process squares: These are all quadratic terms of the process variables;

Mixture and process interactions: These are all interactions of the first order mixture

terms with any first or second order process term.

Check the appropriate boxes to pre-select any of these groups of terms. For designs with

process (non-mixture) variables only, use the following guidelines:

Screening: the model should include main effects only, with no interaction or

square terms.

Screening with interaction: the model should include the process interaction terms.

Optimization: the model should include the process interactions as well as the

process squares.

For mixture designs, include second order mixture terms if the goal is Screening with

interaction or Optimization.
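For process variables, the guidelines can be written as a small term generator. The sketch below is illustrative only; it assumes that plain Screening uses main effects alone, and the function name and term notation are hypothetical.

```python
from itertools import combinations


def process_model_terms(variables, goal):
    """List model terms (excluding the offset) for process-only designs."""
    terms = list(variables)  # main effects are always included
    if goal in ("Screening with interaction", "Optimization"):
        terms += [f"{a}*{b}" for a, b in combinations(variables, 2)]
    if goal == "Optimization":
        terms += [f"{v}^2" for v in variables]
    return terms


for goal in ("Screening", "Screening with interaction", "Optimization"):
    print(goal, process_model_terms(["A", "B", "C"], goal))
```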

For process/mixture designs it may be useful to optimize either the process or mixture

variables, while sampling for the main effects only of the remaining group. It is also possible

to include the second order terms for both types of variables while not including interactions

between the two. By assuming that there are no interactions between the process and

mixture variables, the number of experiments can be greatly reduced.

For a more specific selection of model terms click the Modify button. This will bring up a

dialog listing all higher order terms available for selection. The selected effects are listed in

the left box and the non-selected effects are listed in the right box. All main effect terms

(and offset if non-mixture design) are included by default and will not be listed. Any second

order mixture, process interaction and process square terms will be available for selection.

Any mixture and process interaction terms will be available for selection only if this box is

checked in the Model terms frame.

Dialog for selection of interaction and square terms

The Add and Remove buttons can be used to move highlighted terms from

one box to the other. The Add All and Remove All buttons do the same for all available

terms. The Add Int button adds all second order mixture as well as process interaction terms

to the model, whereas Add Square moves all process square terms to the Selected Effects

box. Click OK to keep the changes or Cancel to discard them. If some but not all of the terms

of a given order are selected, the corresponding check-box will be in a full state

(intermediate between checked and empty states).

Edit the design settings

The total number of design points is divided between a number of D-optimal design points,

space filling points and additional center points. The default sum of D-optimal and space

filling points is given by the number of model terms and the Goal of the experiment. An

offset is included in the model terms only if no mixture components are specified.

If Goal=Screening, three points more than the number of model terms are suggested,

and three additional center points.

If Goal=Screening with interaction, six points more than the number of model terms

are suggested, and four additional center points.

If Goal=Optimization, nine points more than the number of model terms are

suggested, and five additional center points.
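As a worked illustration of these defaults, the following Python sketch computes the suggested counts. The function and argument names are hypothetical; the only logic taken from the text is the per-goal offsets and the rule that the offset counts as a model term only when no mixture components are present.

```python
# (extra points beyond the number of model terms, additional center points)
DEFAULTS = {
    "Screening": (3, 3),
    "Screening with interaction": (6, 4),
    "Optimization": (9, 5),
}


def suggested_points(n_effects, goal, has_mixture):
    """Return (D-optimal + space filling points, additional center points).

    n_effects counts main effects and higher-order terms, without the offset.
    """
    n_terms = n_effects + (0 if has_mixture else 1)  # offset only without mixture
    extra, centers = DEFAULTS[goal]
    return n_terms + extra, centers


# Three process variables, main effects only: 3 effects + offset = 4 model terms
print(suggested_points(3, "Screening", has_mixture=False))
```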

The minimum number of design points is the same as the minimum number of D-optimal

points. These are limited by the number of model terms.

The maximum number of design points is the same as the maximum number of D-optimal

points, which is limited by the number of candidate points. As the candidate points are

generated only when the Generate button is pressed, a warning will be given if too many

design points are specified.

The minimum number of space filling and additional center points are zero. Note that the

candidate points list will contain one center point which might be added even though the

number of additional center points is set to zero.

Change the default number of center points in the Additional Experiments tab. Note that the

center sample coordinates will be calculated (or re-calculated) only when the Generate

button is pressed.

An Advanced Design Settings dialog opens when clicking the More button. Three settings can be

tuned in this window:

Number of initial tries: There is no guarantee that a single run of the D-optimal

algorithm will return the globally optimal set of design points. To avoid getting stuck

in local optima the algorithm can be run multiple times using different starting

conditions. Only the result with highest D-optimality is returned. The default number

of initial tries is 5, and this value can be changed between 1 and 1000.

Random points in the initial sets: To speed up the algorithm the starting set is not

completely random. Rather a smaller random set is used and points are added

sequentially to maximize the D-optimality of the starting design. The number of

random points in the initial sets can be tuned between the number of model

terms and the specified number of D-optimal points.

Max number of iterations: Here you can set an upper limit on the number of point

exchange operations that will be performed. The default limit is 100, the lower limit

is 10 and the upper limit is 1000 iterations. You may try to increase the number if

you experience convergence problems.

The Advanced Design Settings dialog

Generate the design

A sequence of operations is performed when the Generate button is pressed. First the

candidate point list is generated based on the constraints. The number of candidate points is

the effective upper limit on the number of design points, and a warning will be given if too

many design points have been specified. Also the center point coordinates are generated

and will be displayed in the Additional Experiments tab. Then the specified number of D-

optimal points is found by the exchange algorithm, before these points are supplemented

with the specified number of space filling points and finally with the number of additional

center points.
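The idea of a point exchange search with several initial tries can be sketched as follows. This is a naive illustration, not the algorithm implemented in the software: it starts each try from a fully random set (the sequential build-up of the starting design described above is omitted), it does not add space filling or center points, and all names are hypothetical.

```python
from itertools import product

import numpy as np


def log_det_information(X):
    """Return log det(X'X), or -inf if the design is singular."""
    sign, logdet = np.linalg.slogdet(X.T @ X)
    return logdet if sign > 0 else -np.inf


def d_optimal(candidates, n_points, n_tries=5, max_iter=100, seed=0):
    """Pick n_points rows of the candidate model matrix by point exchange."""
    rng = np.random.default_rng(seed)
    n_cand = candidates.shape[0]
    best_idx, best_score = None, -np.inf
    for _ in range(n_tries):
        idx = rng.choice(n_cand, size=n_points, replace=False)
        score = log_det_information(candidates[idx])
        for _ in range(max_iter):
            improved = False
            for pos in range(n_points):
                for cand in range(n_cand):
                    trial = idx.copy()
                    trial[pos] = cand
                    trial_score = log_det_information(candidates[trial])
                    if trial_score > score + 1e-12:  # accept strict improvements
                        idx, score, improved = trial, trial_score, True
            if not improved:
                break
        if best_idx is None or score > best_score:
            best_idx, best_score = idx, score  # keep the best try only
    return np.sort(best_idx), best_score


# Candidates: 3 x 3 grid for two process variables, model = offset + A + B + A*B
grid = np.array(list(product([-1.0, 0.0, 1.0], repeat=2)))
model = np.column_stack([np.ones(len(grid)), grid, grid[:, 0] * grid[:, 1]])
chosen, score = d_optimal(model, n_points=6)
print(grid[chosen], score)
```

Raising `n_tries` corresponds to the Number of initial tries setting, and `max_iter` to the Max number of iterations setting.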

The resulting design matrix is returned and the condition number is displayed in the Design

Experiment Wizard. The condition number is an indication of the orthogonality of a design:

the lower the condition number, the better.
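To illustrate the link between orthogonality and the condition number, the sketch below computes it with NumPy for a two-level full factorial design and for the same design with one run missing. It assumes the condition number is taken on the model matrix with an offset column and main effects; the software may define or scale the matrix differently.

```python
from itertools import product

import numpy as np


def model_matrix(levels):
    """Offset column followed by one column per design variable."""
    levels = np.asarray(levels, dtype=float)
    return np.column_stack([np.ones(len(levels)), levels])


full_factorial = list(product([-1, 1], repeat=2))

# Orthogonal design: all singular values are equal
cond_full = np.linalg.cond(model_matrix(full_factorial))

# Dropping one run breaks the orthogonality and raises the condition number
cond_missing = np.linalg.cond(model_matrix(full_factorial[:-1]))

print(cond_full, cond_missing)
```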

If three or more variables are defined to have Mixture constraints, a D-optimal design can be

generated. If there is a combination of process and mixture variables, a D-optimal design is

the only available option. Also if the upper level of one or more of the mixture components

is lower than the Mixture Sum, or if additional constraints are imposed on them, the design

region may have a non-simplex shape. D-optimal designs should be used for non-simplex

design regions as the standard mixture designs will not work.

Such a design is set up in a similar manner to a D-optimal design without mixture

components. The main difference is that a mixture constraint including all mixture

components is added automatically. These are required to sum to 100%.

Note: Currently classical ANOVA and response surface plots are not available for

non-simplex and process/mixture designs. In order to take advantage of these

features, you might consider if a regular mixture design could be an alternative.

Available optimization designs are:

Circumscribed Central Composite (CCC)

Inscribed Central Composite (ICC)

Faced Central Composite (FCC)

Box-Behnken (BB)

Use the radio buttons to select the most appropriate design. For more information on these

designs please refer to the Theory section.

Design Details: Central Composite and Box-Behnken designs

The star point distance is the distance from the origin to the axial points in normalized units

(i.e. given that upper and lower levels of factorial points are 1 and -1, respectively). The

default star point distance for CCC designs ensures rotatable designs. For ICC designs, the

inverted value is used, which also gives rotatable designs by default.

The star point distance for FCC designs is always 1 (non-rotatable).
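A short Python sketch of these defaults follows. It uses the textbook rotatability condition, where the star distance equals the fourth root of the number of factorial points, and assumes a full factorial core; the exact defaults used by the software are not stated here, and the function names are hypothetical.

```python
def rotatable_star_distance(n_factors):
    """Fourth root of the number of factorial points (full factorial core)."""
    return (2 ** n_factors) ** 0.25


def default_star_distance(design, n_factors):
    """Star point distance in normalized units for CCC, ICC and FCC designs."""
    alpha = rotatable_star_distance(n_factors)
    if design == "CCC":
        return alpha
    if design == "ICC":
        return 1.0 / alpha  # inverted value, factorial points scaled inwards
    if design == "FCC":
        return 1.0          # star points on the faces, non-rotatable
    raise ValueError(f"unknown design type: {design}")


for design in ("CCC", "ICC", "FCC"):
    print(design, default_star_distance(design, n_factors=2))
```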

The following table is given as a guide to find the most appropriate design:

Design        Number of levels   Uses points outside high and low levels   Accuracy of estimates

Inscribed     5                  No                                        … design space

Faced         3                  No                                        … for pure quadratic coefficients

Box-Behnken   3                  No                                        … uncertainty on the edge of the design area

Mixture designs

Whenever three or more variables with Mixture constraints are defined, and there are no

other variables in the design, the mixture design tab is accessible.

Design Details: Mixture design

Axial

In an axial design all points lie on axes that go from each vertex through the overall

centroid, ending up at the opposite surface or edge. At these end points the

component in question is zero and the remaining components have equal

concentrations.

The end points allow the study of blending processes where each component may

be reduced to zero concentration. These can optionally be left out from the

experiment by un-checking the Include end points box.

Simplex lattice

A simplex lattice design is the mixture equivalent of a full-factorial design where the

number of levels can be tuned. It can be used for both screening and optimization

purposes, according to the lattice degree of the design.

The Lattice degree equals the number of segments into which each edge is divided.

This corresponds to the maximal order that can be calculated for the subsequent

model. Edit the degree by changing the default value.
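The lattice construction can be sketched in a few lines of Python. Component levels are expressed as proportions of the Mixture Sum, and the function name is hypothetical.

```python
from itertools import product
from math import comb


def simplex_lattice(n_components, degree):
    """All blends whose proportions are multiples of 1/degree and sum to 1."""
    points = []
    for counts in product(range(degree + 1), repeat=n_components):
        if sum(counts) == degree:
            points.append(tuple(c / degree for c in counts))
    return points


points = simplex_lattice(n_components=3, degree=2)
# The number of points equals C(q + m - 1, m) for q components and degree m
print(len(points), comb(3 + 2 - 1, 2))
for p in points:
    print(p)
```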

Simplex centroid

A Simplex centroid design consists of extreme vertices, center points of all “sub-

simplexes”, and the overall centroid. A “sub-simplex” is a simplex defined by a

subset of the design variables.

Simplex centroid designs are well suited for optimization purposes. If Augmented

design is checked, axial check blends are added to the design. These are the same as

the Axial points in an Axial design.
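A minimal Python sketch of the simplex centroid construction, without the optional axial check blends: each non-empty subset of components contributes one point where those components are blended in equal proportions. The function name is hypothetical.

```python
from itertools import combinations


def simplex_centroid(n_components):
    """Vertices, centroids of all sub-simplexes and the overall centroid."""
    points = []
    for size in range(1, n_components + 1):
        for subset in combinations(range(n_components), size):
            points.append(
                tuple(1.0 / size if i in subset else 0.0 for i in range(n_components))
            )
    return points


# For q components the design has 2**q - 1 points
for p in simplex_centroid(3):
    print(p)
```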

Adjust mixture levels

There are certain limitations on which ranges are allowed for the components in a

mixture design:

1) The design levels must be consistent. This has to do with the mixture constraint

that all component concentrations must sum to the Mixture Sum (100%). If for

instance the lower level of one component is constrained to 20%, the upper level of

the remaining components cannot exceed 80% (see image below).

2) Any (consistent) design region has to be of simplex shape, i.e. it must form a

triangle for 3 components, a tetrahedron for 4 components, etc. Imposing upper

limit constraints on some of the mixture components will often lead to a non-

simplex design region.

A mixture design is automatically tested for condition 1) above, and if the design is

consistent it is tested for condition 2). If either test fails, a warning is given and an

Adjust mixture levels button is activated. Clicking this button will open an adjust

mixture levels dialog with several options.
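Condition 1) can be illustrated with a small Python sketch that derives the bounds implied by the mixture constraint and compares them with the specified ones. This is one plausible way to express the consistency test, not necessarily how the software implements it; the sketch does not test for simplex shape, and the function names are hypothetical.

```python
def implied_bounds(lower, upper, mixture_sum=100.0):
    """Tighten each bound to what the other components leave room for."""
    sum_lower, sum_upper = sum(lower), sum(upper)
    new_lower = [
        max(lo, mixture_sum - (sum_upper - up)) for lo, up in zip(lower, upper)
    ]
    new_upper = [
        min(up, mixture_sum - (sum_lower - lo)) for lo, up in zip(lower, upper)
    ]
    return new_lower, new_upper


def is_consistent(lower, upper, mixture_sum=100.0):
    """True if the specified bounds already agree with the implied bounds."""
    return implied_bounds(lower, upper, mixture_sum) == (list(lower), list(upper))


# Lower level of the first component set to 20%: the others cannot exceed 80%
print(implied_bounds([20, 0, 0], [100, 100, 100]))
print(is_consistent([20, 0, 0], [100, 100, 100]))
```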

Adjust Mixture Levels

Make levels consistent: Active whenever the test for consistency fails. The

bounds will be adjusted for consistency with the mixture constraint.

done to the constraints within the dialog. Reverts any modifications to those

originally defined.

Adjust with normalized levels: Active whenever any range differs from the

default [0, 100%]. All mixture bounds will be adjusted to their maximum

range as bounded by 0 and the Mixture Sum.

simplex. Applies any changes to the constraints, closes the dialog and

switches to the tab for D-optimal designs.

simplex. Applies a general adjustment to turn the experimental region into a

simplex shape. The pre-defined upper and lower levels may be exceeded.

On pressing OK, the upper and lower levels of the components are updated with the

new values. On pressing Cancel, the dialog is closed without applying any changes.

Only when the mixture design is both consistent and of simplex shape will the Finish

button be activated in the Design Experiment Wizard.

In the situations where imposed upper bounds or multilinear constraints lead to a

non-simplex design region, or where a combination of mixture and process variables is to be

analysed, a D-optimal design is required.

This tab allows one to manage the replication of the design as well as to add center points and

reference samples.

It includes four sections:

Design variables

Replicated samples

Center samples

Reference samples

Design variables

The design variables table provides a running summary of the design variables’ levels and

constraints.

Replicated samples

The number of replicated samples indicates the number of times the base design

experiments are run. Replication is used to measure the experimental error. Usually this is

done on center samples, however increasing the number of replicates in the design

improves the precision estimates of the design, by measuring replicates over the entire

design space. It is suggested to use at least two replicates of the design if the experimental

results are likely to vary significantly during the running of the experiment.

Note: Replicates (or replicated samples) are not the same as repeated

measurements. Replicates require a new experiment to be run using the same

settings for the design variables with a new experimental setup, while repeated

measurements are measures performed on the same samples numerous times in a

short time period.

Center samples

Center samples are used as a test for curvature and as a source for error variance

estimation. In the latter case, use at least two (preferably three or more) center samples as

this improves the precision of any estimates. By default the Design Experiment Wizard

suggests a number of center samples. These can be modified by using the spin box next to

Number of center samples.

The center samples are experimental runs at the mid-level of the design variable ranges

when all design variables are continuous. This corresponds to the average (mean) of the

different variables in the design.

If 1-4 variables in the design are categorical and at least one is continuous, center points can

still be defined, however these are only defined for the continuous variables in the design.

Then the specified number of center points will be given for all combinations of categorical

levels. This ensures that the resulting design remains orthogonal.

An example is shown below for the simplest 2 factor factorial design at two levels with one

category variable, and for the 3 factor case, each with one center point defined.

Center point configurations of two factorial designs with one category variable

For the above designs it can be seen that two center points are required when there is one

categorical variable in the design. The center point is located at the mid-point of the

remaining continuous variables. The diagram below shows the 3 factor design with two

categorical variables, in which case 2^2 = 4 center points are needed.

In the situations described above, one replicate of center points was defined. In this case,

pure error cannot be calculated as the center points are all unique. In order to calculate pure

error, replicates of these center points are required. For the 2 factor design, two replicates of

center points yield 4 center points in total. The replicated center points now provide 1 degree of

freedom per categorical level, i.e. 2 degrees of freedom in total for pure error.

For the 3 factor example with two categorical variables, two replicates of center points

results in 8 runs for center points alone. In this case, there are 4 unique center points,

therefore this situation provides 4 degrees of freedom for pure error. The more categorical

variables, the more center points are required, i.e. 2 center points minimum per categorical

variable. If replication is required, the number of center points can increase rapidly, to the

point where the number of center points exceeds the number of design points. In these

cases, an alternative is a combination of a design replicate and a single replicate of center points. This depends on the

goal of the design and the budget one has for the experimentation. Also, refer to the section

below on modification of center points which describes how to modify and delete specific

center points.
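The counting in the examples above can be reproduced with a short Python sketch. The function name is hypothetical; the arithmetic follows the text: one unique center point per combination of categorical levels, and one pure-error degree of freedom per additional replicate of each unique center point.

```python
from math import prod


def center_point_summary(category_levels, replicates):
    """Count center points and pure-error degrees of freedom.

    category_levels: number of levels of each categorical variable.
    replicates: number of replicates of each unique center point.
    """
    unique = prod(category_levels)  # 1 if there are no categorical variables
    return {
        "unique": unique,
        "total": unique * replicates,
        "pure_error_df": unique * (replicates - 1),
    }


# 2 factor design, one two-level categorical variable, two replicates
print(center_point_summary([2], replicates=2))
# 3 factor design, two two-level categorical variables, two replicates
print(center_point_summary([2, 2], replicates=2))
```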

Note: For designs with more than 1-2 categorical variables, it is usually both more

informative and more economical to replicate the entire experiment than to add

center points.

It is possible to modify center points by double-clicking on the sample, which will open a

dialog box for editing.

Modify center sample

In the example presented here, variable D is categorical. Its value can be changed using the

drop-down list. It is also possible to delete this specific center sample by clicking on the

Delete button. When the level values for the category variables have been specified, click

OK.

Reference samples

In the field reference samples, it is possible to define samples which are incorporated for

comparison. A typical reference sample is a target sample, a competitor’s sample or a

sample produced after changes to a given recipe. The values of the design variables are not

entered and are set as missing; they can be modified later in The Unscrambler®.

8.3.7 Randomization

This tab allows a user to randomize the order of the experiments.

Randomization tab

It is sometimes necessary to perform some experiments in sequence, for example if a

parameter is difficult to change (such as the temperature of a blast furnace). In such

cases, it may be more practical to make all experiments with the same temperature at the

same time. In the Randomization tab, it is possible to specify blocks of similar samples to be

kept together during randomization.
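Keeping similar samples together during randomization can be sketched as follows in Python. This is only an illustration of the principle (shuffle the order of the blocks, then shuffle the runs inside each block); the function name, the run representation and the variable names Temp and Time are hypothetical.

```python
import random


def randomize_in_blocks(runs, block_key, seed=None):
    """Randomize run order while keeping runs with equal block_key together."""
    rng = random.Random(seed)
    blocks = {}
    for run in runs:
        blocks.setdefault(run[block_key], []).append(run)
    block_order = list(blocks)
    rng.shuffle(block_order)          # random order of the blocks
    ordered = []
    for key in block_order:
        block = blocks[key][:]
        rng.shuffle(block)            # random order inside each block
        ordered.extend(block)
    return ordered


# Temp is hard to change, so all runs at one temperature are kept together
runs = [{"Temp": t, "Time": s} for t in (800, 900) for s in (10, 20, 30)]
for run in randomize_in_blocks(runs, "Temp", seed=1):
    print(run)
```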

Designed variables to randomize

This table displays the randomization pattern of the designed variables. It is possible

to edit the randomization pattern of the variables by clicking on the Detailed

randomization button.

By clicking on this button a new window opens. The selected variables (including

center points) will be randomized. When the desired pattern has been achieved,

click OK.

Define randomization

Randomized experiments

This table shows the sequence of experiments to run.

Re-randomize

If for any reason it is necessary to change the order of the samples, select the Re-

randomize button, and a new sequence of experiments will be generated.

8.3.8 Summary

This tab gives a summary of the complete design set-up, as well as the ability to calculate the

power of the design to detect small changes in the individual responses. A small change

means that the effect should be significant at a 5% level.

Summary tab

Enter the following parameters into the respective fields:

Delta

the required difference to detect in the response for successful experimentation

Std. dev. (also called Sigma)

the estimated standard deviation on the reference method used to obtain the

response

The ratio Signal to Noise (S/N) is provided as an indication.

Click the Recalculate power button. The power for each response variable will be

displayed in the Power field.

The power of the design is its estimated ability to detect small but real changes in the

response values. Traditionally a power of 80% is regarded to be good, which would imply a

20% probability of overlooking small effects. If the power of a design is low, one risks

performing expensive and time-consuming experiments that will not provide any answers.

Increase the power by adding additional experiments to the design, e.g. perform an

additional replication.
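The manual does not describe how the power is computed. As a rough illustration of how Delta, Std. dev. and the number of experiments interact, the Python sketch below uses a normal approximation for one effect in a balanced two-level design at the 5% significance level. The formula and the function names are assumptions, and the values reported by the software may differ.

```python
from statistics import NormalDist


def signal_to_noise(delta, sigma):
    """Ratio between the difference to detect and the standard deviation."""
    return delta / sigma


def approximate_power(delta, sigma, n_runs, alpha=0.05):
    """Normal-approximation power for one effect in a balanced two-level design."""
    normal = NormalDist()
    std_error = 2.0 * sigma / n_runs ** 0.5  # standard error of an effect estimate
    z_crit = normal.inv_cdf(1.0 - alpha / 2.0)
    shift = delta / std_error
    # Probability that the estimated effect falls outside +/- z_crit
    return (1.0 - normal.cdf(z_crit - shift)) + normal.cdf(-z_crit - shift)


print(signal_to_noise(2.0, 1.0))
for n in (8, 16):
    print(n, approximate_power(delta=2.0, sigma=1.0, n_runs=n))
```

Doubling the number of runs, for instance through an additional replication, increases the approximate power for the same Delta and Std. dev.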

This tab shows the list of experiments to perform.

Design table tab

Randomized or Standard sequence

Randomized sequence is the sequence defined in the Randomization section, which

corresponds to the run order. Standard sequence is an ordered sequence

convenient for display.

Display order

Actual values (or Actuals) are the levels as specified in the Define Variables tab,

in the original units of the design variables.

Design levels are the levels in normalized units, i.e. [-1, 1] for factorial (process)

variables and [0, 1] for mixture components. Also called Level indices or Reals.
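The relation between the two display modes can be written as a simple linear rescaling. The Python sketch below assumes that mixture levels are scaled between the lower and upper level of each component; the function names are hypothetical.

```python
def to_design_level(actual, low, high, mixture=False):
    """Convert an actual value to normalized units."""
    fraction = (actual - low) / (high - low)
    return fraction if mixture else 2.0 * fraction - 1.0


def to_actual(level, low, high, mixture=False):
    """Convert a normalized level back to the original units."""
    fraction = level if mixture else (level + 1.0) / 2.0
    return low + fraction * (high - low)


# Process variable ranging from 20 to 80: the mid-level 50 maps to 0
print(to_design_level(50, 20, 80))
# Mixture component ranging from 0 to 100%
print(to_design_level(25, 0, 100, mixture=True))
```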

Display values

After selecting the Finish button, the design matrices will be generated in The Unscrambler®

project navigator.

To modify or extend a design, use the menu option Tools - Modify/Extend Design….

Modify/Extend Design menu

A dialog box will appear where one can select the appropriate design matrix to modify in the

field Choose design.

Modify/Extend Design dialog box

The Design Experiment Wizard will open. The History field of the Start tab will be modified,

and all the variables will be loaded with their previous settings.

Modified History field

Give the new design a unique name, modify any settings and click Finish when satisfied. This

will create a new design table in the project navigator.

All response values will be set to zero in the modified design.

Check the Insert – Create design… section to get more information about the design wizard.

8.4.1 To remember

When extending a design where some experiments have already been run, it is

recommended to add some extra center samples so that the analysis can check for bias

over time.

Refer to the theory-section Extending a design for more details.

After clicking on Finish in the Create Design dialog, the design table is displayed in The

Unscrambler® project navigator. The design table contains all design variables (with

interactions), followed by the response variables and non-controllable variables (when

applicable).

The Design table is divided into sets (column ranges) depending on the model complexity:

Designs not containing mixture variables contain some or all of the sets:

Design

Response

Non-controllable

Main effects

Main effects + Interactions (2-var)

Main effects + Interactions (2- and 3-var)

Main effects + Interactions (2-var) + quadratic

Main effects + Interactions (2-var) + quadratic + cubic

Main effects + Interactions (2 and 3-var) + quadratic + cubic

Designs containing mixture variables contain some or all of the sets:

Design

Response

Non-controllable

First order (Linear)

Second order (Quadratic)

Special cubic

Full cubic

Main effects + Responses

The tables are also divided into three to five sample sets (row ranges):

All samples

All design samples

Center samples

Design and center samples

Reference samples

There are two ways in which to order the samples:

Standard: This is the accepted standard order for design variables. In particular,

factorial designs adopt the standard (1), a, b, ab, … notation.

Randomized: This order is the one generated after randomization, it provides the

experimental sequence the runs should be performed in.
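The standard order notation for two-level factorial designs can be generated with a few lines of Python. This sketch only produces the run labels; the function name is hypothetical.

```python
def standard_order_labels(n_factors):
    """Run labels in standard order: (1), a, b, ab, c, ac, bc, abc, ..."""
    letters = "abcdefghijklmnopqrstuvwxyz"[:n_factors]
    labels = []
    for run in range(2 ** n_factors):
        # A letter is present when the corresponding factor is at its high level
        label = "".join(
            letter for bit, letter in enumerate(letters) if (run >> bit) & 1
        )
        labels.append(label or "(1)")  # all factors low
    return labels


print(standard_order_labels(3))
```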

The order can be changed by clicking on one of the two columns, selecting Edit - Sort

and then choosing Ascending or Descending.

Sort menu

There are two ways to view the design levels in the table: either in actual values or in level

indices.

Change between these views by ticking or unticking the Level indices option available in the

View menu.

Go to Tasks - Analyze - Analyze Design Matrix… to open the Design Analysis dialog. The first

tab is the Model Inputs tab where the input data are specified along with which interactions

or higher order terms to include in the model. The Method tab suggests alternative analysis

strategies based on the input data and allows you to select the preferred method.

Model Inputs

Select the Predictors and Responses to analyze. Only data tables created using the

Design Experiment Wizard (Insert–Create Design…) are accepted as input.

Usually the predefined column sets Design and Response should be selected in the

Cols box of the Predictors and Responses, respectively. Select All rows. Note that

selecting fewer or more data may alter desirable properties of the design.

Select the Effects to include in the model. More or fewer terms can be included. Try a

simpler model first.

In subsequent analysis, terms can be removed or added to the model. Select the

relevant effects and use the Move button to add/remove them from the analysis.

For factorial designs with no category variables and at least one centre point, there

is an option to calculate Curvature. A Curvature term can be found in the Not

Estimated box and is calculated by moving it to the Estimated box. Curvature

removes one degree of freedom from Lack of Fit calculations and is used to

determine whether the model is linear or not. Note that even if the curvature term

is added in the ANOVA, the final model (i.e. regression coefficients and predicted

responses) does not include the curvature term. Because the residual degrees of

freedom are reduced when testing for curvature, avoid using it indiscriminately.

Note: The test for curvature will also remove some variation from the error term. In

some cases this may result in a low p-value for the model even though the model

itself does not include the curvature term. Therefore you should always verify your

final model by recalculating without curvature.

Method

Most designs may be analyzed using Classical DoE Analysis, which performs individual

ANOVAs for each response. If the design is heavily constrained or if multiple correlated

responses should be analyzed together, Partial Least Squares Regression may be a better

option. Other changes to a design such as modified factor levels or missing values might also

favour PLSR over ANOVA in some cases. Please refer to the theory section for a discussion

on the limitations of ANOVA.

The Method tab displays some useful properties of the design to make it easier to decide on

the best analysis method.

Design Type: This is the type of the design.

Modified: If at least one of the design level values has been modified in the past,

this value will be set to Yes. Depending on the magnitude of the change, this may

have a high or low impact on the orthogonality properties of the design.

Kept-out samples: While all samples may be very important in a design, especially

non-replicated ones, things may happen during the experiment or data collection

that lead to missing response values for some samples. This may severely reduce the

quality of the design. The number of kept-out or missing samples in the data table is

given here.

interpret them under the assumption that they are independent is a difficult (and

risky) endeavor. This value is the highest of all pairwise, squared correlations

between responses. If the value is higher than 0.5 PLSR is suggested by default.

Condition Number: Constrained (D-optimal) designs and designs with modified

levels or missing runs will be non-orthogonal. As valid interpretation of an ANOVA

model relies on independent design parameters, highly non-orthogonal designs

should be analyzed using Partial Least Squares Regression rather than Classical DoE.

An orthogonal design has condition number of 1, and for any non-mixture design

with condition number larger than 100 Partial Least Squares Regression will be

selected by default. If the value is larger than 1000 Classical DoE will be disabled.

D-efficiency: This property of the design is closely related to the D-optimality

criterion. A factorial design without center points has a D-efficiency of 100%. This

value decreases if additional points are added that do not contribute to making the

design more orthogonal, or if constraints are added to the design. Useful for

assessing the quality of D-optimal designs.

Note: Modify design levels with caution, as such changes to the design matrix

cannot currently be undone (change back manually or use Tools–Modify/Extend

design if needed).

Note: Mixture designs are by definition non-orthogonal and can have both large

condition numbers and small D-efficiencies. These designs can still be analyzed using

Classical DoE.

Select the preferred analysis method using the radio buttons and click OK to perform

analysis.
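
As an illustration only, the default rules described for the Method tab can be sketched in Python. The function names and sample data are hypothetical and are not part of The Unscrambler X. The sketch assumes that both condition-number limits apply to non-mixture designs only, and it takes the condition number as an input rather than computing it.

```python
from itertools import combinations
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation between two equally long response columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / sqrt(s_aa * s_bb)


def max_squared_correlation(responses):
    """Highest squared correlation over all pairs of response columns."""
    if len(responses) < 2:
        return 0.0
    return max(pearson_r(a, b) ** 2 for a, b in combinations(responses, 2))


def suggest_method(max_r2, condition_number, is_mixture=False):
    """Return (default method, whether Classical DoE stays selectable)."""
    # Assumption: mixture designs are exempt from both condition-number limits.
    classical_enabled = is_mixture or condition_number <= 1000
    if max_r2 > 0.5 or (not is_mixture and condition_number > 100):
        default = "PLSR"
    else:
        default = "Classical DoE"
    return default, classical_enabled


# Hypothetical responses measured on four runs (uncorrelated by construction).
yield_pct = [1.0, 2.0, 3.0, 4.0]
purity = [1.0, -1.0, -1.0, 1.0]
r2_max = max_squared_correlation([yield_pct, purity])
choice = suggest_method(r2_max, condition_number=1.0)
print(r2_max, choice)
```

With uncorrelated responses and an orthogonal design, the sketch keeps Classical DoE as the default; raising either the correlation or the condition number above its limit switches the default to PLSR.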

Analysis with ANOVA

A message will appear asking whether you want to display the model plots. Click on Yes or

No and the model will be added to the project navigator named “DOE Analysis”. Each model

contains the nodes Raw data and Results, and, if you decided to display it, Plots. There will

always be an option to right click on the model node in order to show or hide plots.

DOE Analysis results from a classical analysis in project navigator

For further information on how to interpret the plots that are generated, please refer to the

section on interpreting DoE plots.

Depending on the method selected to analyze the design data, different results will be

plotted. Select one of the following methods to see the appropriate plot interpretation.

Accessing plots

Available plots for Classical DoE Analysis (Scheffe and MLR)

ANOVA overview

ANOVA table

Summary

Variables

Model check

Lack of fit

Diagnostics

Effect visualization

Effect summary

Effect and B-coefficient overview

Regression coefficients and their confidence interval

B-coefficient table

Effect visualization

Effect summary

Residuals overview

Normal probability of Y-residuals

Y-residuals vs. Y-predicted

Histogram of Y-residuals

Y-residuals in experimental order

ANOVA table

Diagnostics

B-coefficients

Regression coefficients and their confidence interval

B-coefficient table

Effect visualization

Effect visualization

Effect summary

Cube plot

Error table

Predicted vs. Reference

Response surface

Response surface plot

Response surface table

Multiple comparison

Multiple comparison plot

Group table

Distance table

B-coefficient table

Available plots for Partial Least Squares Regression (DoE PLS)

Overview

Explained Variance

PLSR ANOVA p-values

Predicted vs. Reference

PLS-ANOVA Summary table

Normal probability plot

X- and Y-Loadings

On finishing the calculation of a DoE model, the user is asked whether to view the plots or

not. Answering Yes will generate a sub-branch of the model called Plots in the project

navigator. This branch contains a number of readily accessible plot nodes.

Project navigator plot nodes

The availability of these plots is toggled by the options ‘Show plots’/‘Hide plots’, accessible

from right clicking on the DoE model in the project navigator. This will add or remove the

Plots branch to the model. The plots are also available from the toolbar or from right-

clicking in any of the plot windows.

8.8.2 Available plots for Classical DoE Analysis (Scheffe and MLR)

ANOVA overview

The ANOVA overview plot node contains four plots. The plots described below are given for

all Plackett-Burman, Fractional Factorial and Full Factorial designs (unless otherwise noted).

For Optimization and Mixture designs, the Effect visualization and Effect summary plots are

replaced with a Response surface plot and table.

ANOVA table

The ANOVA table contains all sources of variation included in the model.

Sums of squares (SS)

This is an unscaled measure of the dispersion or variability of the data table. It is the

sum of squares of the distance from the samples to the average point. It increases

with the number of samples.

All calculations are based on coded levels, i.e. the variable ranges are scaled

between [-1, 1] for process variables and between [0, 1] for mixture variables.

Degrees of freedom (DF)

The number of degrees of freedom of a phenomenon is the number of independent

ways this phenomenon can be varied. In the model there is one DF for each

independent parameter estimated.

Mean squares (MS)

This is the ratio of SS over the degrees of freedom. It estimates the variance, or

spread, of the observations of the different sources in a comparable unit.

F-ratio

This is the ratio between explained variance (associated to a given predictor) and

residual variance. F-ratios are not immediately interpretable, since their significance

depends on the number of degrees of freedom. However, they can be used as a

visual diagnostic: effects with high F-ratios are more likely to be significant than

effects with small F-ratios.

p-value

A small value (for instance less than 0.05 or 0.01) indicates that the effect is

significantly different from zero, i.e. that there is little chance that the observed

effect is due to mere random variation.
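
The columns of the ANOVA table can be illustrated with a small worked example. The sketch below is not the software's implementation. It assumes a balanced, orthogonal two-level design in coded levels (-1/+1), for which each coefficient is simply the sum of x*y divided by the number of runs; the data and the function name are hypothetical. The p-value is omitted because it requires the F distribution with (DF effect, DF residual) degrees of freedom, which the Python standard library does not provide.

```python
def anova_orthogonal(columns, y):
    """SS, DF, MS and F-ratio per effect for an orthogonal -1/+1 coded design."""
    n = len(y)
    mean_y = sum(y) / n
    ss_total = sum((v - mean_y) ** 2 for v in y)  # corrected total SS
    rows = {}
    ss_model = 0.0
    for name, x in columns.items():
        b = sum(xi * yi for xi, yi in zip(x, y)) / n  # valid only if orthogonal
        ss = n * b * b
        rows[name] = {"b": b, "SS": ss, "DF": 1}
        ss_model += ss
    df_res = n - 1 - len(columns)  # one DF is used by the intercept
    ss_res = ss_total - ss_model
    ms_res = ss_res / df_res
    for row in rows.values():
        row["MS"] = row["SS"] / row["DF"]
        row["F"] = row["MS"] / ms_res
    rows["Residual"] = {"SS": ss_res, "DF": df_res, "MS": ms_res}
    rows["Total"] = {"SS": ss_total, "DF": n - 1}
    return rows


# Hypothetical replicated 2x2 full factorial in coded levels.
design = {
    "x1": [-1, 1, -1, 1, -1, 1, -1, 1],
    "x2": [-1, -1, 1, 1, -1, -1, 1, 1],
}
response = [10.0, 14.0, 11.0, 15.0, 12.0, 16.0, 9.0, 17.0]
anova = anova_orthogonal(design, response)
for source, row in anova.items():
    print(source, row)
```

In this example x1 carries almost all of the explained variation (SS = 50 out of a total of 60), while x2 contributes nothing, so its F-ratio is zero.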

There are several types of sources of variations grouped in different parts of the table:

Summary

Variables

Model check

Lack of fit

In addition, some Quality values are found at the end of the table, including:

Method used

This refers to the type of samples used to calculate the error values. It can take three

values:

Design: the design is not saturated so the error values can be calculated on

the residual degree of freedom from the model.

experiments: the center samples.

experiments: the reference samples.

R-square

Coefficient of multiple determination. A value close to 1 indicates a good fit, while a

value close to 0 indicates a poor fit.

Adjusted R-square

Coefficient of multiple determination adjusted for the DF. While R-square will

increase towards 1 as more parameters (effects) are added to the model, this

statistic will favour additional terms only if the increase in SS is sufficiently high.

design experiments

R-square prediction

R-square on the predicted values, which is the most conservative of the three R-squares

and says something about the predictive ability of the model.

S

Estimate for standard deviation (Root Mean Squared Error of Calibration; RMSEC)

Mean

Average value of the reference Y values on samples taking part in the analysis.

C.V. in %

The coefficient of variation is a normalized measure of the dispersion of a probability

distribution: the standard deviation expressed as a percentage of the mean.

PRESS

PRediction Error Sum of Squares is an estimate of the dispersion of leverage

corrected residuals. It accounts for the predictive ability of the model in the sense

that each residual value is estimated as if the sample was left out from the model

calibration. The magnitude of this statistic can be compared with the corrected total

SS (the smaller the better).
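
The quality values can be reproduced from the residuals and leverages with the usual textbook formulas. The sketch below is a minimal illustration under stated assumptions, not the software's own code: S is taken as the square root of the residual mean square, PRESS uses leverage-corrected residuals e/(1-h), and the data are hypothetical (eight runs, a model with three parameters, leverage 3/8 for every run).

```python
from math import sqrt


def quality_values(y, y_hat, leverage, n_params):
    """Textbook versions of the quality values listed under the ANOVA table."""
    n = len(y)
    mean_y = sum(y) / n
    residuals = [obs - fit for obs, fit in zip(y, y_hat)]
    ss_total = sum((obs - mean_y) ** 2 for obs in y)  # corrected total SS
    ss_res = sum(e ** 2 for e in residuals)
    df_res = n - n_params
    s = sqrt(ss_res / df_res)  # assumption: S = sqrt(residual mean square)
    # Leverage-corrected residuals mimic leaving each sample out in turn.
    press = sum((e / (1.0 - h)) ** 2 for e, h in zip(residuals, leverage))
    return {
        "R-square": 1.0 - ss_res / ss_total,
        "Adjusted R-square": 1.0 - (ss_res / df_res) / (ss_total / (n - 1)),
        "R-square prediction": 1.0 - press / ss_total,
        "S": s,
        "Mean": mean_y,
        "C.V. in %": 100.0 * s / mean_y,
        "PRESS": press,
    }


# Hypothetical measured and fitted responses for eight runs.
measured = [10.0, 14.0, 11.0, 15.0, 12.0, 16.0, 9.0, 17.0]
fitted = [10.5, 15.5, 10.5, 15.5, 10.5, 15.5, 10.5, 15.5]
quality = quality_values(measured, fitted, leverage=[0.375] * 8, n_params=3)
for name, value in quality.items():
    print(f"{name}: {value:.4f}")
```

The three R-squares come out in the expected order: R-square prediction is lower than adjusted R-square, which is lower than R-square.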

ANOVA table

Summary

The first part of the ANOVA table tests the significance of the model when all specified

effects are included. If the model p-value is small (e.g. smaller than 0.05), it means that the

model explains more of the variation in the response variable than could be expected from

random phenomena. In other words, the model is significant at the 5% level. The smaller the

p-value, the more significant (and useful) the model is.

Variables

The second part of the ANOVA table deals with each individual effect (main effects,

optionally also interactions and square terms). If the p-value for an effect is small, it explains

more of the variations of the response variable than could be expected from random

phenomena. The effect is significant at the 5% level if the p-value is smaller than 0.05. The

smaller the p-value, the more significant the effect is.

There are different ways to calculate sums of squares (SS); however, for orthogonal designs

such as factorial designs they all give the same results. For non-orthogonal designs such as

D-optimal and mixture designs, this section tests the so-called Marginal (Type III) SS. This

corrects for the contribution of all other terms in the model irrespective of order; however,

the individual contributions may not sum to the Model SS.

Model check

The model check tests whether it is beneficial to add terms of successively higher order to

the model. For orthogonal designs such as factorial designs, the individual contributions of

the terms of a particular order sum to the model check SS. If the p-value for a group of

effects is large it means that these terms do not contribute much to the model and that a

simpler model should be considered.

For D-optimal and mixture designs, the so-called sequential (Type I) SS is given in the Model

check section. Also higher order terms than the ones actually included in the model are

given here when relevant. This section will indicate the optimal complexity of the model

when adding terms in a hierarchical manner (i.e. lower order terms added before higher

order terms). If all tested terms are included in the model, the sum of contributions will

equal the Model SS.

Lack of fit

The lack of fit part tests whether the error in response prediction is mostly due to

experimental variability or to an inadequate shape of the model. If the p-value for lack of fit

is smaller than 0.05, it means that the model does not describe the true shape of the

response surface. In such cases, it may be helpful to apply a transformation to the response

variable.

Note:

For screening designs, the model can be saturated. In such cases, one

cannot use the design samples for significance testing; the center samples

or reference samples are used.

If the design has design variables with more than two levels, use the

Multiple Comparison plot and B-coefficient table in order to see which

levels of a given variable differ significantly from each other.

Lack of fit can only be tested if the replicated center samples do not all

have the same response values (which may sometimes happen by

accident).

Diagnostics

This plot presents several values for assessing the quality of the fit of the model to each

individual response.

Standard Order

The standard order is the non-randomized order from the experiment generator

Actual Value

These are the measured response values as given in the design table.

Predicted Value

This is the fitted response value as calculated from the model.

Compare this value to the actual value; the closer the two values are, the better the

model fits.

Residual

This is the difference between the actual and the predicted value.

Study all the values; the smaller they are, the better the model fits. Note that

the residual does not say anything about the predictive ability of the model when

applied to new samples.

Leverage

The leverage is the distance of the projected samples to the center of the model. A

sample with high leverage is an influential sample or an outlier. Note that for

saturated models, the leverage is 1 for all samples and there is no residual DF to

estimate error in the model.

Student Residual

A studentized residual is the result from the division of a residual by the estimate of

the sample dependent standard deviation of the residual. The presented values are

the so-called internally studentized residuals, meaning that all samples have been

included in the estimation of the standard deviation. This statistic can be used for

detection of outliers. For any reasonably sized experiment (e.g. n>30), 95% of

normally distributed, studentized residuals will fall in the interval [-2, 2].

Cook’s Distance

The Cook’s distance of an observation is a measure of the global influence of this

observation on all the predicted values. This is done by measuring the effect of

deleting this given observation. Data points with large residuals and/or high leverage

may distort the outcome and accuracy of a regression.

The Cook’s distance gives an actual threshold to judge the samples. Points with a

Cook’s distance of 1 or more are considered to be potential outliers.

Run Order

The run order is the (randomized) order of experimentation. There should not be a

run-order dependent trend in the above diagnostic tools.
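
The relationship between residual, leverage, studentized residual and Cook's distance can be sketched as follows. The formulas are the standard textbook ones (internally studentized residual r = e / sqrt(MS_res * (1 - h)), Cook's distance D = r^2 * h / (p * (1 - h)) with p model parameters); whether the software uses exactly these expressions is an assumption. The data and the function name are hypothetical.

```python
from math import sqrt


def diagnostics(residuals, leverage, n_params):
    """Internally studentized residuals and Cook's distances per sample."""
    n = len(residuals)
    # All samples contribute to the error estimate (internal studentization).
    ms_res = sum(e ** 2 for e in residuals) / (n - n_params)
    rows = []
    for e, h in zip(residuals, leverage):
        student = e / sqrt(ms_res * (1.0 - h))
        cooks_d = student ** 2 * h / (n_params * (1.0 - h))
        rows.append({
            "residual": e,
            "leverage": h,
            "student": student,
            "cooks_d": cooks_d,
            # Rules of thumb from the text: |r| > 2 or Cook's distance >= 1.
            "suspect": abs(student) > 2.0 or cooks_d >= 1.0,
        })
    return rows


# Hypothetical residuals for eight runs of a model with three parameters.
example_residuals = [-0.5, -1.5, 0.5, -0.5, 1.5, 0.5, -1.5, 1.5]
rows = diagnostics(example_residuals, leverage=[0.375] * 8, n_params=3)
for row in rows:
    print(row)
```

None of the eight hypothetical samples is flagged: the largest studentized residual is about 1.34 and the largest Cook's distance is 0.36.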

Diagnostics

Effect visualization

This plot displays one effect at a time for a given response. To change the displayed effect

and the response click on the arrows or on one of the cells of the “Summary of the

effects” table.

It is useful to study the magnitude of the effects (change in the response value when the

design variable increases from Low to High) and the interactions.

There are two types of effects that can be visualized.

Main Effects

The plot shows the average response value for a specific response variable at the

Low and High levels of the design variable. If there are center samples, the average

response value for the center samples is also displayed. It is useful to study the

magnitude of the main effect (change in the response value when the design

variable increases from Low to High). If there are center samples, one can also

detect a curvature visually. For category variables with more than two levels, the

average response value for each category level is given.

Main effects with curvature

Interaction effects

The plot shows the average change in response values for a design variable

depending on the level of the other variable in a two-factor interaction. One line is

given for the Low level of the second design variable, and one line is given for the

High level of the second design variable.

It is possible to study the magnitude of the interaction effect (1/2 * change in the

effect of the first design variable when the second design variable changes from Low

to High).

For a positive interaction, the slope of the effect for “High” is larger than for

“Low”;

For a negative interaction, the slope of the effect for “High” is smaller than

for “Low”;

For no interaction the curves are parallel.
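
The effect definitions above can be checked on a small example. This is a sketch with hypothetical data and function names: the main effect is the average response at High minus the average response at Low, and the interaction effect is half the change in the effect of the first variable when the second variable goes from Low to High.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def main_effect(x, y):
    """Average response at High (+1) minus average response at Low (-1)."""
    high = mean(yi for xi, yi in zip(x, y) if xi > 0)
    low = mean(yi for xi, yi in zip(x, y) if xi < 0)
    return high - low  # center samples (x == 0) are ignored


def interaction_effect(x_first, x_second, y):
    """Half the change in the first variable's effect between Low and High
    levels of the second variable."""
    def effect_at(sign):
        rows = [(a, yi) for a, b, yi in zip(x_first, x_second, y) if b * sign > 0]
        return main_effect([a for a, _ in rows], [yi for _, yi in rows])
    return 0.5 * (effect_at(+1) - effect_at(-1))


# Hypothetical 2x2 factorial in coded levels.
A = [-1, 1, -1, 1]
B = [-1, -1, 1, 1]
y = [20.0, 30.0, 25.0, 45.0]

effect_A = main_effect(A, y)
effect_B = main_effect(B, y)
effect_AB = interaction_effect(A, B, y)
print(effect_A, effect_B, effect_AB)
```

Here the effect of A is 10 when B is Low and 20 when B is High, so the interaction is positive (0.5 * (20 - 10) = 5) and the line for High is steeper than the line for Low.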

Effect summary

This table plot gives an overview of the significance of all effects for all responses. There are

three values per effect and per response:

Significance: This coded value indicates if the effect is significant for the specific

response. The significance level is also reflected by the color of the row. See the

Significance levels and associated codes table below.

Effect value: This is the value of the effect for the specific response variable.

p-value: Result of the test of significance for the effect.

Significance levels and associated codes

P-value limits Negative effect Positive effect Color code

NS: non-significant. ?: Marginally significant (alpha-level 10%).

Look for rows which contain many ”+” or ”–” signs and are green: these main effects or

interactions are most important for explaining the variance of the response in question.

If the design contains category variables with 3 levels or more, the effects table is replaced

with a multiple comparison plot in the ANOVA overview.

Effect and B-coefficient overview

This overview is available for all designs that contain continuous or 2-level category variables

only. For category variables with 3 levels or more, no single regression coefficient or effect

can describe the variable in question and these plots would be less informative.

This plot shows the value of the regression coefficients with their confidence intervals (CIs)

for one response variable.

The larger the absolute value of the coefficient, the more important the design variable is

for the response variable. The narrower the CI, the more precise the coefficient estimate.

Regression coefficients with their CI

Use the arrows to navigate from one response variable to another or click on the

Response variable to be plotted in the table Regression coefficient table.

B-coefficient table

This table presents the value of the B-coefficient for the associated design variables as well

as B0.

It also gives the 95% confidence interval for the B-coefficients. These values give an idea of

the accuracy of the estimate of the coefficients.

The p- and t-values are computed to test the null hypothesis, H0: the coefficient is equal to

0. Rejection of this hypothesis for a variable means that the variable is important for

describing the response in question. By comparing the t-value with its theoretical

distribution (Student’s T-distribution), the significance level of the studied effect is obtained.

The associated p-value represents the significance of the effect associated with the

B-coefficient. H0 can be rejected if the p-value is smaller than, say, 5% (green color). This

implies that the effect in question is important for modelling the response.
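
As a hedged illustration of how the t-value and the 95% confidence interval relate to a B-coefficient, the sketch below assumes an orthogonal design in coded levels (-1/+1), where the standard error of every coefficient is sqrt(MS_residual / n). The critical t-value is passed in from a Student's t table (2.571 is the two-sided 95% value for 5 degrees of freedom); the coefficient values and the function name are hypothetical.

```python
from math import sqrt


def coefficient_table(coefficients, ms_residual, n_runs, t_critical):
    """t-value and confidence interval for each B-coefficient."""
    # Assumption: orthogonal -1/+1 coded design, so all coefficients
    # share the same standard error.
    se = sqrt(ms_residual / n_runs)
    table = {}
    for name, b in coefficients.items():
        lower, upper = b - t_critical * se, b + t_critical * se
        table[name] = {
            "b": b,
            "se": se,
            "t": b / se,
            "ci": (lower, upper),
            # H0 (coefficient = 0) is rejected when the CI excludes zero.
            "significant": lower > 0.0 or upper < 0.0,
        }
    return table


# Hypothetical model: 8 runs, residual mean square 2.0 on 5 DF.
b_table = coefficient_table(
    {"B0": 13.0, "x1": 2.5, "x2": 0.0},
    ms_residual=2.0,
    n_runs=8,
    t_critical=2.571,
)
for name, row in b_table.items():
    print(name, row)
```

The interval for x1 (about 1.21 to 3.79) excludes zero, so H0 is rejected for x1, while the interval for x2 is centered on zero.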

B-coefficient table

Effect visualization

This plot is shown for all designs except mixture designs. For more information on this plot,

check the ANOVA overview section.

Effect summary

For more information on this plot, check the ANOVA overview section

Residuals overview

These plots can be used to check the adequacy of the model or look for outliers, provided

that there are ample residual degrees of freedom left to study the residuals. If the model is

close to saturated, i.e. the number of effects is almost as high as the number of

observations, artificially structured residuals will result that cannot be interpreted properly.

This is a normal probability plot of the residuals of all the modelled effects. If effects are well

modelled, the residuals should contain unstructured noise only. Points in the upper right or

lower left of the plot that do not approximately follow a straight line going through the rest

of the points deviate from the normal distribution. This is an indication that the model is not

describing the corresponding sample very well; it may be an outlier.

The abd sample in the plot below is a typical example of an outlier. In this particular

example, it was found that the reason was a mis-typed response for that sample. After

correction the residuals of both abd and cef looked more like random noise.

Normal probability of Y-residuals

This is a plot of Y-residuals against predicted Y values. If the model adequately predicts

variations in Y, any residual variations should be due to noise only, which means that the

residuals should be randomly distributed. If this is not the case, the model is not completely

satisfactory, and appropriate action should be taken. If strong systematic structures (e.g.

curved patterns) are observed, this can be an indication of lack of fit of the regression

model. The figure below shows a situation that strongly indicates lack of fit of the model.

This is typical for a model that would benefit from including quadratic terms.

Structure in the residuals

The presence of an outlier is shown in the example below. The outlying sample has a much

larger residual than the others; however, it does not seem to disturb the model to a large

extent.

A simple outlier has a large residual

The figure below shows the case of an influential outlier: not only does it have a large

residual, it also attracts the whole model so that the remaining residuals show a very clear

trend. Such samples should usually be excluded from the analysis, unless there is an error in

the data table that can be corrected.

An influential outlier changes the structure of the residuals

Small residuals (compared to the variance of Y) which are randomly distributed indicate

adequate models.

Histogram of Y-residuals

This plot shows the distribution of the residuals, optionally with a statistics table displayed.

Histogram of Y-residuals

A symmetric bell-shaped histogram which is evenly distributed around zero indicates that

the normality assumption is likely to be true. This is the case in the above plot. Moderate

departures from normality are usually acceptable. Change the resolution of the histogram by

toggling the number of bars in the toolbar.

This plot is a line/bar plot of the Y-residuals in experimental order. It is used to detect if

there is a time-dependent trend in the experimentation. If the Y-residual increases with the

time of experimentation, some non-randomized variation is occurring: the experimentation is

biased by a factor that varies with time. Try to identify it.

This plot can also detect if the variance/spread of the residuals changes over time, which

might violate the constant variance assumption.
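
A simple numerical companion to the visual check is to regress the residuals on the run order. The sketch below is an illustration only and is not a feature of the software described here; the function name and the residual series are hypothetical. A slope close to zero and a weak correlation suggest no time-dependent trend.

```python
from math import sqrt


def run_order_trend(residuals):
    """Least-squares slope and correlation of residuals against run order."""
    n = len(residuals)
    order = list(range(1, n + 1))
    mean_t = sum(order) / n
    mean_e = sum(residuals) / n
    s_te = sum((t - mean_t) * (e - mean_e) for t, e in zip(order, residuals))
    s_tt = sum((t - mean_t) ** 2 for t in order)
    s_ee = sum((e - mean_e) ** 2 for e in residuals)
    return s_te / s_tt, s_te / sqrt(s_tt * s_ee)


# Hypothetical residuals listed in experimental (run) order.
drifting = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
stable = [0.4, -0.6, 0.5, -0.3, 0.2, -0.5, 0.3]

drift_slope, drift_r = run_order_trend(drifting)
stable_slope, stable_r = run_order_trend(stable)
print(drift_slope, drift_r)
print(stable_slope, stable_r)
```

The first series drifts upwards by 0.5 units per run, which would point to a factor that varies with time; the second shows no such drift.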

Y-residuals in experimental order: No apparent time-effect (left), clear time-dependent effect (right)

ANOVA table

For more information check the ANOVA overview section

Diagnostics

For more information check the ANOVA overview section

B-coefficients

This plot node is available for all designs except designs with categorical design variables

with three levels or more and for mixture designs.

For more information on this plot, look at the section Effect and B-coefficient overview

B-coefficient table

For more information on this plot, look at the section Effect and B-coefficient overview

Effect visualization

This plot node is available for all designs except designs with categorical design variables

with three levels or more and for mixture designs.

Effect visualization

For more information check the DoE overview section

Effect summary

For more information check the DoE overview section

Cube plot

This plot is available for all factorial designs (incl. Plackett-Burman). It displays the average of

a specified response variable at the experimental points.

Cube plot

The plot is most useful when there are two or three design variables. If there are more than

three design variables it is possible to choose which cube to represent using the arrows for

X, Y and Z.

Error table

The error table is a summary of the quality parameters available for the analysis of design

data. See ANOVA table for a description of the individual terms.

Error table

This is a scatter plot of the predicted response values vs. the reference/measured values.

The better the fit, the closer the values will fall on a straight line. See section on calibration

values in Predicted vs. Reference for details.

Predicted vs. Reference plot

Response surface

There are two types of response surface (RS) plots. A square response surface is given for

non-mixture designs and a triangular response surface is given for mixture designs.

This plot is used to find the levels of the design variables that will give an optimal response,

and to study the general shape of the response surface. It shows the response surface for

one response variable at a time.

Look at this plot as a map which tells how to reach the experimental objective. Two design

variables are studied over their range of variation; the remaining ones are by default held

constant at their mean level. The levels of the non-plotted variables can be tuned in the RS

table. For mixture designs, three components are plotted, and the response surface has a

simplex (triangular) shape.

The response surface is initially viewed from the top, i.e. the axis showing the predicted

response points out from the plot and contour lines indicate where the predicted response

has the same value. Pointing the cursor anywhere in the design region will show the

coordinate values as well as the predicted response value for that point. A color-bar

translates the colors into levels of response values.
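
The idea behind the response surface can be sketched by evaluating a fitted model on a grid of two design variables while the remaining variable is held constant. The quadratic model below, its coefficients and the function names are purely hypothetical; the variables are expressed in coded units from -1 to 1, and the third variable is fixed at its mean level (0).

```python
def predicted_response(x1, x2, x3=0.0):
    """Hypothetical fitted quadratic model in coded units."""
    return 50.0 + 4.0 * x1 + 2.0 * x2 + 1.0 * x3 - 4.0 * x1 ** 2 - 2.0 * x2 ** 2


def coded_grid(steps):
    """Evenly spaced coded levels from -1 to 1 (steps + 1 points)."""
    return [-1.0 + 2.0 * i / steps for i in range(steps + 1)]


grid = coded_grid(8)  # -1.0, -0.75, ..., 1.0

# Two plotted variables are varied over their range; x3 stays at its mean.
surface = [(x1, x2, predicted_response(x1, x2)) for x1 in grid for x2 in grid]

# The grid point with the highest predicted response.
best = max(surface, key=lambda point: point[2])
print(best)
```

A contour line in the plot connects grid points that share the same predicted value; the highest point of this hypothetical surface lies at x1 = 0.5, x2 = 0.5.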

Response surface plot

The response surface can also be rotated and viewed in 3D from any angle using the mouse:

Rotated response surface plot

Different representations of the response surface can be seen by selecting the toolbar

options Mesh, Floor Contour or Surface Contour.

Response surface right click options

The following options are available from the right click menu in a response surface plot.

From the DOE menu all available analysis plots can be accessed.

Click View to switch between Graphical or Numerical view (also accessible from the toolbar),

or to toggle the colorbar (Legend) on or off.

Copy a bitmap representation to the clipboard for pasting into other applications, or Save

Plot using either of the formats JPEG, PNG, BMP, PNM or TIFF.

The Auto Scale option available from the right click or toolbar menu will return to a default

size 2D-plot.

The following Properties can be tuned from the plot properties dialog:

Appearance

or off. Also accessible as a toolbar check box when the response surface plot

is active.

Also accessible as a toolbar check box when the response surface plot is

active.

Also accessible as a toolbar check box when the response surface plot is

active.

Plot Font

Bold: Toggle bold font for title, axis, colorbar and tooltip text on and off.

Italic: Toggle italic font for title, axis, colorbar and tooltip text on and off.

Name: Switch between font families Arial, Courier and Times for title, axis,

colorbar and tooltip text.

Size: Set font size as a relative number. The plotting library automatically

attempts to find the best font size for different text. You may increase or

decrease the size of all plot text within the range of 0.1 (very small) and 4.0

(very large).

This table is used to select design and response variables to plot, to set the levels of non

plotted factors and optionally to impose optimization constraints on any of the design or

response variables. The latter is a very useful tool to find the optimum level combinations

for one or more responses. By imposing constraints on multiple responses simultaneously

and overlaying them in the same plot, it can immediately be seen which level combinations

are allowed and which fall outside of the (tuneable) optimization regions.

Design variables

In a response surface for non-mixture designs two design variables are plotted while

the others are fixed. For mixture designs, three mixture components are plotted in a

simplex plot.

To select the variables to plot, tick/uncheck the box in the Display column.

Optimization constraints for design variables can be set using the sliders or by manually

entering the values in the Min and Max columns. The area outside of the selected

design region will be grayed out.

To set the level of the non-plotted variables enter the value manually in the column

Current. By default this value is the average value.

For mixture designs the levels of the components cannot vary independently of each

other, as the mixture constraint imposes that all components must sum to the

Mixture Sum always. Therefore, if a non-plotted variable is tuned, the axes and Max

levels of the plotted variables are updated accordingly. A minimum Max value

corresponding to 3.5% of the total range is enforced for plotted mixture

components.

For mixture designs there is an additional column with Freeze check-boxes. This is

useful for designs with 5 components or more. If the current level of a non-plotted

mixture variable is increased until the plotted variable axes cannot be reduced any

more, the levels of other non-plotted components will be reduced instead. If freeze

is checked for a non-plotted variable, its current value cannot be changed due to a

change in other variables.

For category variables select one of the levels using the drop-down list.

Response variables

Only one response variable can be plotted at a time. Select the response to plot by

ticking the variable of interest.

Optimization constraints for response variables can be set using the sliders or by

manually entering the values in the Min and Max columns. Setting optimization

constraints for multiple responses simultaneously is a very useful tool for finding the

optimal design settings.
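
Overlaying constraints on several responses amounts to keeping only those level combinations for which every predicted response lies within its Min and Max limits. The sketch below illustrates this with two hypothetical linear response models in coded units; the response names, coefficients and limits are invented for the example.

```python
def response_models(x1, x2):
    """Hypothetical fitted models for two responses, in coded units."""
    return {
        "Yield": 10.0 + 2.0 * x1 + 3.0 * x2,
        "Impurity": 5.0 + x1 - x2,
    }


# (Min, Max) per response; None means that side is unconstrained.
constraints = {"Yield": (12.0, None), "Impurity": (None, 5.0)}


def within(value, limits):
    low, high = limits
    return (low is None or value >= low) and (high is None or value <= high)


levels = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Level combinations that satisfy all response constraints simultaneously.
feasible = [
    (x1, x2)
    for x1 in levels
    for x2 in levels
    if all(within(response_models(x1, x2)[name], limits)
           for name, limits in constraints.items())
]
print(feasible)
```

Only five of the 25 grid points satisfy both constraints; in the plot, everything outside this region would fall outside the allowed optimization region.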

Response surface table

Multiple comparison

This node is given for non-saturated designs with at least one category variable. It shows

whether the distance between levels is larger than a critical distance, in which case the

levels are considered to belong in different groups. Because the critical distance is calculated

from the data, residual degrees of freedom are required for these plots to be displayed.

This is a comparison of the average of a given response variable for the different levels of a

design variable. It shows whether any of the levels are associated with a higher or lower

mean response compared to the other levels.

This plot displays one design variable and one response variable at a time. Use the

toolbar arrows to switch between category variables and the toolbar drop-down

box for changing the response variable to display. If there is a significant difference between

one categorical level and the other levels, the average response values are plotted in

different groups along the X-axis.

Multiple Comparisons

The average response value is displayed as a red square and its value can be read on

the vertical axis or by mouse-over.

The levels are grouped along the horizontal axis by significantly different groups.

The names of the different levels can be seen by mouse-over.

Levels that are not significantly different are linked by blue vertical bars. Each

vertical bar is the size of half the critical distance. Two levels have significantly

different average response values if they are not linked by any bar.

The critical distance is indicated in the x-axis title.

Group table

The group table shows the levels associated with the different groups. This table takes the

value 1 if the level is part of the specified group and 0 if not. One level can be associated

with several groups.

Group table

Distance table

This table shows, for a specific response variable and a specific category variable, the

distance between the average response values of each pair of levels.

Distance table
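
The logic behind the group and distance tables can be sketched as follows. The level means and the critical distance are hypothetical inputs; how the software derives the critical distance from the data is not reproduced here. Two levels are treated as significantly different when the distance between their average responses exceeds the critical distance, and a group is a maximal run of sorted levels that all lie within the critical distance of each other.

```python
from itertools import combinations


def distance_table(level_means):
    """Absolute distance between average responses for each pair of levels."""
    return {
        (a, b): abs(level_means[a] - level_means[b])
        for a, b in combinations(sorted(level_means), 2)
    }


def comparison_groups(level_means, critical_distance):
    """Maximal groups of levels that are not significantly different."""
    ordered = sorted(level_means, key=level_means.get)
    groups, last_end = [], -1
    for i, start in enumerate(ordered):
        j = i
        while (j + 1 < len(ordered)
               and level_means[ordered[j + 1]] - level_means[start] <= critical_distance):
            j += 1
        if j > last_end:  # skip runs already contained in the previous group
            groups.append(ordered[i:j + 1])
            last_end = j
    return groups


# Hypothetical average responses for three levels of a category variable.
means = {"A": 10.0, "B": 11.0, "C": 15.0}
critical = 2.0  # hypothetical critical distance

distances = distance_table(means)
significant = {pair: d > critical for pair, d in distances.items()}
groups = comparison_groups(means, critical)
print(distances, significant, groups)
```

Levels A and B fall in the same group, while C forms a group of its own. With other data a level may appear in more than one group, as noted for the group table.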

B-coefficient table

For more information look at the description in the B-coefficients section. If one of the

categorical variables has three levels or more, an Effect visualization is plotted instead of the

B-coefficient table.

8.8.3 Available plots for Partial Least Squares Regression (DoE PLS)

When PLSR is performed on designed data, all the regular PLSR plots are available. In

addition, DoE PLS provides some plots that are useful for DoE purposes.

Overview

This plot displays the weighted regression coefficients for the optimal number of factors

with their uncertainty limits.

Stable weighted B-coefficients show an uncertainty limit that does not cross the 0-line.

Regression coefficients

Explained Variance

This is the total explained variance plot for models of an increasing number of components.

Use the toolbar buttons to switch between X-/Y-variance, calibration/validation variance and

one-out) cross-validation.

Refer to the Explained variance plot in PLSR for more details.

This plot displays the p-values obtained from the uncertainty test of regression coefficients.

Small p-values indicate model terms that most likely have an important effect on the

response.

Four significance levels are given at 0.01 (dark green), 0.05 (light green), 0.1 (yellow) and 0.2

(red). Terms with p-values lower than one of the lines are significant at the corresponding

level.
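
The reading of the four significance lines can be written as a small lookup. This is only a sketch of the rule stated above; the function name is hypothetical, and it returns the strictest line that a p-value falls below.

```python
# Significance lines and their colors as listed in the text.
SIGNIFICANCE_LINES = [
    (0.01, "dark green"),
    (0.05, "light green"),
    (0.1, "yellow"),
    (0.2, "red"),
]


def strictest_level(p_value):
    """Return (level, color) of the strictest line the p-value is below,
    or None if the term is not significant even at the 0.2 level."""
    for level, color in SIGNIFICANCE_LINES:
        if p_value < level:
            return level, color
    return None


# Hypothetical p-values for three model terms.
for term, p in {"x1": 0.004, "x2": 0.03, "x1*x2": 0.5}.items():
    print(term, p, strictest_level(p))
```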

PLSR ANOVA p-values

By default the Predicted vs. Reference plot shows the results for the first Y-variable. To see

the results for other Y-variables, use the arrows next to the response values. In

addition by default the results are shown for a specific number of factors (or PCs), that

should reflect the dimensionality of the model. If the number of factors (or PCs) is not

The selected predicted Y-value from the model is plotted against the measured Y-value. This

is a good way to check the quality of the regression model. If the model gives a good fit, the

plot will show points close to a straight line through the origin and with slope close to 1.

Turn on Plot Statistics (using the View menu) to check the slope and offset, RMSEP/RMSEC

and R-squared. Generally, all the Y-variables should be studied and give good results.

Note: Before interpreting the plot, check whether the plots are displaying

Calibration or Validation results (or both).

Menu option Window - Identification tells whether the plots are displaying Calibration (if

Ordinate is yPredCal) or Validation (yPredVal) results.

Use the buttons to switch Calibration and Validation results off or on.

It is also useful to show the regression line using the corresponding toolbar icon, and to compare it with the target line.

Some statistics are available that give an idea of the quality of the regression.

When Calibration and Validation results are displayed together as shown in the figure below,

pay special attention to:

Differences between Cal and Val

If there are large differences, the model cannot be trusted.

R-squared

The first one (in blue) is the raw R-squared of the model, the second one (in red) is

also called adjusted R-squared and tells how good a fit can be expected for future

predictions. R-squared varies between 0 and 1. A value of 0.9 is usually considered

as pretty good but this varies depending on the application and on the number of

samples.

RMSE

The first one (in blue) is the Calibration error RMSEC, the second one (in red) is the

expected Prediction error RMSEP. Both are expressed in the same unit as the

response variable Y.
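The statistics listed above can be reproduced outside the software. The following is a minimal sketch, assuming that the regression line is fitted with the predicted values on the ordinate and the reference values on the abscissa, and that R-squared is taken as the squared correlation between the two; the function name `plot_statistics` and the data are hypothetical and are not part of The Unscrambler®.

```python
import numpy as np

def plot_statistics(y_ref, y_pred):
    """Slope, offset, R-squared and RMSE of predicted vs. reference values."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Regression line of predicted (ordinate) on reference (abscissa)
    slope, offset = np.polyfit(y_ref, y_pred, deg=1)
    # Assumption: R-squared taken as the squared Pearson correlation
    r2 = np.corrcoef(y_ref, y_pred)[0, 1] ** 2
    # Root mean square error, expressed in the units of Y
    rmse = np.sqrt(np.mean((y_pred - y_ref) ** 2))
    return {"slope": slope, "offset": offset, "r2": r2, "rmse": rmse}

# Illustrative values only
y_ref = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
stats = plot_statistics(y_ref, y_pred)
print(stats)
```

Running the same function once on calibration predictions and once on validation predictions gives the pairs of values (RMSEC and RMSEP, for instance) that the plot displays in blue and red.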

Predicted vs. Reference plot for Calibration and Validation, with Plot Statistics turned on, as

well as Regression line and Target line.

The figures below show two different situations: one indicating a good fit, the other a poor

fit of the model.

Predicted vs. Reference shows how well the model fits

How to detect outliers

One may also see cases where the majority of the samples lie close to the line while a few of

them are further away. This may indicate good fit of the model to the majority of the data,

but with a few outliers present (see the figure below).

Detecting outliers on a Predicted vs. Reference plot

In the above plot, sample 3 does not follow the regression line whereas all the other

samples do. Sample 3 may be an outlier.

How to detect nonlinearity

In other cases, there may be a nonlinear relationship between the X- and Y-variables, so that

the predictions do not have the same level of accuracy over the whole range of variation of

Y. In such cases, the plot may look like the one shown below. Such nonlinearities should be

corrected if possible (for instance by a suitable transformation), because otherwise there

will be a systematic bias in the predictions depending on the range of the sample.

Predicted vs. Reference shows a nonlinear relationship

This table presents the effect values for all variables as well as their significance levels and p-

values.

PLSR-ANOVA Summary table

P-value        Negative effect    Positive effect    Color code
[0.10:0.05]    ?                  ?                  yellow

NS: non significant.

?: possible effect at the significance level 10%.

This is a normal probability plot of the Y-residuals after a given number of components. As

residuals are supposed to contain little or no structured variation, all the points should

ideally fall close to a straight line. See Normal probability of Y-residuals for more details.

X- and Y-Loadings

A 2-D scatter plot of X- and Y-loadings for two specified components (factors) from PLSR is a

good way to detect important variables and relationships between variables. The plot is

most useful for interpreting component 1 vs. component 2, since they represent the largest

variations in the X-data that explain the largest variation in the Y-data. By default both Y-

and X-variables are displayed but it is possible to modify that by clicking on the X and Y

icons.

Interpret the X-Y relationships

To interpret the relationships between X and Y-variables, start by looking at the response (Y)

variables.

Predictors (X) projected in roughly the same direction from the center as a response,

are positively linked to that response. In the example below, predictors sweet, red

and color have a positive link with response Pref.

Predictors projected in the opposite direction have a negative relationship, as

predictor thick in the example below.

Predictors projected close to the center, as bitter in the example below, are not well

represented in that plot and cannot be interpreted.

Glossiness, Meltiness), four process predictors (Amount of dry matter, maturity, pH and

addition of recycled dry matter)

The maturity has a negative effect on the adhesiveness of the cheese; they are anti-correlated. The amount of dry matter positively affects the stickiness and negatively affects the glossiness and meltiness. Glossiness and meltiness, two responses, are correlated.

Caution! If the X-variables have been standardized, one should also standardize the

Y-variable so that the X- and Y-loadings have the same scale; otherwise the plot

may be difficult to interpret.

The plot shows the importance of the different variables for the two components specified. It should preferably be used together with the corresponding scores plot. Variables with loadings to the right in the loadings plot will be X-variables which usually have high values for samples to the right in the scores plot, etc. This plot can be used to study the relationships among the X-variables and between the X- and Y-variables.

If the Uncertainty test was activated, the important variables will be circled. It is also possible to mark them by using the corresponding toolbar icon.

Loadings plot with circled important variables

identified.

When a PLSR analysis has been performed and a two-dimensional plot of loadings is

displayed on the screen, the Correlation Loadings option (available from the View menu and from a toolbar icon) can be used to aid in the discovery of the structure in the data. Correlation

loadings are computed for each variable for the displayed factors. In addition, the plot

contains two ellipses to help check how much variance is taken into account. The outer

ellipse is the unit-circle and indicates 100% explained variance. The inner ellipse indicates

50% of explained variance. The importance of individual variables is visualized more clearly

in the correlation loadings plot compared to the standard loadings plot.
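The idea behind correlation loadings can be sketched as follows. This is an illustration under stated assumptions, not the software's implementation: the scores come from a PCA computed by SVD as a stand-in for PLSR scores, the data are synthetic, and the helper name `correlation_loadings` is hypothetical. Each correlation loading is the correlation between one variable and one score vector; because the score vectors are orthogonal, the squared distance of a variable from the origin is the fraction of its variance explained by the two plotted factors, which is what the two ellipses mark.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
latent = rng.normal(size=(n, 2))
# Five illustrative variables: four driven by the latent factors, one pure noise
X = np.column_stack([
    latent[:, 0],
    -latent[:, 0] + 0.3 * rng.normal(size=n),
    latent[:, 1],
    latent[:, 0] + latent[:, 1],
    rng.normal(size=n),
])

# Stand-in for the model scores: PCA scores from the SVD of the centered data
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]                     # factor 1 and factor 2

def correlation_loadings(data, scores):
    """Correlation between every variable (column) and every score vector."""
    d = data - data.mean(axis=0)
    t = scores - scores.mean(axis=0)
    norms = np.outer(np.linalg.norm(d, axis=0), np.linalg.norm(t, axis=0))
    return (d.T @ t) / norms

R = correlation_loadings(X, scores)           # shape (variables, 2)
explained = np.sum(R ** 2, axis=1)            # squared distance from the origin
outside_inner_ellipse = explained >= 0.5      # at least 50% explained variance
print(np.round(explained, 2), outside_inner_ellipse)
```

The same function applied to the Y-variables and the same scores would place the responses in the plot, so that X- and Y-variables share one scale regardless of their original units.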

Correlation Loadings of process variables (X) and the quality of the cheese (Y) along (factor

1,factor 2)

Variables close to each other in the loadings plot will have a high positive correlation if the

two components explain a large portion of the variance of X. The same is true for variables in

the same quadrant lying close to a straight line through the origin. Variables in diagonally

opposed quadrants will have a tendency to be negatively correlated. For example, in the

figure above, variables dry matter and stickiness have a high positive correlation on factor 1

and factor 2, and they are negatively correlated to variables meltiness and glossiness.

Variables adhesiveness and stickiness have independent variations. Variables addition of

recycled dry matter and pH are very close to the center; they are not well described by

factor 1 and factor 2.

Note: Variables lying close to the center are poorly explained by the plotted factors

(or PCs). They cannot be interpreted in that plot.

This document, which can be downloaded from our web site, details the algorithms used in

The Unscrambler® as well as some statistical measures and formulas.

http://www.camo.com/helpdocs/The_Unscrambler_Method_References.pdf

8.10. Bibliography

R. C. Bose and K. Kishen, On the problem of confounding in the general symmetrical factorial

design, Sankhya, 5, 21, (1940).

J.A. Cornell, Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data,

Second edition, John Wiley and Sons, New York, 1990.

G.H. Golub and C.F. Van Loan, Matrix Computations, Third edition, Johns Hopkins University

Press, 1996.

R.W. Kennard and L.A. Stone, Computer Aided Design of Experiments, Technometrics, 11(1),

137-148, (1969).

G.A. Lewis, D. Mathieu, and R. Phan-Tan-Lu, Pharmaceutical Experimental Design, Marcel

Dekker, Inc., New York, 1999.

D.C. Montgomery, Design and Analysis of Experiments, Sixth edition, John Wiley & Sons,

New York, 2004.

R.H. Myers and D.C. Montgomery, Response Surface Methodology: Process and Product

Optimization using Designed Experiments, Second edition, Wiley, New York, 2002.

T. Naes and T. Isaksson, Selection of Samples for Calibration in Near-Infrared Spectroscopy.

Part I: General Principles Illustrated by Example, Appl. Spectrosc., 43(2), 328-335, (1989).

N.-K. Nguyen and G.F. Piepel, Computer-Generated Experimental Designs for Irregular-

Shaped Regions, QTQM, 2(2), 147-160, (2005).

R.E.A.C. Paley, On orthogonal matrices, J. Math. Phys., 12, 311–320, (1933).

R.L. Plackett and J.P. Burman, The Design of Optimum Multifactorial Experiments,

Biometrika, 33, 305-25, (1946).

H. Scheffé, Experiment with Mixtures, J. Roy. Stat. Soc. Ser. B, 20, 344-366, (1958).

9. Validation

9.1. Validation

Model validation is performed for PCA or regression models to estimate how useful the

model will be for future observations. It returns the predictive ability of the model as

opposed to the model’s fit to the training data.

Theory

Dialog usage: Validation tab

Dialog usage: Cross validation setup

Validating a model based on empirical data means checking how well the model will perform

on new data of the same kind that was used in developing the model. The validation of a

model estimates the uncertainty of future predictions that may be made with the model. If

the uncertainty is reasonably low, the model can be considered valid. However, regression

methods are also applied for modeling relations between blocks of data without any

objective of implementing the model in a process or in an instrument.

This chapter presents the purposes and principles of model validation in multivariate data

analysis.

What is validation?

Test set validation

How to select a test set

Cross validation

Leverage correction

Validation results

When to use which validation method

Uncertainty testing with cross validation

How does the uncertainty test work?

Uncertainty of regression coefficients

Uncertainty of loadings and loading weights

Stability plots

Easier to interpret important variables in models with many

components

Remove non-significant variables for more robust models

Application areas

More details about the uncertainty test

Model validation check list

To keep this discussion as general as possible, it is written with focus on the case of a

regression model. However, the same principles apply to PCA and other methods.

For the case of validation of PCA results:

Disregard the sections on RMSEP.

Validating a model based on empirical data means checking how well the model will perform

on new data.

A regression model is often made to do predictions in the future. The validation of the

model estimates the uncertainty of such future predictions. If the uncertainty is reasonably

low, the model can be considered valid. However, regression methods are also applied for

modeling relations between blocks of data without any objective of implementing the model

in a process or in an instrument.

The same argument applies to a descriptive multivariate analysis such as PCA: If the

objective of the PCA is to extrapolate the correlations observed in the data table to future,

similar data, one should check whether they still apply for new data.

In The Unscrambler® three methods are available to estimate the model stability and

prediction ability: test set validation, cross validation and leverage correction.

Test set validation

Test set validation is based on testing the model on a subset of the available samples, which

will not be present in the computations of the model parameters.

The global data table is split into two subsets:

Calibration set

contains all samples used to compute the model components, using X- and Y-values;

Test set

contains all the remaining samples, for which X-values are fed into the model once a

new component has been computed. Their predicted Y-values are then compared to

the observed Y-values, yielding a prediction residual that can be used to compute a

validation residual variance or an RMSEP.

A test set should contain 20-40% of the full data table. The calibration and test set should in

principle cover the same population of samples as well as possible. Samples which can be

considered to be replicate measurements should not be present in both the calibration and

test set.
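As an illustration of the principle, the sketch below sets aside a random 30% of a synthetic data table as test set, calibrates on the remaining samples, and computes the RMSEP from the prediction residuals of the test samples. It assumes ordinary least squares as a stand-in for the regression method; the data, the variable names and the helper `add_intercept` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data table: 60 samples, 3 X-variables, one Y-variable (illustrative)
n_samples = 60
X = rng.normal(size=(n_samples, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=n_samples)

# Random selection: put 30% of the samples aside as test set
n_test = int(round(0.3 * n_samples))
order = rng.permutation(n_samples)
test_idx, cal_idx = order[:n_test], order[n_test:]

def add_intercept(M):
    return np.column_stack([np.ones(len(M)), M])

# Calibrate on the calibration set only (least squares as a stand-in)
b, *_ = np.linalg.lstsq(add_intercept(X[cal_idx]), y[cal_idx], rcond=None)

# Predict the test samples and compare with their observed Y-values
residuals = y[test_idx] - add_intercept(X[test_idx]) @ b
rmsep = np.sqrt(np.mean(residuals ** 2))
print(f"test samples: {n_test}, RMSEP: {rmsep:.3f}")
```

With real data, replicate measurements of one physical sample would have to be assigned together to either the calibration set or the test set before the split, which a purely random permutation of rows does not guarantee.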

There are several ways to select test sets:

Manual selection

is recommended since it gives one full control over the selection of a test set;

Random selection

is the simplest way to select a test set, but leaves the selection to the computer;

Group selection

makes it possible for the user to specify a set of samples as test set by selecting a

value or values for one of the variables. This should only be used under special

circumstances. An example of such a situation is a case where there are two true

replicates for each data point, and a separate variable indicates which replicate a

sample belongs to. In such a case, one can construct two groups according to this

variable and use one of the sets as test set. The group can be selected from one

chosen level of a category variable.

Cross validation

Though the objective is to have enough samples to put a reasonable amount aside as a test

set, this is not always possible due, for example, to the cost of samples or reference testing.

The best alternative to an independent test set for validation is to apply cross validation.

With cross validation, the same samples are used both for model estimation and testing. A

few samples are left out from the calibration data set and the model is calibrated on the

remaining data points. Then the values for the left-out samples are predicted and the

prediction residuals are computed. The process is repeated with another subset of the

calibration set, and so on until every object has been left out once; then all prediction

residuals are combined to compute the validation residual variance and RMSEP. It is essential to be aware of the level at which the model is to be validated. For example, if one physical sample is measured three times, and the objective is to

establish a model across samples, the three replicates must be held out in the same cross

validation segment. If the objective is to validate the repeated measurement, keep out one

replicate for all samples and generate three cross validation segments. The calibration

variance is always the same; it is the validation curve that is the important figure of merit

(and the RMSECV for regression models).
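The two validation levels described above differ only in how the rows are assigned to segments. The following sketch, which assumes ordinary least squares as a stand-in for the regression method and uses synthetic data with hypothetical names, runs the same cross validation loop twice: once with all replicates of a physical sample in one segment, and once with one replicate of every sample per segment.

```python
import numpy as np

rng = np.random.default_rng(1)

# 12 physical samples, each measured 3 times (replicates); illustrative data
n_physical, n_rep = 12, 3
sample_id = np.repeat(np.arange(n_physical), n_rep)
replicate_no = np.tile(np.arange(n_rep), n_physical)
X_true = rng.normal(size=(n_physical, 2))
n_rows = n_physical * n_rep
X = X_true[sample_id] + 0.05 * rng.normal(size=(n_rows, 2))
y = X_true[sample_id] @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=n_rows)

def fit(Xc, yc):
    A = np.column_stack([np.ones(len(Xc)), Xc])
    b, *_ = np.linalg.lstsq(A, yc, rcond=None)
    return b

def predict(b, Xn):
    return np.column_stack([np.ones(len(Xn)), Xn]) @ b

def rmsecv(segment_label):
    """Leave out one segment at a time; every row is predicted exactly once."""
    residuals = np.empty_like(y)
    for label in np.unique(segment_label):
        out = segment_label == label
        b = fit(X[~out], y[~out])
        residuals[out] = y[out] - predict(b, X[out])
    return np.sqrt(np.mean(residuals ** 2))

# Validate across samples: replicates of a sample share a segment (12 segments)
rmsecv_across_samples = rmsecv(sample_id)
# Validate the repeated measurement: one replicate per segment (3 segments)
rmsecv_across_replicates = rmsecv(replicate_no)
print(rmsecv_across_samples, rmsecv_across_replicates)
```

Passing a different label vector (a season, supplier or operator code, for example) to the same loop corresponds to validating across the levels of a category variable.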

Several versions of the cross validation approach can be used:

Full cross validation

leaves out only one sample at a time; it is the original version of the method;

Segmented cross validation

leaves out a whole group of samples at a time. A typical example is when there are

systematic replicated measurements of one physical sample;

Test-set switch

divides the global data set into two subsets, each of which will be used alternatively

as calibration set and as test set;

Category variable

enables the user to validate across levels of category variables. This is useful for

evaluating how robust the model is across season, raw material supplier, location,

operator, etc.

When running a cross validation, one can get prediction diagnostics for the cross validation

segments. These are not available when full cross-validation is used. This option will provide

information on the validation results for each cross validation segment including RMSEP,

SEP, bias, slope, offset and correlation. The CV prediction diagnostics are added as a matrix

in the Validation folder of the PLSR model.

Leverage correction

Leverage correction is an approximation to cross validation that enables prediction residuals

to be estimated without actually performing any prediction. It is based on an equation that

is valid for MLR, but is only an approximation for PLSR and PCR.

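According to this equation, the prediction residual equals the calibration residual divided by one minus the leverage of the sample:

$$ f_{i}^{\mathrm{pred}} = \frac{f_{i}^{\mathrm{cal}}}{1 - h_{i}} $$

where $h_i$ is the leverage of sample $i$.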

All samples with low leverage (i.e. low influence on the model) will have estimated

prediction residuals very close to their calibration residuals (the leverage being close to

zero). For samples with high leverage, the calibration residual will be divided by a smaller

number, thus giving a much larger estimated prediction residual.
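For MLR the relation can be checked numerically. The sketch below assumes the standard least squares setting, in which the leverage of a sample is the corresponding diagonal element of the hat matrix and the estimated prediction residual is the calibration residual divided by (1 - leverage). It compares this estimate with an explicit full cross validation; the data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
# Intercept column plus two predictors (illustrative data)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.2 * rng.normal(size=n)

# Calibration fit on all samples
b, *_ = np.linalg.lstsq(X, y, rcond=None)
cal_residuals = y - X @ b

# Leverage of each sample: diagonal of the hat matrix X (X'X)^-1 X'
leverage = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))

# Leverage-corrected estimate of the prediction residuals
corrected = cal_residuals / (1.0 - leverage)

# Reference: explicit full (leave-one-out) cross validation
loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    loo[i] = y[i] - X[i] @ b_i

print(np.allclose(corrected, loo))
```

The agreement is exact only for a least squares fit of this kind. For PLSR and PCR the components themselves change when a sample is left out, which is why the same correction is only an approximation there.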

In the earlier days of multivariate modeling, when computer power was a fraction of what it

is today, this method was applied in the initial modeling. Nowadays, the user typically has

the possibility to perform cross validation for most data sets without much computation

time, making the leverage correction more of a relic of the old days.

The simplest and most efficient measure of the uncertainty on future predictions is the

RMSEP. This value (one for each response) is a measure of the average uncertainty that can

be expected when predicting Y-values for new samples, expressed in the same units as the

Y-variable. The results of future predictions can then be presented as predicted values ±

2*RMSEP. This measure is valid provided that the new samples are similar to the ones used

for calibration; otherwise, the prediction error might be much higher.

Validation residual and explained variances are also computed in exactly the same way as

calibration variances, except that prediction residuals are used instead of calibration

residuals. Validation variances are used, as in PCA, to find the optimum number of model

components. When validation residual variance is minimal, RMSEP also is, and the model

with an optimal number of components will have the lowest expected prediction error.

RMSEP can be compared with the precision of the reference method. Usually one cannot

expect RMSEP to be lower than twice the precision.

Test set validation can be used if there are many samples in the data table, for instance

more than 50.

It is the most “objective” validation method, since the test samples have no influence on the

calibration of the model.

Cross validation represents a more efficient way of utilizing the samples if the number of

samples is small or moderate.

Segmented cross validation is the faster approach, but full cross validation is also often

applied. The suggested rule of thumb is to do random 10-segment cross validation if there is

no reason to divide the samples into subgroups.

When using segmented cross validation, make sure that all segments contain unique

information, i.e. samples which can be considered as replicates of each other should not be

present in different segments.

The major advantage of cross validation is that it allows for the jack-knifing approach on

which an Uncertainty Test is based. This provides significance testing for PLSR results. For

more information, see Uncertainty testing with cross validation.

Leverage correction for projection methods should only be used in an early stage of the

analysis if it is very important to obtain a quick answer. In general it gives more “optimistic”

results than the other validation methods and can sometimes be highly overoptimistic.

Sometimes, especially for small data tables, leverage correction can give apparently

reasonable results, while cross validation fails completely. In such cases, the “reasonable”

behavior of the leverage correction can be an artifact and cannot be trusted. The reason

why such cases are difficult is that there is too little information for estimation of a model

and each sample is “unique”. Therefore all known validation methods are doomed to fail.

For MLR, leverage correction is strictly equivalent to (and much faster than) full cross

validation.

Users of multivariate modeling methods are often uncertain when interpreting models.

Frequently asked questions are:

Is the model stable?

Why is there a problem?

Dr. Harald Martens has (re-)developed a generic method for uncertainty testing, which gives

a safer interpretation of models. The concept for uncertainty testing is based on cross

validation, jack-knifing and stability plots. This section introduces how the Uncertainty Test

works and shows how it can be used in The Unscrambler® through an application.

The following sections will present the method with a non-mathematical approach.

How does the uncertainty test work?

The test works with PLSR or PCA models with cross validation, choosing full cross validation

or segmented cross validation as is appropriate for the data. When the optimal number of

components (factors) for PLSR has been chosen, tick Uncertainty test on the validation tab

of The Unscrambler® modeling dialog box.

Under cross validation, a number of submodels are created. These submodels are based on

all the samples that were not kept out in the cross validation segment. For every submodel,

a set of model parameters (B-coefficients, loadings and loading weights) is calculated.

Variations over these submodels will be estimated so as to assess the stability of the results.

In addition a total model is generated, based on all the samples. This is the model that will

be used for interpretation.

For each variable one can calculate the difference between the B-coefficient Bi in a

submodel and the Btot for the total model. The Unscrambler® takes the sum of the squares of

the differences in all submodels to get an expression of the variance of the Bi estimate for a

variable.

With a t-test the significance of the estimate of Bi is calculated. Thus the resulting regression

coefficients can be presented with uncertainty limits that correspond to 2 standard

deviations under ideal conditions. Variables with uncertainty limits that do not cross the zero

line are significant variables.

The same can be done for the other model parameters, but there is a rotational ambiguity in

the latent variables of bilinear models. To be able to compare all the submodels correctly,

they are rotated back to the main model before the uncertainty is estimated. Therefore one

can also get uncertainty limits for these parameters.

Stability plots

The results of all these calculations can also be visualized as stability plots in scores, loadings,

and loading weights plots. Stability plots can be used to understand the influence of specific

samples and variables on the model, and explain for example why a variable with a large

regression coefficient is not significant. This will be illustrated in the example that follows

(see Application Example).

Models with many components, three, four or more, may be difficult to interpret, especially

if the first factors (PCs) do not explain much of the variance.

For instance, if each of the first 4-5 PCs explains 15-20% of the variance, the factor 1/factor 2

plot is not enough to understand which are the most important variables.

In such cases, Martens’ automatic uncertainty test shows the significant variables in the

many-component model and interpretation is far easier.

Variables that are non-significant display non-structured variation, i.e. noise. When these

variables are removed, the resulting model will be more stable and robust (i.e. less sensitive

to noise). Usually the prediction error decreases too.

Therefore, after identifying the significant variables by using the automatic marking based

on Martens’ test, use The Unscrambler® function Recalculate with Marked (Right click on

equation node in project navigator, and select Recalculate- With Marked…) to make a new

model and check the improvements.

Application areas

Spectroscopic calibration works better if noisy wavelengths are removed.

Some models (not spectroscopic) may be improved by adding interactions and squares of

the variables, and The Unscrambler® has a feature to do this automatically. However, many

of these terms are irrelevant. Apply Martens’ uncertainty test to identify and keep only the

significant ones.

One of the criticisms of PLS regression has been the lack of significance testing for the model

parameters. Many years of experience have given “rules of thumb” of how to find which

variables are significant. However, these “rules of thumb” do not apply in all cases, and the

users still see the need for easy interpretation and guidance in these matters. The data

analysis must give reasonable protection against wishful thinking based on spurious effects

in the data. To be effective, such statistical validation must be easily understood by its user.

The modified jack-knifing method implemented in The Unscrambler® was invented by Harald Martens and published in Food Quality and Preference (Martens, 1999). Its

details are presented hereafter.

Note: To understand this chapter requires a basic knowledge about the purposes

and principles of chemometrics. For those who have never worked with

multivariate data analysis before, it is strongly recommended that they begin

reading about it in the chapters about PCA and regression before proceeding with

this chapter.

See tutorial M to learn how to use the Uncertainty Test results in practice.

The cross validation assessment of the predictive validity is here extended to uncertainty

assessment of the individual model parameters: In each cross validation segment

m=1,2,…,M a perturbed version of the structure model described is obtained.

For more details refer to the method references chapter.

Each perturbed model is based on all the objects except one or more objects which were

kept ‘secret’ in this cross validation segment m.

If a perturbed segment model differs greatly from the common model, based on all the

objects, it means that the object(s) kept ‘secret’ in this cross validation segment have

significantly affected the common model. These left out objects caused some unique pattern

of variation in the model parameters. Thus, a plot of how the model parameters are

perturbed when different objects are kept ‘secret’ in the different cross validation segments

m=1,2,…,M shows the robustness of the common model against peculiarities in the data of

individual objects or segments of objects.

These perturbations may be inspected graphically in order to acquire a general impression of

the stability of the parameter estimates, and to identify dominating sources of model

instability. Furthermore, they may also be summarized to yield estimates of the

variance/covariance of the model parameters.

This is often called “jack-knifing”. It will here be used for two purposes:

Elimination of useless variables, based on the linear parameters B;

Stability assessment of the bilinear structure parameters T and P’, Q’.

It is also important to be able to assess the bilinear score and loading parameters. However,

the bilinear structure model has a related rotational ambiguity in the latent variables that

needs to be corrected for in the jack-knifing. Only then is it meaningful to assess the

perturbations of scores Tm and loadings Pm and Qm in cross validation model segment # m.

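Any invertible matrix $C_m$ ($A \times A$) satisfies the relationships:

$$ T_m [P', Q']_m = T_m C_m C_m^{-1} [P', Q']_m $$

Therefore, the individual models m=1,2,…,M may be rotated, e.g. towards a common model:

$$ T_{(m)} = T_m C_m, \qquad [P', Q']_{(m)} = C_m^{-1} [P', Q']_m $$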

After rotation, the rotated parameters T(m) and [P’,Q’](m) may be compared to the

corresponding parameters from the common model T and [P’,Q’]. The perturbations may

then be written as (T(m) - T)g and or ([P’,Q’](m) - [P’, Q’])g for the scores and the loadings,

respectively, where g is a scaling factor (here: g=1).

In the implemented code, an orthogonal Procrustes rotation is used. The same rotation

principle is also applied for the loading weights, W, where a separate rotation matrix is

computed for W. The uncertainty estimates for P, Q and W are estimated in the same

manner as for B below.
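An orthogonal Procrustes rotation can be sketched with a singular value decomposition. The example below is a simplified illustration, not the code implemented in the software: it assumes that a hypothetical submodel spans the same space as the common model but is expressed in a rotated basis, and that the rotation is estimated from score matrices with the same rows (in a real cross validation segment the kept-out objects are missing from the submodel). The function name `procrustes_rotation` is hypothetical.

```python
import numpy as np

def procrustes_rotation(T_m, T):
    """Orthogonal C (A x A) minimising ||T_m @ C - T|| in the Frobenius norm."""
    U, _, Vt = np.linalg.svd(T_m.T @ T)
    return U @ Vt

rng = np.random.default_rng(3)
N, K, A = 15, 6, 2
T = rng.normal(size=(N, A))          # common-model scores
P = rng.normal(size=(K, A))          # common-model loadings

# Hypothetical submodel: same structure, expressed in a rotated basis
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
T_m = T @ R
P_m = P @ R                          # keeps T_m @ P_m.T equal to T @ P.T

C = procrustes_rotation(T_m, T)
T_rot = T_m @ C                      # rotated scores, T(m) in the text
P_rot = P_m @ C                      # (C^-1 P_m')' equals P_m C for orthogonal C

# Perturbations after rotation; zero here because only the basis differed
score_perturbation = T_rot - T
loading_perturbation = P_rot - P
print(np.abs(score_perturbation).max(), np.abs(loading_perturbation).max())
```

Because the rotation leaves the product of scores and loadings unchanged, any perturbation that remains after rotation reflects a genuine difference between the submodel and the common model rather than an arbitrary choice of axes.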

On the basis of such jack-knife estimates of the uncertainty of the model parameters,

useless or unreliable X-or Y-variables may be eliminated automatically, in order to simplify

the final model and make it more reliable. The following part describes the cross validation

/ jack-knifing procedure:

When cross validation is applied in regression, the optimal rank A is determined based on

prediction of kept-out objects (samples) from the individual models. The approximate

uncertainty variance of the PCR and PLS regression coefficients B can be estimated by jack-knifing

$$ S^2B = \sum_{m=1}^{M} \left( (B - B_m)\, g \right)^2 $$

where

B (K x J) = the regression coefficient at the cross validated rank A using all the N

objects,

Bm (K x J) = the regression coefficient at the rank A using all objects except the

object(s) left out in cross validation segment m

g = scaling coefficient (here: g=(M-1)(M), where M is the number of cross-validation

segments).

Significance testing

When the variances for B, P, Q, and W have been estimated, they can be utilized to find

significant parameters.

As a rough significance test, a Student’s t-test is performed for each element in B relative to

the square root of its estimated uncertainty variance S²B, giving the significance level for

each parameter. In addition to the significance for B, which gives the overall significance for

a specific number of components, the significance levels for Q are useful to find in which

components the Y-variables are modeled with statistical relevance.
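The procedure can be sketched as follows. This is a hedged illustration rather than the software's algorithm: ordinary least squares is used as a stand-in for PCR or PLSR at a fixed rank, the scaling factor (M-1)/M is the textbook jack-knife choice and is an assumption here, and significance is judged by whether limits of plus or minus two standard deviations exclude zero instead of by an exact t-distribution. All names and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, M = 40, 10                                 # samples, cross validation segments
X = rng.normal(size=(n, 4))
# Variables 0 and 1 carry signal; variables 2 and 3 are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.3 * rng.normal(size=n)

def coefficients(Xc, yc):
    A = np.column_stack([np.ones(len(Xc)), Xc])
    b, *_ = np.linalg.lstsq(A, yc, rcond=None)
    return b[1:]                              # drop the intercept

B = coefficients(X, y)                        # total model, all samples
segment = rng.permutation(n) % M              # random assignment to M segments

# One perturbed submodel per segment, built without the kept-out samples
B_m = np.array([coefficients(X[segment != m], y[segment != m])
                for m in range(M)])

# Jack-knife variance; the (M-1)/M scaling is an assumption of this sketch
var_B = (M - 1) / M * np.sum((B_m - B) ** 2, axis=0)
s_B = np.sqrt(var_B)

t_values = B / s_B                            # coefficient relative to its uncertainty
significant = np.abs(B) > 2.0 * s_B           # limits of +/- 2 s exclude zero
print(np.round(t_values, 1), significant)
```

Variables flagged as non-significant in such a test are the candidates for removal before the model is recalculated.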

In The Unscrambler® validation is always automatically included in model computation.

However, what matters most is the choice of a relevant validation method for the particular

case (data set) being studied, and the configuration of its parameters.

The general validation procedure for PCA and regression is as follows:

Build a first model

Use segmented cross validation or leverage correction — the computations will go

faster. Allow for a large number of factors. Cross validation is recommended as it

also gives the ability to apply Martens’ Uncertainty Test.

Diagnose

the first model with respect to outliers, nonlinearities, any other abnormal behavior.

Take advantage of the variety of diagnostic tools available in The Unscrambler®: variance curves, automatic warnings, scores and loadings, stability plots, influence plot, X-Y relation outliers plot, etc.

Investigate and fix problems

Check improvements

by building a new model.

For regression only: validate intermediate model with a full cross validation, using

Uncertainty Testing, then do variable selection based on significant regression

coefficients.

Validate final model

with a proper method (test set or full cross validation).

Interpret final model

in terms of sample properties, variable relationships, etc. Check RMSEP for

regression models.

Menu options, dialogs, plots for validation.

Validation is configured via the Validation tab for the respective analysis methods on the

Tasks - Analyze menu where one may choose a validation method and further specify

validation details.

Multiple Linear Regression (MLR)

Principal Component Regression (PCR)

Partial Least Squares regression (PLSR)

Support Vector Machine Regression (SVMR)

Support Vector Machine Classification (SVMC)

Linear Discriminant Analysis (LDA)

Leverage Correction

A method used as a first-pass model check. It should not be used as a final

model validation method, as it is an overly optimistic approximation.

Cross Validation

This method is used when either there are not enough samples available to make a

separate test set, or for simulating the effects of different validation test cases, e.g.

systematically leaving samples out vs. randomly leaving samples out, etc.

See Cross validation setup dialog usage

Test matrix

This is also known as Test Set Validation, and uses independent samples that have

not taken part in the calibration for validation. This allows one to define either a

new matrix, of the same number of variables, or a defined range within a single

matrix to be used as an independent check of model performance. Both X- and Y-

matrices need to be defined in this case. This is the preferred method for validation

and should be aimed for.

When running a cross validation with a PLSR or PCR regression, one can select to also

compute the prediction diagnostics for the cross validation segments by checking this

selection in the dialog. These are not available when full cross-validation is used. This option

will provide information on the validation results per each cross validation segment

including RMSEP, SEP, bias, slope, offset and correlation. The CV prediction diagnostics are

added as a matrix in the Validation folder of the PLSR model.
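
As an illustration of the per-segment diagnostics listed above, the following is a minimal sketch, not the software's own implementation. The function name `prediction_diagnostics` is hypothetical, and the sign convention for bias (predicted minus reference) and the n - 1 denominator for SEP are assumptions.

```python
import math

def prediction_diagnostics(reference, predicted):
    """Summarize one validation segment; needs at least two samples."""
    n = len(reference)
    residuals = [p - r for p, r in zip(predicted, reference)]
    bias = sum(residuals) / n  # assumed sign: predicted - reference
    rmsep = math.sqrt(sum(e * e for e in residuals) / n)
    # SEP: spread of the residuals around the bias (assumed n - 1 denominator)
    sep = math.sqrt(sum((e - bias) ** 2 for e in residuals) / (n - 1))
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    s_rr = sum((r - mean_r) ** 2 for r in reference)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    s_rp = sum((r - mean_r) * (p - mean_p)
               for r, p in zip(reference, predicted))
    slope = s_rp / s_rr  # predicted values regressed on reference values
    offset = mean_p - slope * mean_r
    correlation = s_rp / math.sqrt(s_rr * s_pp)
    return {"RMSEP": rmsep, "SEP": sep, "bias": bias,
            "slope": slope, "offset": offset, "correlation": correlation}

# Every prediction is 0.5 too high, so the error is pure bias.
diag = prediction_diagnostics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

In this example the whole error is systematic, so RMSEP equals the bias while SEP is zero.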

Significance testing

The Uncertainty Test option can be used to estimate the significance of variables, when

using cross validation. During cross validation, the differences between the model

parameters estimated from all samples and those of the sub-model for each cross validation

segment are squared and summed. The significance (p-value) is estimated by a t-test with the

model parameter and its standard deviation as input. For PCA the p-values for loadings per

variable and component are returned. For PLS regression p-values are returned for x-

loadings, loading weights, y-loadings and regression coefficients.

This is referred to as Martens’ Uncertainty Test.
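
The idea can be sketched for the slope of a simple straight-line regression with full cross validation. This is an illustrative sketch only: the helper names are hypothetical, the scaling factor (M - 1)/M applied to the summed squared differences is an assumption borrowed from the usual jackknife formula, and only the t-value is computed because the p-value requires a t-distribution that the Python standard library does not provide.

```python
import math

def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def jackknife_t(x, y):
    """Return (slope, estimated standard deviation, t-value)."""
    b_full = slope(x, y)  # parameter from the model on all samples
    m = len(x)            # one segment per sample (full cross validation)
    # Sub-model parameters, each estimated with one sample left out
    b_sub = [slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(m)]
    # Squared differences between the full model and each sub-model, summed
    ss = sum((b_full - b) ** 2 for b in b_sub)
    s_b = math.sqrt(ss * (m - 1) / m)  # assumed jackknife scaling
    return b_full, s_b, b_full / s_b

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
b, s_b, t_value = jackknife_t(x, y)
```

A large absolute t-value indicates a parameter that is stable across the cross validation segments.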

Multiple Linear Regression Test Matrix Setup

Use the Matrix drop-down list to select the test set, or define it using the Rows and Column

selector drop-down lists to define a test set within a selected matrix for both X and Y.

In The Unscrambler® X all results from the modeling are stored to have the maximum

flexibility in plotting any result matrix in any way to make the right decision regarding

outliers, interpretation of the model etc. However, as the size of data matrices becomes

large, the residual matrices use a lot of available memory and disk space, and the

Unscrambler project can become large and sometimes unmanageable. To enable the

user to reduce the size of models, there is an option for PCA, PCR and PLSR to discard

residuals.

By discarding residuals, the matrices

X-Residuals

X-Validated Residuals

Y-Residuals

Y-Validated Residuals

are removed from the Validation folder in the analysis. These are 3-Dimensional matrices

and use up a lot of memory. As an indication of the reduced size when enabling Discard

Residuals, a PLS regression model with 400 samples, 100 x-variables, 1 y-variable and 10

factors will only take up 10% of the full model size. As the number of samples, X- and

Y-variables and factors increases, the reduced-size model will be even smaller as a percentage of

the full model.

Note: When the residuals are discarded, some of the plot options will not be available. All

plots where the data are taken from the X-Residuals or Y-Residuals matrices will not be

listed in the plot menus. The Plot - Residuals submenu then only allows Residuals and

Influence (with Q-residuals), and under Plot - Residuals - General only the Influence Plot and

Variance per Sample plots are available.

Plots available in the Residuals menu when Discard Residuals is selected

First, one should display the PCA or regression results as plots in the Viewer. When the

results plots have been opened in the Viewer one can access the Plot and the View menus

to select the various results to plot and interpret. Alternatively, the plots can be selected

from the Plots folder in the model node in the project navigator.

For more on these plots see the following sections:

Interpreting PLS regression plots

Results - PCA

Display the PCA Overview results. From here additional results plots can be accessed

from the menu.

Results - Regression

Display the PLSR Overview results. From here additional results plots can be

accessed from the menu.

Results - All

Display results for any analysis.

Plot - Variances and RMSEP

Plot variance curves and estimated Prediction Error (PCA, PCR, PLSR).

Plot - Predicted vs. Reference

Display plot of predicted Y values against actual Y values.

Plot Statistics

Display statistics (including RMSEP) on Predicted vs. Reference plot by using the

toolbar short cut.

Plot - Residuals

Display various types of residual plots.

Validation

Toggle Validation results on/off on current plot.

Calibration

Toggle Calibration results on/off on current plot.

Outlier Warnings

Display general warnings issued during the analysis – among others related to

validation. The Outlier Warnings are in the project navigator under the analysis

node.

First, one should display the PCA or regression results. When the results plots have been

opened in the Viewer one can access the Plot and the View menus to select the various

results to plot and interpret. Alternatively, the plots can be selected from the Plots folder in

the model node in the project navigator.

See tutorial M for a guide to uncertainty plots; variable selection and model stability.

Hotelling’s T² Ellipse

Display Hotelling’s T² ellipse on a scores plot using the toolbar short cut.

Uncertainty Test - Stability Plot

Display stability plot for scores or loadings using the toolbar short cut.

Plot - Important Variables

Display uncertainty limits on regression coefficients plot.

Correlation Loadings

Change a loadings plot to display correlation loadings by using the toolbar short cut.

Full

Also known as Leave One Out (LOO) cross validation, this produces as many

calibration submodels as there are samples in the data set.

Random

One can choose the Number of segments a data set is to be divided up into and the

cross validation procedure randomly selects the number of samples to take, as

defined in the Samples per segment drop-down list. The number of segments may

be adjusted, depending on the size of the sample set and the number of samples to

take per segment.

Custom

Allows the user to manually choose the Number of Segments and to define the

samples for each segment by manual entry or by using the Select button. The Select

button takes one to the Define Range dialog box.

Systematic

The Unscrambler® provides two options for systematic sample selection.

Systematic (112233)

Allows the user to define the Number of segments and the Samples per segment. In

this case, the first N samples are removed for segment 1 and successively replaced

for the number of segments defined. This is particularly useful when replicate

measures exist and are ordered together in the data matrix, allowing one to see the

impact of removing a complete replicate from a data set.

Systematic (123123)

Allows the user to look at the impact of removing a single replicate from a group of

replicate measures to assess the precision of the developed model.

Category variable

Allows for model cross validation by removing samples belonging to defined

categories as a group. This is useful for evaluating how robust the model is across

season, raw material supplier, location, operator etc.
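
The segment layouts described above can be sketched as lists of sample indices, one list per segment. The function names are hypothetical and indices are zero-based; this only illustrates which samples are left out together, not how the software builds the sub-models.

```python
import random

def full_cv(n_samples):
    """Leave one out: one segment per sample."""
    return [[i] for i in range(n_samples)]

def systematic_112233(n_samples, n_segments):
    """Consecutive samples share a segment (replicates stored together)."""
    segments = [[] for _ in range(n_segments)]
    for i in range(n_samples):
        segments[i * n_segments // n_samples].append(i)
    return segments

def systematic_123123(n_samples, n_segments):
    """Every n_segments-th sample shares a segment."""
    segments = [[] for _ in range(n_segments)]
    for i in range(n_samples):
        segments[i % n_segments].append(i)
    return segments

def random_cv(n_samples, n_segments, seed=0):
    """Shuffle the samples, then deal them into segments."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[k::n_segments] for k in range(n_segments)]

def category_cv(categories):
    """All samples with the same category level are left out together."""
    segments = {}
    for i, level in enumerate(categories):
        segments.setdefault(level, []).append(i)
    return list(segments.values())
```

For six samples and three segments, `systematic_112233` gives `[[0, 1], [2, 3], [4, 5]]`, whereas `systematic_123123` gives `[[0, 3], [1, 4], [2, 5]]`.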

10. Transform

10.1. Transformations

This section covers transformations available in The Unscrambler®. Transformation (or what

is often referred to as preprocessing) is applied to data to reduce or remove effects in data

which do not carry relevant information for the modeling of the system. Transformations

can reduce the complexity of a model (fewer factors needed) and improve the

interpretability of the data and models. Transformations include the application of

derivatives to spectral data to reduce baseline offset and tilt effects, while accentuating

small spectral differences. Scattering corrections are often used as transformations to

diffuse reflectance spectra to reduce differences such as light scatter and path length. These

transforms can only be performed on numerical data. Some of them cannot be performed

when there are missing data (e.g. the Norris-Gap derivative).

The Unscrambler® provides the following transformations:

Baseline correction

Center_and_scale

Compute general

COW

Deresolve

Derivatives

Detrending

MSC/EMSC

Interaction & Square Effects

Missing_value_imputation

Noise

Normalize

OSC

Quantile_Normalize

Reduce and average

Smoothing

Spectroscopic transformations

SNV

Transpose

Weights

Interpolation

More details regarding transformation methods available in The Unscrambler® are given in

the Method References.

10.2.1 Baseline correction

Baseline corrections are used to adjust the spectral offset by either adjusting the data to the

minimum point in the data, or by making a linear correction based on two user-defined

variables.

How it works

How to use it

Baseline corrections are used to adjust the spectral offset by either adjusting the data to the

minimum point in the data, or by making a linear correction based on two user-defined

variables. Baseline offset and Linear baseline correction are transformations used to correct

the baseline of samples, and are set in the dialog Tasks - Transform - Baseline. They are

mostly used for spectroscopic purposes. The two transformations can be executed

separately or together. In the combined case the Linear baseline correction will be run first,

then the Baseline offset.

Baseline offset

The formula for the baseline offset correction can be written as follows:

$f(x) = x - \min(X)$

where x is a variable and X denotes all selected variables for this sample.

For each sample, the value of the lowest point in the spectrum is subtracted from all the

variables. The result of this is that the minimum value is set as 0 and the rest are positive

values. To use this consistently for a set of samples, make sure that the lowest point pertains

to the same variable for all samples.
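
A minimal sketch of the baseline offset for one sample, with the spectrum held in a plain Python list (the function name `baseline_offset` is hypothetical):

```python
def baseline_offset(spectrum):
    """Subtract the lowest value of the spectrum from every variable."""
    lowest = min(spectrum)
    return [value - lowest for value in spectrum]

corrected = baseline_offset([2.0, 3.5, 2.5])  # minimum becomes 0
```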

Linear baseline correction

This transformation transforms a sloped baseline into a horizontal baseline. The technique is

to point out two variables which should define the new baseline. These are both defined as

0, and the rest of the variables are transformed according to this with linear

interpolation/extrapolation. It is important to take precautions not to select basis variables

that have spectroscopic bands. As for the offset correction, make sure that the lowest points

pertain to the same variables for all samples.
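
The linear correction can be sketched as follows. The sketch assumes evenly spaced variables, so the variable index is used as the x axis; the function name and argument names are hypothetical.

```python
def linear_baseline(spectrum, first, last):
    """Subtract the straight line through two baseline variables.

    first and last are zero-based indices of the two baseline variables;
    both end up at 0 and all other variables are adjusted by linear
    interpolation or extrapolation along that line.
    """
    if first == last:
        raise ValueError("the two baseline variables must be different")
    y1, y2 = spectrum[first], spectrum[last]
    slope = (y2 - y1) / (last - first)
    return [value - (y1 + slope * (k - first))
            for k, value in enumerate(spectrum)]

# A peak at index 1 sitting on a baseline that rises from 1.0 to 3.0
flat = linear_baseline([1.0, 3.0, 2.0, 2.5, 3.0], 0, 4)
```

After the correction only the peak remains above a horizontal baseline at zero.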

Baseline offset and Linear baseline correction are transformations used to correct the

baseline of samples, and are set in the dialog Tasks - Transform - Baseline. They are mostly

used for spectroscopic purposes. The two transformations can be executed separately or

together, but at least one transformation method must be selected. In the combined case

the Linear baseline correction will be run first, then the Baseline offset.

Baseline correction cannot be carried out with non-numeric data, but can proceed if there

are missing values in the data.

Baseline dialog

Begin by defining the data matrix from the drop-down list. This transform can also be

performed on a results matrix, which may be selected by clicking on the select result matrix

button. For the matrix, the rows and columns to be included in the computation

are then selected. If new data ranges need to be defined, choose Define to open the Define

Range dialog where new ranges can be defined. This transform requires that only numerical

data be chosen.

After the range has been selected, select the method of the baseline transformation. A

method must be selected in order to carry out the transform. If Linear baseline correction is

selected, the two variables which define the new baseline must also be defined (Baseline

end variables). The first and last variables are selected by default. The first and last values

must be different for the transform to be performed. By checking the Preview result box, one

can see how the data will look once the baseline transformation has been applied.

When the baseline transformation is completed, a new matrix is created in the project with

the word Baseline appended to the original matrix name. This name may be changed by

selecting the matrix, right clicking and selecting Rename from the menu.

Method options

Choose between two baseline transforms:

Baseline offset

Subtract the value of the lowest point in the spectrum from all the

variables.

Linear baseline correction

Transform a sloped baseline into a horizontal baseline.

Do not select basis variables that have spectroscopic bands.

For the offset correction in both methods, make sure that the lowest points pertain to the

same variables for all samples.

10.3.1 Center_and_scale

Centering is often the first stage of multivariate modeling. It involves subtracting an average

value from each variable in order to investigate the variation around the average rather than

the absolute values of the observations. Depending on the data and the problem at hand,

other values than the mean may also be subtracted.

Scaling involves division of each variable by its estimated spread, using either the standard

deviation or other measures of variability. Scaling is particularly important if the variables

differ a lot in their relative magnitudes, as variables with larger variance are given more

influence in regression analysis.

How it works

How to use it

Centering using the average value, also called mean centering, ensures that the resulting

data or model may be interpreted in terms of variation around the mean. This is often the

preferred pre-processing method, as it focuses on differences between observations rather

than their absolute values. As a robust alternative to the mean, the median may be used

instead. The median will more likely put the origin in the ‘center of mass’ in cases where

some of the variables may be distributed non-symmetrically.

In some situations, for instance for chromatographic concentrations, it may not make sense

to use negative values at all. Subtraction of the minimum value will ensure non-negativity

for all variables.

The alternative to data centering is to keep the raw data origin for all variables. This is only

advisable in the special case of a regression model where it is known in advance that the

linear relationship between X and Y is expected to pass through zero. In The Unscrambler®

one may apply mean, median, or minimum centering as a pre-processing step, or choose not to center

the data.
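
The centering options can be sketched for a single column (variable) as follows. The function name `center` and the option strings are hypothetical.

```python
import statistics

def center(column, method="mean"):
    """Subtract the chosen reference value from every observation."""
    if method == "mean":
        reference = sum(column) / len(column)
    elif method == "median":
        reference = statistics.median(column)
    elif method == "minimum":
        reference = min(column)
    elif method == "none":
        reference = 0.0
    else:
        raise ValueError("unknown centering method: " + method)
    return [value - reference for value in column]

column = [1.0, 2.0, 9.0]  # skewed: the mean (4.0) differs from the median (2.0)
```

With minimum centering all values stay non-negative, while median centering puts the origin at the middle observation of this skewed column.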

Scaling involves dividing the (centered) variables by individual measures of dispersion. Using

the Standard Deviation as the scaling factor sets the variance for each variable to one, and is

usually applied after mean centering. Other scaling options available in The Unscrambler®

are Interquartile Range (IQR), Range, and Scaled Median Absolute Deviation (MAD). All these

are non-parametric methods and are often used in combination with median centering.

The range is the difference between the highest and lowest observation for each variable.

Such scaling results in a range of one for all variables. The presence of outliers in the data

will heavily influence this transformation, however. A safer alternative would be to use the

IQR, which is the difference between the observations at the 25th and 75th percentiles.

(There are several different ways of calculating the IQR, and The Unscrambler® utilizes the

‘Type 7’ algorithm of Hyndman and Fan, 1996.) As extreme observations are not included in

the IQR estimate, it is less likely to be affected by outliers.

The MAD is defined as the median of absolute differences between each observation in the

column and the median observation. This measure of population spread is little affected by

the tail behaviour of the distribution. For instance if a histogram of the data reveals a ‘wide’

peak where many observations fall in the tails, the standard deviation will be grossly inflated

while the MAD will remain a good estimate for the population’s spread. The MAD will

similarly be more robust for data with sharp peaks and long tails. The Scaled MAD is the

MAD multiplied with the factor 1.4826. This makes the estimate similar to the standard

deviation when many observations are collected from a normal distribution.
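
The four spread estimates can be compared on a small column containing one outlier. This is a sketch with hypothetical function names; the quantile follows the Type 7 definition (linear interpolation at position (n - 1)p), and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
import math
import statistics

def quantile_type7(values, p):
    """Type 7 quantile: linear interpolation at position (n - 1) * p."""
    data = sorted(values)
    h = (len(data) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (h - lo) * (data[hi] - data[lo])

def spread(column, method):
    """Return the scaling factor for one column."""
    if method == "sdev":
        return statistics.stdev(column)  # assumed n - 1 denominator
    if method == "range":
        return max(column) - min(column)
    if method == "iqr":
        return quantile_type7(column, 0.75) - quantile_type7(column, 0.25)
    if method == "mad":
        med = statistics.median(column)
        deviations = [abs(value - med) for value in column]
        return 1.4826 * statistics.median(deviations)  # Scaled MAD
    raise ValueError("unknown scaling method: " + method)

column = [1.0, 2.0, 3.0, 4.0, 100.0]  # the last value is an outlier
```

For this column the range is 99 and the standard deviation exceeds 40, whereas the IQR (2.0) and the Scaled MAD (1.4826) are hardly affected by the outlier.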

Centering and/or scaling data may be useful to study the data in various plots, or prior to

running Tasks – Analyze – Descriptive Statistics. It may for example allow one to compare

the distributions of variables of different scales within one plot. In subsequent analysis,

these scaled variables will contribute similarly to the model regardless of measurement unit.

These transformations are all column-oriented: the transformed values are computed as a

function of the values in the same column of the data table.

Notes:

1. Mean centering is included as a default option in the relevant analysis dialogs, and the computations are done as a first stage of the analysis. Scaling using the standard deviation may be applied in the Weights tabs of most analysis dialogs.

2. Centering and scaling are also available as a transformation to be performed manually from the Editor (Tasks – Transform – Center_and_scale). Use this dialog to perform one of the available non-parametric centering and scaling options.

A special type of standardization is the Spherize function (Martinez and Martinez, 2005). It is

the multivariate equivalent of the univariate scaling methods described above. The

transformed variables have a p-dimensional mean of 0 and a covariance matrix given by the

identity matrix. It is also known in some application domains as the whitening

transformation since the resulting matrix has the signal properties of “white noise”.
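
One common way of writing a sphering transformation is given below as a sketch. The symbols are introduced here for illustration and are not taken from the software: X is the n by p data matrix, x̄ the vector of column means, S the sample covariance matrix, and Q and Λ its eigenvectors and eigenvalues. Whether The Unscrambler® uses exactly this rotation is not stated in this section.

```latex
% Sphering sketch: S = Q \Lambda Q^{T} is the eigendecomposition of the
% sample covariance matrix of X (n samples by p variables).
\[
  Z = \left(X - \mathbf{1}\,\bar{x}^{T}\right) Q\,\Lambda^{-1/2},
  \qquad
  \operatorname{cov}(Z) = \Lambda^{-1/2} Q^{T} S\, Q\, \Lambda^{-1/2} = I_{p}
\]
```

The second expression shows why the transformed variables have the identity matrix as covariance: substituting S = QΛQᵀ leaves only Λ^{-1/2} Λ Λ^{-1/2}.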

More details regarding center and scale methods are given in the Method References.

Centering and/or scaling of data may be useful to study the data in various plots, or prior to

running Tasks - Analyze – Descriptive Statistics. Centering and scaling are widely applied in

order to transform the data to comparable levels and scale units prior to analysis.

These transformations are column-oriented: the transformed values are computed as a

function of the values in the same column of the table. They cannot be applied to non-

numeric data.

Center and Scale

Begin by defining the data matrix from the drop-down list. This transform can also be

performed on a results matrix, which may be selected by clicking on the select result matrix

button. The rows and columns to be included in the computation must be

specified as well. If new data ranges need to be defined, choose Define to open the Define

Range dialog where new ranges can be defined.

In the Transformation frame, three options are available:

Center

within the selected sample and variable scope. This subtracts a value, e.g. the

variable mean, from each observation in each column. There is an option to center

by the mean, median, or minimum value, or not use any centering. Choose the

desired option for centering from the Center drop-down list.

Dialog showing centering options

Scale

within the selected sample and variable scope. This divides each data value by an

estimate of the column spread. Options available are the Standard deviation

(SDev), Interquartile range (IQR), Range, or Scaled median absolute deviation (MAD)

scaling, or not to use any scaling. Choose the desired option for scaling from the

Scale drop-down list as shown below.

Dialog showing scaling options

Spherize

This is a multivariate equivalent of univariate center and scaling, useful in

exploratory data analysis.

The Center and Scaling options can be selected either separately or in combination. Often

mean centering is combined with SDev scaling (autoscaling). Due to their non-parametric

nature, the Range, IQR, or Scaled MAD transformation is often used after median centering.

The type of centering and scaling is selected from the drop-down list.

By checking the Preview result box, a line plot of the observations before and after scaling is

displayed.

Notes:

1. To display the mean and standard deviation of the variables in a data set, use menu option Tasks – Analyze – Descriptive Statistics.

2. The Center and Scale transformations are supported in autopretreatments, meaning they can be automatically applied when new data are analysed (classification, prediction and sample projection analyses), using a model which was developed with this transformation applied. See next note.

3. The principal component analysis (PCA) and Regression dialog boxes include options for centering and scaling variables directly at the analysis stage. It is recommended to perform centering and scaling at the model-building stage, especially if the model will be used for future prediction or classification. The same centering and scaling options will be applied as when the model was built.

4. Centering and/or scaling the data more than once will not affect the structure of the data any further. Consequently, if the Center and Scale transformation has been applied to the data from the Tasks – Transform – Center and Scale dialog, the data may harmlessly be recentered and/or rescaled at the modeling stage (PCA or regression).
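
The last note can be checked numerically for mean centering combined with standard deviation scaling. The sketch below uses a hypothetical helper `autoscale` and assumes the sample standard deviation.

```python
import statistics

def autoscale(column):
    """Mean-center a column and divide by its standard deviation."""
    mean = sum(column) / len(column)
    sdev = statistics.stdev(column)
    return [(value - mean) / sdev for value in column]

once = autoscale([2.0, 4.0, 6.0, 8.0])
twice = autoscale(once)  # the column already has mean 0 and SDev 1
```

Apart from rounding error, `twice` is identical to `once`, because the second pass subtracts a mean of 0 and divides by a standard deviation of 1.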

10.4.1 Compute general

The transform Compute_General can be used to make general mathematical

transformations to samples and/or variables.

How it works

How to use it

One can use the transform Compute_General to make computations on selected samples,

variables or a matrix range using basic elementary and trigonometric functions.

Additional functions for computation on the entire data matrix are available with the Matrix

calculator: Tools - Matrix Calculator… has options for linear algebra, matrix operations and

reshaping of data.

This opens the Compute dialog, where one can perform arithmetic and more advanced

computations on the whole data matrix or on selected rows (samples) or columns

(variables). This option also helps in transforming variables. Computations cannot be

performed with non-numeric data.

Compute_General

Begin by defining the data matrix from the drop-down list. This transform can also be

performed on a results matrix, which may be selected by clicking on the select result matrix

button. For the matrix, the rows and columns to be included in the computation

are then selected.

If new data ranges need to be defined, choose Define to open the Define Range dialog

where new ranges can be defined. One must also define if the selection is for the variables

or samples.

There are three ways of defining the mathematical expression to be applied:

Use the drop-down list, which provides the most recently used expressions (if this is

the first time using the Compute_General dialog, no formerly used expressions will

show in the drop-down list).

Click on the Build Expression button. This opens the Build Expression dialog wherein

a mathematical expression can be defined using the ready-made functions and

operators allowed in The Unscrambler®.

Syntax

The Expression field accepts a formula of the type: X=LN(ABS(X))-e or S4=(S1*S2)+S3 or

V1=V1/2+SIN(V8/V9) where S stands for sample, V stands for variable, and the number is

the sample or variable number in the Editor. To build general expressions that are not

related to a particular sample or variable, use X. X stands for the whole matrix defined by the

variable and sample set chosen in Scope. RH and CH are row and column headers,

respectively.

Note: The formula cannot contain mixed references to samples (S), variables (V)

and X.
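
To make the sample syntax concrete, the following sketch shows what an expression such as S4=(S1*S2)+S3 computes, with the matrix held as a list of rows. The function name is hypothetical, and the code only mimics the element-by-element arithmetic; it does not parse expressions.

```python
def apply_s4_expression(matrix):
    """Evaluate S4 = (S1*S2) + S3 element by element.

    Samples are rows; S1 is the first row in the Editor (index 0 here).
    The input matrix is left unchanged and a modified copy is returned.
    """
    s1, s2, s3 = matrix[0], matrix[1], matrix[2]
    result = [row[:] for row in matrix]
    result[3] = [a * b + c for a, b, c in zip(s1, s2, s3)]
    return result

data = [[1.0, 2.0],
        [3.0, 4.0],
        [5.0, 6.0],
        [0.0, 0.0]]
computed = apply_s4_expression(data)  # fourth row becomes [8.0, 14.0]
```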

The constants, operators, and functions that are allowed in computations are listed below:

Table: Operators, functions and constants allowed in computations

Name      Description
+         Addition
-         Subtraction
*         Multiplication
/         Division
=         Equals to
(         Left Parenthesis
)         Right Parenthesis
EXP(X)    Exponential(X) = e^X
COS(X)    Cosine
SIN(X)    Sine
TAN(X)    Tangent
PI        3.14
e         2.718

"X" can denote both samples and variables in this table.

Function names are case insensitive, meaning that log, Log, and LOG will give the same

result. In the above functions a comma is used as the list separator; however, this depends on

the regional settings of the computer. Different list separators may be valid for different

countries, e.g. POW(X;n).

Notes: A commonly used expression is X=log(X). This expression generally

transforms skewed variable distributions into more symmetrical ones. Use a

histogram plot or Tasks – Analyze – Descriptive Statistics… in order to check

whether the skewness was improved or deteriorated after applying the

transformation.
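
The effect of a log transformation on skewness can be checked with a small sketch. The `skewness` helper is hypothetical and uses the simple moment-based definition; the base-10 logarithm is an assumption, since the base used by the log function is not stated here (the base does not change the skewness).

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

raw = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]  # strongly right-skewed
logged = [math.log10(v) for v in raw]         # evenly spaced after the log

skew_before = skewness(raw)
skew_after = skewness(logged)
```

Here the skewness is clearly positive before the transformation and essentially zero afterwards, which is the kind of improvement one would look for in the histogram or the descriptive statistics.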

In the Expression Builder dialog a mathematical expression can be built using the ready-

made functions and operators allowed in The Unscrambler®.

Expression Builder

The upper text field shows the expression as it is being built. In Display, choose whether the

text field should show the sample/variable Numbers or the sample/variable Names. In the

Insert field, choose to insert specific samples, specific variables or (general expression). After

choosing the Sample or the Variable options, the drop-down list is enabled and one can

select the relevant object(s) from the list. The available samples or variables are only those

belonging to the Scope formerly selected in the Compute dialog.

The Arithmetic Functions, Trigonometric Functions, Other Functions, and Numbers fields

offer buttons that are used following the same principle as for a calculator.

Click Clear to clear the expression. Click Undo to undo the latest insertion in the expression

text. Click OK to return to the Compute_General dialog.

10.5. COW

10.5.1 Correlation Optimized Warping (COW)

COW is a method for aligning data where the signals exhibit shifts in their position along the

x axis. COW cannot be performed with non-numeric data, or when there are missing data.

How it works

How to use it

COW is a method for aligning data where the signals exhibit shifts in their position along the

x axis. COW can be used to eliminate shift-related artifacts in measurement data by

correcting a sample vector to a reference. COW has applicability to data where there can be

a poor alignment of the x axis from sample to sample, as can be the case with

chromatographic data, Raman spectra and NMR spectra. One example of such data is

chromatography where peak positions change between samples due to changes in mobile

phase or deterioration of the column. Another example is in NMR spectroscopy where

matrix effects and the chemistry itself induce position changes in the chemical shifts.

The method works by finding the optimal correlation between defined segments of the data

for which there is a shift in position. The result of this procedure is one shift value per

segment. These are then interpolated to give a so-called shift-vector for all data points, and

a mapping function (move-back operator) which moves the samples back to the reference

profile’s position. The present implementation handles data of similar length only. To cope

with various lengths, it is suggested to pad the data table out with zeros before performing

the shift alignment. Alignment is done by allowing small changes in the segment length on

the sample vector, and those segment lengths being shifted (“warped”) to optimize the

correlation between the sample and the reference vector. Slack refers to the maximum

increase or decrease in sample segment length, and provides flexibility in optimizing the

correlation between the samples and reference.
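
The role of the slack can be illustrated for a single segment. The sketch below is a deliberate simplification: it warps only the first segment of a sample by linear interpolation and picks the length change with the highest correlation, whereas COW optimizes all segment boundaries together. All function names are hypothetical.

```python
import math

def warp_segment(segment, new_length):
    """Stretch or compress a segment to new_length points (both >= 2)."""
    scale = (len(segment) - 1) / (new_length - 1)
    warped = []
    for k in range(new_length):
        pos = k * scale
        lo = min(int(pos), len(segment) - 2)
        warped.append(segment[lo] + (pos - lo) * (segment[lo + 1] - segment[lo]))
    return warped

def correlation(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    saa = sum((v - ma) ** 2 for v in a)
    sbb = sum((v - mb) ** 2 for v in b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    if saa == 0 or sbb == 0:
        return 0.0
    return sab / math.sqrt(saa * sbb)

def best_length_change(sample, reference, seg_len, slack):
    """Try every change in sample segment length within +/- slack."""
    target = reference[:seg_len]
    best_delta, best_corr = 0, -2.0
    for delta in range(-slack, slack + 1):
        length = seg_len + delta
        if length < 2 or length > len(sample):
            continue
        corr = correlation(warp_segment(sample[:length], seg_len), target)
        if corr > best_corr:
            best_delta, best_corr = delta, corr
    return best_delta, best_corr

reference = [0.0, 1.0, 2.0, 3.0, 4.0, 0.0, 0.0]
sample = [0.0, 2.0, 4.0, 9.0, 9.0, 9.0, 9.0]  # same ramp, compressed
delta, corr = best_length_change(sample, reference, seg_len=5, slack=2)
```

In this example the best match is found by shortening the sample segment by two points and stretching it back to the reference segment length.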

The reference sample is the sample in the data which is used as the reference, and should be

a representative sample with the main peaks present.

Segment length is defined by the user, and is the size of the data segment that data are

divided into before searching for the optimal correlation. It must be smaller than the

number of variables divided by 4.

The slack is the flexibility in adjusting the segment size to give the optimal fit to the

reference data, and is the allowed change in position to be searched for. The slack must be less than or equal to the segment length.

The figures below illustrate the result of applying the COW preprocessing to chromatograms.

Raw chromatograms

Correlation Optimized Warping (COW) is a row-oriented transformation for aligning data

where the signals exhibit shifts in their position along the x axis. This can be applicable to

data sets where there may be differences due to alignment differences that arise from the

measurement (such as in chromatography retention times, chemical shifts in NMR data, and

Raman spectral x axis alignment).

COW cannot be performed with non-numeric data, or when there are missing data. The

minimum number of variables required to use COW is 20.

COW Dialog

Begin by defining the data matrix from the drop-down list. This transform can also be

performed on a results matrix, which may be selected by clicking on the select result matrix

button. For the matrix, the rows and columns to be included in the computation

are then selected. If new data ranges need to be defined, choose Define to open the Define

Range dialog where new ranges can be defined.

Three inputs must be specified in the dialog:

Reference Sample: Select which sample in the data table is to act as the reference

profile.

This is a typical sample (e.g. near the origin in a scores plot) with preferably the main

peaks present. If the COW will be applied to new data at some later point of time,

include the reference sample in a new data table as well.

Segment Size. This is the length of the segment which the data are divided into

before searching for the optimal correlation. It must be smaller than the number of

variables divided by 4.

Slack: The allowed change in position to be searched for. It must be less than or

equal to the Segment Size.
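
The limits stated in this section (at least 20 variables, a segment size smaller than a quarter of the number of variables, and a slack no larger than the segment size) can be collected in a small pre-check. The function name and messages are hypothetical; the software performs its own validation.

```python
def check_cow_parameters(n_variables, segment_size, slack):
    """Return a list of violated constraints (empty if all are met)."""
    problems = []
    if n_variables < 20:
        problems.append("COW requires at least 20 variables")
    if segment_size >= n_variables / 4:
        problems.append("segment size must be smaller than n_variables / 4")
    if slack > segment_size:
        problems.append("slack must not exceed the segment size")
    return problems

ok = check_cow_parameters(n_variables=100, segment_size=20, slack=5)
too_long = check_cow_parameters(n_variables=100, segment_size=25, slack=5)
```

With 100 variables a segment size of 20 is accepted, while 25 is rejected because it is not strictly smaller than 100 / 4.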

By selecting the preview result, one can see how the transformed data will look.

COW dialog with preview

When the COW transformation is completed, a new matrix is created in the project with the

word COW appended to the original matrix name. This name may be changed by selecting

the matrix, right clicking and selecting Rename from the menu.

10.6. Deresolve

10.6.1 Deresolve

The Deresolve function can be used to change the apparent resolution of an instrument,

changing a high resolution spectrum to low resolution. It may also be used for noise

reduction.

How it works

How to use it

On occasion, one may wish to standardize a lower resolution instrument to a higher

resolution instrument. This may be the case when transferring data from one instrument to

another with the intention of calibration model transfer. In such an instance, it may be more

effective to mathematically lower the resolution of the higher resolution instrument prior to

forming the transfer model. The Deresolve function can be used to change the apparent

resolution of an instrument, changing a high resolution spectrum to a lower resolution by

downsampling the signal. Deresolve may also be used for noise reduction.

Deresolve convolves the spectra with a triangular resolution function (a triangle kernel

smoothing filter) so that they appear as if they had been taken on a lower resolution instrument.

The inputs are the high resolution spectra to be deresolved and the number of channels to

convolve them over. The output is the estimate of the lower resolution spectra with the

original number of variables maintained.
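
As an illustration of the idea, the sketch below smooths each spectrum with a normalized triangular kernel in Python with NumPy. It is not the implementation used by The Unscrambler®: the kernel width (2 × channels − 1 points), the edge handling and the function names are assumptions made for this example. The channel limits it checks are the ones given for the Deresolve dialog.

```python
import numpy as np


def triangle_kernel(channels: int) -> np.ndarray:
    """Symmetric triangular weights, e.g. [1, 2, 1] / 4 for channels = 2."""
    ramp = np.arange(1, channels + 1, dtype=float)
    kernel = np.concatenate([ramp, ramp[-2::-1]])
    return kernel / kernel.sum()


def deresolve(spectra, channels: int) -> np.ndarray:
    """Convolve every row (one spectrum per row) with a triangular kernel."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    n_vars = spectra.shape[1]
    if not 2 <= channels <= n_vars // 2:
        raise ValueError("channels must lie between 2 and #variables/2")
    kernel = triangle_kernel(channels)
    # Fraction of the kernel that overlaps the data at each position.
    # Dividing by it keeps the first and last points from being pulled
    # towards zero (an assumed edge treatment).
    coverage = np.convolve(np.ones(n_vars), kernel, mode="same")
    smoothed = np.array([np.convolve(row, kernel, mode="same") for row in spectra])
    return smoothed / coverage


# A single sharp spike is spread over neighbouring channels,
# while the number of variables stays the same.
spike = np.zeros((1, 21))
spike[0, 10] = 1.0
print(deresolve(spike, channels=3).round(3))
```

A larger number of channels gives a wider kernel and therefore a lower apparent resolution and stronger noise reduction.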

More details regarding the Deresolve method are given in the Method References.

Deresolve is a row-oriented transformation; that is to say the contents of a cell are likely

to be influenced by its horizontal neighbors. This transformation cannot be applied to

non-numeric data.

A new data matrix with the deresolved data will be created in the project where the original

data matrix resides.

Deresolve

Begin by defining the data matrix from the drop-down list. This transform can also be

performed on a results matrix, which may be selected by clicking on the select result matrix

button. For the matrix, the rows and columns to be included in the computation

are then selected. If new data ranges need to be defined, choose Define to open the Define

Range dialog where new ranges can be defined. There must be at least 4 variables to

perform the deresolve transformation.

In the Parameters field, choose the number of channels to use for convolution. The

minimum number of channels that can be used is 2, and the maximum is (#variables/2).

By selecting the preview result, one can see how the transformed data will look.

When the deresolve transformation is completed, a new matrix is created in the project with

the word Deresolve appended to the original matrix name. This name may be changed by

selecting the matrix, right clicking and selecting Rename from the menu.

10.7. Derivatives

10.7.1 Derivatives

Differentiation, i.e. computing derivatives of various orders, is a classical technique widely

used for spectroscopic applications. Some of the information “hidden” in a spectrum may be

more easily revealed when working on a first or second derivative. It is a row-oriented

transformation; that is to say the contents of a cell are likely to be influenced by its

horizontal neighbors.

Derivatives cannot be performed with non-numeric data or where there are missing data.

Like smoothing, this transformation is relevant for variables which are themselves a function

of some underlying variable, e.g. absorbance at various wavelengths. Computing a derivative

is also called differentiation. Derivatives can help to resolve overlapped bands, but also lead

to a lower signal in the transformed data.

The segment parameter of Gap-Segment derivatives is an interval over which data values are

averaged.

In smoothing, X-values are averaged over one segment symmetrically surrounding a data

point. The raw value on this point is replaced by the average over the segment, thus creating

a smoothing effect.

In Gap-Segment derivatives (designed by Karl Norris), X-values are averaged separately over

one segment on each side of the data point. The two segments are separated by a gap. The

raw value on this point is replaced by the difference of the two averages, thus creating an

estimate of the derivative on this point.
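
The description above can be sketched in Python with NumPy as follows. This is an illustration under stated assumptions, not The Unscrambler® algorithm: the gap is taken to be an odd number of points centred on the data point, the segment is the number of points averaged on each side, and points too close to either end are returned as NaN. The function name is hypothetical.

```python
import numpy as np


def gap_segment_derivative(x, gap: int, segment: int) -> np.ndarray:
    """Difference of two segment averages separated by a gap (first order)."""
    if gap < 1 or gap % 2 == 0:
        raise ValueError("gap is assumed to be an odd number of points")
    if segment < 1:
        raise ValueError("segment must be at least 1")
    x = np.asarray(x, dtype=float)
    half_gap = gap // 2
    reach = half_gap + segment  # points needed on each side of the centre
    out = np.full(x.shape, np.nan)
    for i in range(reach, x.size - reach):
        left = x[i - reach : i - half_gap]            # `segment` points
        right = x[i + half_gap + 1 : i + reach + 1]   # `segment` points
        out[i] = right.mean() - left.mean()
    return out


# On a straight line the result is the slope multiplied by the distance
# between the two segment centres, which is gap + segment points.
line = 2.0 * np.arange(20) + 5.0
print(gap_segment_derivative(line, gap=3, segment=3))
```

With segment = 1 no averaging takes place and the result is a plain difference across the gap.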

The Unscrambler® offers three methods for computing derivatives, as described in the

following sections:

Gap Derivative

Gap-Segment

Savitzky-Golay

Derivatives are applied to correct for baseline effects in spectra for the purpose of removing

nonchemical effects and creating robust calibration models. Derivatives may also aid in

resolving overlapped bands which can provide a better understanding of the data,

emphasizing small spectral variations not evident in the raw data.

The first derivative

The first derivative of a spectrum is simply a measure of the slope of the spectral curve at

every point. The slope of the curve is not affected by purely additive baseline offsets in the

spectrum, and thus the first derivative is a very effective method for removing such offsets.

However, peaks in raw spectra usually become zero-crossing points in first derivative

spectra, which can be difficult to interpret.

Example:

To illustrate how derivatives work, Gaussian curves of various offsets and intensities are

used to demonstrate the principles. These curves are shown below.

Gaussian curves of various offsets and intensities

Mathematically, a derivative is the slope of the curve. If a purely additive offset (as in the

curves above) is present, it is a constant. Under differentiation, the constant

reduces to zero, meaning that all spectra share a baseline of zero and the spectral

profiles are replaced by the slopes of the curves.

The next figure displays the first order derivative for the Gaussian curves.

First derivative of Gaussian curves

The peak maximum in the raw data has now become a zero point in the derivative.

The zero point can be explained by the fact that at a peak maximum (or minimum), the derivative is

zero.
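
The same behavior can be reproduced numerically. The sketch below is an illustration in Python with NumPy, with made-up intensities and offsets; it uses a plain finite-difference slope (np.gradient) rather than any of the derivative methods offered in The Unscrambler®.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 201)      # arbitrary axis, peak centred at x = 0
gaussian = np.exp(-0.5 * x**2)

# Same peak shape with different intensities and additive offsets
# (illustrative values only).
curves = {
    "low": 0.5 * gaussian + 0.1,
    "mid": 1.0 * gaussian + 0.3,
    "high": 1.5 * gaussian + 0.6,
}

# Finite-difference slope at every point.
first = {name: np.gradient(y, x) for name, y in curves.items()}

peak = int(np.argmax(gaussian))      # index of the peak maximum
for name, d1 in first.items():
    # Positive slope before the peak, zero at the peak, negative after it.
    print(name, d1[peak - 1], d1[peak], d1[peak + 1])
```

Because the offsets are constants, each derivative curve is identical to the derivative of the same Gaussian without its offset.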

In complex spectra, there may be many zero points, and while a first derivative is adequate for removing a

purely additive offset, interpretation of the zero points becomes difficult.

The second derivative may be useful in this instance.

The second derivative is a measure of the change in the slope of the curve. In addition to

removing pure additive offset, it is not affected by any linear “tilt” that may exist in the data,

and is therefore a very effective method for removing both the baseline offset and slope

from a spectrum. The second derivative can help resolve nearby peaks and sharpen spectral

features. Peaks in raw spectra change sign and turn to negative peaks with lobes on either

side in the second derivative.

Example:

Returning to the Gaussian curves, the second derivative can be conceptualized as the slope

of the first derivative. At the zero point in the first derivative, this slope is at its steepest

and is negative, so the maxima in the original raw data become minima in the

second derivative. The figure below demonstrates this.

Second derivative of Gaussian curves

Another important feature of the second derivative is that the relative intensities of the original

curves are preserved in the second derivatives. This is an extremely useful

property, especially when performing quantitative analyses such as regression analysis.
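
A numerical check of these properties is sketched below in Python with NumPy. The intensities, offsets and tilts are made-up values, and the second derivative is taken by applying a finite-difference slope twice, which is an assumption of this example and not the method used by The Unscrambler®.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 201)
gaussian = np.exp(-0.5 * x**2)
intensities = [0.5, 1.0, 1.5]

# Each curve gets its own additive offset and linear tilt (illustrative values).
curves = [a * gaussian + 0.2 * a + 0.05 * a * x for a in intensities]


def second_derivative(y, x):
    """Slope of the slope, using finite differences."""
    return np.gradient(np.gradient(y, x), x)


d2 = [second_derivative(y, x) for y in curves]
peak = int(np.argmax(gaussian))

# The raw maximum becomes the deepest minimum, and the depth of that
# minimum follows the order of the original intensities.
minima = [d[peak] for d in d2]
print(minima)
```

The offset and the tilt contribute nothing to the result: each second derivative matches that of the corresponding Gaussian alone.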

Third and fourth derivatives

Third and fourth derivatives are available in The Unscrambler® although they are not as

popular as first and second derivatives. They may reveal phenomena which do not appear

clearly when using lower-order derivatives and can be helpful in understanding the spectral

data. Prudent use of the fourth derivative has been shown to emphasize small variations

caused by temperature changes and compositional changes. Higher-order derivatives do

significantly reduce the signal in the transformed data.

Savitzky-Golay vs. Gap-Segment

The Savitzky-Golay method and the Gap-Segment method use information from a localized

segment of the spectrum to calculate the derivative at a particular wavelength rather than

the difference between adjacent data points. In most cases, this avoids the problem of noise

enhancement from the simple difference method and may actually apply some smoothing to

the data.
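
The difference in noise behavior can be sketched in Python with NumPy. The example below builds Savitzky-Golay weights from a least-squares polynomial fit and compares the result with the simple difference on a synthetic noisy Gaussian. The noise level, the 21-point segment and the function name savgol_weights are assumptions chosen for illustration; this is not The Unscrambler® implementation.

```python
from math import factorial

import numpy as np


def savgol_weights(segment: int, polyorder: int, deriv: int) -> np.ndarray:
    """Weights giving the deriv-th derivative of a least-squares polynomial
    at the segment centre, for unit spacing between points."""
    if segment % 2 == 0 or polyorder >= segment or deriv > polyorder:
        raise ValueError("need an odd segment and deriv <= polyorder < segment")
    half = segment // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Columns: offsets**0, offsets**1, ..., offsets**polyorder
    design = np.vander(offsets, polyorder + 1, increasing=True)
    # Row `deriv` of the pseudo-inverse yields the fitted coefficient of
    # offsets**deriv; multiplying by deriv! turns it into the derivative.
    return factorial(deriv) * np.linalg.pinv(design)[deriv]


rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 501)
step = x[1] - x[0]
clean = np.exp(-0.5 * x**2)
noisy = clean + rng.normal(scale=0.01, size=x.size)

# Simple difference between adjacent points; each value sits at a midpoint.
x_mid = 0.5 * (x[1:] + x[:-1])
simple = np.diff(noisy) / step
simple_rms = np.sqrt(np.mean((simple + x_mid * np.exp(-0.5 * x_mid**2)) ** 2))

# Savitzky-Golay first derivative; points without a full segment are dropped.
coeffs = savgol_weights(segment=21, polyorder=2, deriv=1)
half = coeffs.size // 2
sg = np.correlate(noisy, coeffs, mode="valid") / step
true_slope = -x * clean
sg_rms = np.sqrt(np.mean((sg - true_slope[half:-half]) ** 2))

print("RMS error, simple difference:", simple_rms)
print("RMS error, Savitzky-Golay:   ", sg_rms)
```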

The Gap-Segment method requires gap size and smoothing segment size (usually measured

in wavelength span, but sometimes in terms of data points). The Savitzky-Golay method uses

a convolution function, and thus the number of data points (segment) in the function must

be specified. If the segment is too small, the result may be no better than using the simple

difference method. If it is too large, the derivative will not represent the local behavior of

the spectrum (especially in the case of Gap-Segment), and it will smooth out too much of the

important information (especially in the case of Savitzky-Golay). Although there have been

many studies done on the appropriate size of the spectral segment to use, a good general

rule is to use a sufficient number of points to cover the full width at half height of the largest

absorbing band in the spectrum. One can also find optimum segment sizes by checking

model accuracy and robustness under different segment size settings.
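
The general rule above can be turned into a starting value for the segment size. The following Python sketch counts the data points that lie at or above half the height of the tallest band. It assumes a reasonably noise-free spectrum whose minimum value can serve as the baseline, and it rounds up to an odd number so that the segment is centred on a data point; both function names are hypothetical.

```python
import numpy as np


def fwhm_points(spectrum) -> int:
    """Number of consecutive points around the tallest peak that are at or
    above half its height, measured from the spectrum minimum."""
    y = np.asarray(spectrum, dtype=float)
    baseline = y.min()
    peak = int(np.argmax(y))
    half_height = baseline + 0.5 * (y[peak] - baseline)
    left = peak
    while left > 0 and y[left - 1] >= half_height:
        left -= 1
    right = peak
    while right < y.size - 1 and y[right + 1] >= half_height:
        right += 1
    return right - left + 1


def suggested_segment(spectrum) -> int:
    """Round the point count up to the next odd number."""
    n = fwhm_points(spectrum)
    return n if n % 2 == 1 else n + 1


x = np.linspace(-5.0, 5.0, 201)
band = np.exp(-0.5 * x**2)
print(suggested_segment(band))
```

The value obtained this way is only a first guess; as noted above, it can be refined by comparing model accuracy and robustness for several segment sizes.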

Example:

Using data from an FT-NIR spectrometer, the next figure shows what happens when the

selected segment size is too small (Savitzky-Golay derivative, 3-point segment and second-order

polynomial). Noisy features remain in the spectra when the segment size is too small.

Derivatized data with a segment size set too small

In the figure that follows, the selected segment size is too large (Savitzky-Golay derivative,

31-point segment and second-order polynomial). One can see that some relevant

information has been smoothed out.

Derivatized data with a segment size set too large

The main disadvantage of using derivative preprocessing is that the resulting spectra can be

difficult to interpret. However, this can also be advantageous, especially when a user is

looking for both specificity and selectivity of particular constituents in complex sample

matrices.

More details regarding Derivative transforms are given in the Method References.

Gap derivative

This is a special case of Gap-Segment Derivative with segment size = 1 and therefore does

not smooth the data. This derivative requires that the data all be numeric and that there are

at least five variables for each sample, and no missing values.

Properties of Gap-segment and Gap derivatives

Karl Norris has developed a powerful approach for the pretreatment of near-infrared

spectral data in which two distinct items are involved. The first is the Gap Derivative, the

second is the “Norris Regression”, which may or may not use the derivatives. The Gap

Derivative is applied to improve the rejection of interfering absorbers. The “Norris

Regression” is a regression procedure to reduce the impact of varying baseline, variable path

lengths, and high stray light among samples due to scatter effects.

Tasks – Transform – Derivative – Gap Derivative

This method computes derivatives of up