Multivariate Data Analysis In Practice 5th Edition

An Introduction to Multivariate Data Analysis and Experimental Design

Kim H. Esbensen

Ålborg University, Esbjerg

with contributions from

Dominique Guyot

Frank Westad

Lars P. Houmøller

www.camo.com

CAMO Software AS, Nedre Vollgate 8, N-0158 Oslo, NORWAY

Tel: (47) 223 963 00

CAMO Software Inc., One Woodbridge Center, Suite 319, Woodbridge, NJ 07095, USA

Tel: (732) 726 9200

CAMO Software India Pvt. Ltd., 14 & 15, Krishna Reddy Colony, Domlur Layout, Bangalore - 560 071, INDIA

Excel were used to make some of the illustrations. The screen captures were taken with Paint Shop Pro.

Trademark Acknowledgments

Doc-To-Help is a trademark of WexTech Systems, Inc.

Microsoft is a registered trademark and Windows 95, Windows NT, Excel and Word are trademarks of the Microsoft Corporation.

PaintShop Pro is a trademark of JASC, Inc. Visio is a trademark of the Shapeware Corporation.

Information in this book is subject to change without notice. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without the express written permission of CAMO Process AS.

ISBN 82-993330-3-2

Preface

October 2001

Learning to do multivariate data analysis is in many ways like learning to drive a car: you are not let loose on the road without mandatory training, theoretical and practical, as required by current concern for traffic safety. As a minimum you need to know how a car functions and you need to know the traffic code. On the other hand, everybody would agree that it is only after having obtained your driver’s license that the real practical learning begins. This is when your personal experience really starts to accumulate. There is a strong interaction between the theory absorbed and the practice gained in this secondary, personal training period.

Please substitute “multivariate data analysis” for “driving a car” in all of the above. In this context, too, you are not let out on the data analytical road without mandatory training, theoretical and practical. The analogy is actually very apt!

This book presents a basic theoretical foundation for bilinear (projection-based) multivariate data modeling and gives a conceptual framework for starting to do your own data modeling on the data sets provided. There are some 25 data sets included in this training package. By doing all exercises included you’re off to a flying start!

Driving your newly acquired multivariate data analysis car is very much an evolutionary process: this introductory textbook is filled with illustrative examples, many practical exercises and a full set of self-examination real-world data analysis problems (with corresponding data sets). If, after all of this, you are able to work confidently on your own applications, you’ll have reached the goal set for this book.

This is the 5th revised edition of this book. The first three editions were mainly reprints, the only major change being the inclusion of a completely revised chapter on “Introduction to experimental design”, which first appeared in the 3rd edition (CAMO). The 4th revised edition (published March 2000), however, saw many major extensions and improvements:

• Text completely rewritten by the senior author, based on five years of extensive use in teaching at both university and dedicated course levels. More than 5,500 copies in use.

• 30% new theory & text material added, reflecting extensive student response, full integration of PCA, PLS1 & PLS2 NIPALS algorithms and explanations.

• Text revised with an augmented self-learning objective throughout.

• Four new master data sets added (with extended self-exercise potential):

1. Master violin data (PCA/PLS)

2. Norwegian car dealerships (PCA/PLS)

3. Vintages (PCA/PLS)

4. Acoustic chemometric calibration (PCR/PLS)

• Additional chapter on experimental design: new features include mixture designs and D-optimal designs.

• New chapter on the powerful, novel “Martens’ Uncertainty Test”.

• Comprehensive glossary of terms.

This 5th edition also includes essential additional revisions and improvements:

• Lars P. Houmøller, Ålborg University Esbjerg, has carried out a complete work-through of all demonstrations and exercises. Many of these had not been updated with respect to several of the intervening UNSCRAMBLER software versions. We are happy to have finally eliminated this most frustrating nuisance.

About the authors

Kim H. Esbensen, Ph.D., has more than 20 years of experience in multivariate data analysis and applied chemometrics. From 1995 to 2001 he was professor in chemometrics at the Norwegian Telemark Institute of Technology (HIT/TF), Institute of Process Technology (PT), where he was also head of the Chemometrics Department, Tel-Tek, Telemark Industrial R&D Center, Porsgrunn. Between these institutions he founded ACRG, the Applied Chemometrics Research Group (HIT/TF-Tel-Tek), which among other things hosted SSC6, the 6th Scandinavian Symposium on Chemometrics, in August 1999, as well as numerous other international courses, workshops and meetings.

On July 1st, 2001, he moved to a position as research professor in Applied Chemometrics at Ålborg University, Esbjerg, Denmark (AUE), where he currently leads ACACSRG: the Applied Chemometrics, Analytical Chemistry and Sampling Research Group. As the name implies, applied chemometrics activities continue in Esbjerg while new activities are added, most notably through close collaboration with assoc. prof. Lars P. Houmøller, who independently built up the area of analytical chemistry/chemometrics at AUE before Prof. Esbensen’s arrival. Most recently the discipline of sampling (proper sampling) has been added, in recognition of the immense importance of sampling in any data analytical discipline, including chemometrics.

Kim H. Esbensen has published more than 60 papers and technical reports on a wide range of chemical, geochemical, industrial, technological, remote sensing, image analytic and acoustic chemometric applications. Together with Paul Geladi he has been instrumental in co-developing the concept of Multivariate Image Analysis (MIA); with ACRG he pioneered the development of the novel area of acoustic chemometrics.

He received his M.Sc. (geology, geochemistry) from the University of Aarhus, Denmark, in 1978, and his Ph.D. was conferred by the Technical University of Denmark (DTH) in 1981 within the areas of metallurgy, meteoritics and multivariate data analysis. He then did post-doctoral work for two years with the Research Group for Chemometrics at the University of Umeå 1980-1981, after which he worked for two more years in a Swedish geochemical exploration company, Terra Swede. Moving to Norway, he then spent eight years as data analytical research scientist at the Norwegian Computing Center (NCC), Oslo, after which he became a senior research scientist at SINTEF, the Norwegian Foundation for Industrial and Technological Research, for four additional years. In between these two assignments he was a visiting guest professor at Norsk Hydro’s Research Center in Bergen, Norway. He also holds a position as Chercheur associé (now Chercheur affilié) du Centre de Recherche en Géomatique, Université Laval, Quebec. He is a member of the editorial board of Journal of Chemometrics, Wiley Publishers, and is a member of ICS, AGU and several other geological, data analytical and statistical associations.

Dominique Guyot, educated in Statistics, Economics and Biomathematics (ENSAE and Université de Paris 7, France), has 15 years of experience in the field of chemometrics. She gained industrial experience from her work in the pharmaceutical and cosmetic industries, before joining CAMO from 1995 until 2000. With CAMO, Dominique worked as a Senior Consultant, and was particularly involved in food applications. She put together a practical strategy for efficient product development, based on experimental design and multivariate data analysis. This strategy was implemented in the Guideline®+ software package, complemented by an integrated training course focusing on multivariate methods for food product developers. Dominique is now studying music and singing at the Conservatoire of Trondheim, Norway.

Frank Westad has an M.Sc. in physical chemistry from the University of Trondheim, Norway. He has 13 years of experience in applied multivariate data analysis, and he completed a Ph.D. in multivariate regression in 2000. Frank has given numerous courses in experimental design and multivariate analysis for companies in Europe and in the U.S.A. His main research fields include variable selection, shift modelling and image analysis.

Lars P. Houmøller has an M.Sc. in chemistry and physics from the University of Aarhus, Denmark. He has 12 years of experience in analytical chemistry and has worked 5-7 years with chemometrics. His teaching experience includes chemometrics, analytical chemistry, spectroscopy, physical chemistry, general and technical chemistry, organic and inorganic chemistry, unit operations and fluid dynamics. His research field covers NIR spectroscopic applications over a very broad industrial spectrum. He also has experience from working in the Danish food production industry.

About this book

Since 1986, when CAMO ASA first commercialized and started marketing THE UNSCRAMBLER, many customers have asked for basic, easy-to-understand literature on chemometrics. In 1993 a group of data analysts at different competence levels was invited to a one-day seminar at CAMO, Trondheim, to discuss their experience from both learning and teaching chemometrics. The result was a blueprint outline for what came to be this introductory book: the specifications called for a comprehensive training package, involving basic, practical, easy-to-read, largely non-mathematical theory, with plenty of hands-on examples and exercises on real-world data sets. CAMO contracted SINTEF to write this book (first three editions), and the parties agreed to cooperate on completing the full training package.

In the intervening years, some 4,500 copies of this book were published, and it was used for introductory basic training at some 15 universities and in several hundred industrial companies; reactions were many and largely constructive. We learned a lot from these criticisms; we thank all who contributed!

By 1999, the time was ripe for a complete revision of the entire package. This was undertaken by the senior author in the summer of 1999 with significant assistance from his then Ph.D. student Jun Huang (now with CAMO, Norway); Frank Westad (Matforsk), who wrote chapter 14 (Martens’ Uncertainty Test); and Dominique Guyot (CAMO), who wrote the entirely new chapter 17 (Complex Experimental Design Problems); with further invaluable editorial and managerial contributions from Michael Byström (CAMO) and Valérie Lengard (CAMO). A most sincere thank you goes to Peter Hindmarch (CAMO, UK) for very effective linguistic streamlining of the 4th edition! The authors and CAMO also take this opportunity to acknowledge Suzanne Schönkopf’s (CAMO) contribution to the editions preceding the 4th. The present edition of this book still bears the fruit of her very important past efforts.

The publication of the 4th edition, in March 2000, was unfortunately somewhat marred by a less than complete revision of the exercises and illustrative UNSCRAMBLER runs in the book, which was not considered fatal at the time. This soon proved to be a serious mistake: disappointment and frustration rapidly followed from several generations of students who wanted to follow all the exercises closely. A Danish university teacher who had himself experienced this frustration close up when using the book in his own teaching, assoc. prof. Lars P. Houmøller at the University of Ålborg, Esbjerg, voluntarily took it upon himself to carry out a complete work-through of this essential didactic aspect of the book. His very valuable demo and exercise revisions, as well as a very thorough text consistency check, have now been included in toto in the 5th edition.

Today, this book is a collaborative effort between the senior author and CAMO Process AS; the tie with SINTEF is now defunct.

There is little academic glamour in writing an introductory level textbook, as the senior author knows well from experience, but glamour was never the goal anyway. On the other hand, the introductory level is definitely where the largest audience and potential market exist, as CAMO knows equally well. The senior author has used the book for six consecutive years teaching introductory chemometrics, largely to engineering (M.Sc.) students, as well as for extensive course work in industrial and foreign university environments. The response from some 500 students in total has made this author happy, while some 5,500 sales have made CAMO equally satisfied.

Thus all is well with the training package! We hope that this revised 5th edition will continue to meet the challenging demands of the market, hopefully now in an improved form. Writing for precisely this introductory audience/market constitutes the highest scientific and didactic challenge, and is thus (still) irresistible!

Acknowledgements

The authors wish to thank the following persons, institutions and companies for their very valuable help in the preparation of this training package:

Hans Blom, Østlandskonsult AS, Fredrikstad, Norway

Frode Brakstad, Norsk Hydro F-Center, Porsgrunn, Norway

Rolf Carlson, Department of Chemistry, University of Tromsø, Norway

Chevron Research & Technology Co, Richmond, CA, USA

Lennart Eriksson, Dept. of Organic Chemistry, University of Umeå, Sweden (now with Umetrics, Inc.)

Professor Magni Martens, The Royal Veterinary & Agricultural University, Denmark

Geological Survey of Greenland, Denmark

IKU, Institute for Petroleum Research, Trondheim, Norway

Norwegian Food Research Institute (MATFORSK), Ås, Norway

Norwegian Society of Process Control

Norwegian Chemometrics Society

International Chemometrics Society

UOP Guided Wave, CA, USA

Pierre Gy, Cannes, France (for a gentleman’s introduction to the finest French wines)

Zander & Ingerström, Oslo, Norway

Tomas Öberg Konsult AB, Karlskoga, Sweden

KAPITAL (weekly Norwegian economic magazine), no 14/1994, p50-55

Hlif Sigurjonsdottir, Reykjavik, Iceland (owner of G. Sgarabotto “violin no 9”)

Birgitta Spur, LSO, Reykjavik, Iceland (permission to use the Sgarabotto oeuvre data)

Sensorteknikk A/S, Bærum, Oslo (Bjørn Hope: sensor technology entrepreneur extraordinaire; Evy: for innumerable occasions: warm company, coffee and waffles, waffles, waffles)

Thorbjørn T. Lied, Maths Halstensen, Tore Gravermoen, Rune Mathisen and others (for enormous help in developing acoustic chemometrics)

“Anonymous wine importer”, Odense, Denmark

Helpful wine assessors (partly anonymous), Manson, WA, USA

Finally, the author(s) and CAMO wish to thank all THE UNSCRAMBLER users during the last seven years for their close relationships with us, which have given us so much added experience in teaching multivariate data analysis, and for all the constructive criticism of the earlier editions of this book. Last, but certainly not least, a warm thank you to all the students at HIT/TF, at Ålborg University, Esbjerg and many, many others, who have been associated with the teachings of the authors, nearly all of whom have been very constructive in their ongoing criticism of the entire teaching system embedded in this training package. We even learned from the occasional not-so-friendly criticisms…

Communication

After a formative period of seven years, the training package has come of age. By now we are actually beginning to be rather satisfied with it!

And yet: the author(s) and CAMO always welcome all critical responses to the present text. They are seriously needed if this work is to keep improving.

Contents

3. Principal Component Analysis (PCA) – Introduction
3.1 Representing the Data as a Matrix
3.2 The Variable Space - Plotting Objects in p Dimensions
3.3 Plotting Objects in Variable Space
3.3.1 Exercise - Plotting Raw Data (People)
3.4 The First Principal Component
3.5 Extension to Higher-Order Principal Components
3.6 Principal Component Models - Scores and Loadings
3.6.1 Model Center
3.6.2 Loadings - Relations Between X and PCs
3.6.3 Scores - Coordinates in PC Space
3.6.4 Object Residuals
3.7 Objectives of PCA
3.8 Score Plot - “Map of Samples”
3.9 Loading Plot - “Map of Variables”
3.10 Exercise: Plotting and Interpreting a PCA-Model (People)
3.11 PC-Models
3.11.1 The PC Model: X = TP^T + E = Structure + Noise
3.11.2 Residuals - The E-Matrix
3.11.3 How Many PCs to Use?
3.11.4 Variable Residuals
3.11.5 More about Variances - Modeling Error Variance
3.12 Exercise - Interpreting a PCA Model (Peas)
3.13 Exercise - PCA Modeling (Car Dealerships)
3.14 PCA Modeling – The NIPALS Algorithm

4. Principal Component Analysis (PCA) - In Practice
4.1 Scaling or Weighting
4.2 Outliers
4.2.1 Scaling, Transformation and Normalization are Highly Problem Dependent Issues
4.3 PCA Step by Step
4.3.1 The Unscrambler and PCA
4.4 Summary of PCA
4.4.1 Interpretation of PCA-Models
4.4.2 Interpretation of Score Plots – Look for Patterns
4.4.3 Summary - Interpretation of Score Plots
4.4.4 Summary - Interpretation of Loading Plots
4.5 PCA - What Can Go Wrong?
4.6 Exercise - Detecting Outliers (Troodos)

5. PCA Exercises – Real-World Application Examples
5.1 Exercise - Find Clusters (Iris Species Discrimination)
5.2 Exercise - PCA for Experimental Design (Lewis Acids)
5.3 Exercise - Mud Samples
5.4 Exercise - Scaling (Troodos)

6. Multivariate Calibration (PCR/PLS)
6.1 Multivariate Modeling (X, Y): The Calibration Stage
6.2 Multivariate Modeling (X, Y): The Prediction Stage
6.3 Calibration Set Requirements (Training Data Set)
6.4 Introduction to Validation
6.5 Number of Components (Model Dimensionality)
6.6 Univariate Regression (y|x) and MLR
6.6.1 Univariate Regression (y|x)
6.6.2 Multiple Linear Regression, MLR
6.7 Collinearity
6.8 PCR - Principal Component Regression
6.8.1 Exercise - Interpretation of Jam (PCR)
6.8.2 Weaknesses of PCR
6.9 PLS-Regression (PLS-R)
6.9.1 PLS - A Powerful Alternative to PCR
6.9.2 PLS (X, Y): Initial Comparison with PCA(X), PCA(Y)
6.9.3 PLS2 – NIPALS Algorithm
6.9.4 Interpretation of PLS Models
6.9.5 The PLS1 NIPALS Algorithm
6.9.6 Exercise - Interpretation of PLS1 (Jam)
6.9.7 Exercise - Interpretation PLS2 (Jam)
6.10 When to Use which Method?
6.10.1 Exercise - Compare PCR and PLS1 (Jam)
6.11 Summary

7. Validation: Mandatory Performance Testing
7.1 The Concept of Test Set Validation
7.1.1 Calculating the Calibration Variance (Modeling Error)
7.1.2 Calculating the Validation Variance (Prediction Error)
7.1.3 Studying the Calibration and Validation Variances
7.2 Requirements for the Test Set
7.3 Cross Validation
7.4 Leverage Corrected Validation

8. How to Perform PCR and PLS-R
8.1 PLS and PCR - Step by Step
8.2 Optimal Number of Components in Modeling
8.3 Information in Later PCs
8.4 Exercises on PLS and PCR: the Heart-of-the-Matter!
8.4.1 Exercise - PLS2 (Peas)
8.4.2 Exercise - PLS1 or PLS2? (Peas)
8.4.3 Exercise - Is PCR better than PLS? (Peas)

9. Multivariate Data Analysis – in Practice: Miscellaneous Issues
9.1.1 Data Matrix Dimensions
9.1.2 Missing Data
9.2 Data Collection
9.2.1 Use Historical Data
9.2.2 Monitoring Data from an On-Going Process
9.2.3 Data Generated by Planned Experiments
9.2.4 Perform Experiments or Collect Data - Always by Careful Reflection
9.2.5 The Random Design – A Powerful Alternative
9.3 Selecting from Abundant Data
9.3.1 Selecting a Calibration Data Set from Abundant Training Data
9.3.2 Selecting a Validation Data Set
9.4 Error Sources
9.5 Replicates - A Means to Quantify Errors
9.6 Estimates of Experimental - and Measurement Errors
9.6.1 Error in Y (Reference Method): Reproducibility
9.6.2 Stability over Consecutive Measurements: Repeatability
9.7 Handling Replicates in Multivariate Modeling
9.8 Validation in Practice
9.8.1 Test Set
9.8.2 Cross Validation
9.8.3 Leverage Correction
9.8.4 The Multivariate Model – Validation Alternatives
9.9 How Good is the Model: RMSEP and Other Measures
9.9.1 Residuals
9.9.2 Residual Variances (Calibration, Prediction)
9.9.3 Correction for Degrees of Freedom
9.9.4 RMSEP and RMSEC - Average, Representative Errors in Original Units
9.9.5 RMSEP, SEP and Bias
9.9.6 Comparison Between Prediction Error and Measurement Error
9.9.7 Compare RMSEP for Different Models
9.9.8 Compare Results with Other Methods
9.9.9 Other Measures of Errors
9.10 Prediction of New Data
9.10.1 Getting Reliable Prediction Results
9.10.2 How Does Prediction Work?
9.10.3 Prediction Used as Validation
9.10.4 Uncertainty at Prediction
9.10.5 Study Prediction Objects and Training Objects in the Same Plot
9.11 Coding Category Variables: PLS-DISCRIM
9.12 Scaling or Weighting Variables
9.13 Using the B- and the Bw-Coefficients
9.14 Calibration of Spectroscopic Data
9.14.1 Spectroscopic Data: Calibration Options
9.14.2 Interpretation of Spectroscopic Calibration Models
9.14.3 Choosing Wavelengths

10. PLS (PCR) Exercises: Real-World Application Examples - I
10.1 Exercise - Prediction of Gasoline Octane Number
10.2 Exercise - Water Quality
10.3 Exercise - Freezing Point of Jet Fuel
10.4 Exercise - Paper

11. PLS (PCR) Multivariate Calibration – In Practice
11.1 Outliers and Subgroups
11.1.1 Scores
11.1.2 X-Y Relation Outlier Plots (T vs. U Scores)
11.1.3 Residuals
11.1.4 Dangerous Outliers or Interesting Extremes?
11.2 Systematic Errors
11.2.1 Y-Residuals Plotted Against Objects
11.2.2 Residuals Plotted Against Predicted Values
11.2.3 Normal Probability Plot of Residuals
11.3 Transformations
11.3.1 Logarithmic Transformations
11.3.2 Spectroscopic Transformations
11.3.3 Multiplicative Scatter Correction
11.3.4 Differentiation
11.3.5 Averaging
11.3.6 Normalization
11.4 Non-Linearities
11.4.1 How to Handle Non-Linearities?
11.4.2 Deleting Variables
11.5 Procedure for Refining Models
11.6 Precise Measurements vs. Noisy Measurements
11.7 How to Interpret the Residual Variance Plot
11.8 Summary: The Unscrambler Plots Revealing Problems

12. PLS (PCR) Exercises: Real-World Applications - II
12.1 Exercise - Log-Transformation (Dioxin)
12.2 Exercise - Multiplicative Scatter Correction (Alcohol)
12.3 Exercise – “Dirty Data” (Geologic Data with Severe Uncertainties)
12.4 Exercise - Spectroscopy Calibration (Wheat)
12.5 Exercise - QSAR (Cytotoxicity)

13. Master Data Sets: Interim Examination
13.1 Sgarabotto Master Violin Data Set
13.2 Norwegian Car Dealerships - Revisited
13.3 Vintages
13.4 Acoustic Chemometrics (a. c.)

14. Uncertainty Estimates, Significance and Stability (Martens’ Uncertainty Test)
14.1 Uncertainty Estimates in Regression Coefficients, b
14.2 Rotation of Perturbed Models
14.3 Variable Selection
14.4 Model Stability
14.4.1 Introduction
14.4.2 An Example Using the Paper Data
14.5 Exercise - Paper - Uncertainty Test and Model Stability

15. SIMCA: An Introduction to Classification
15.1 SIMCA - Fields of Use
15.2 How to Make SIMCA Class-Models?
15.2.1 Basic SIMCA Steps: A Standard Flow-Sheet
15.3 How Do we Classify new Samples?
15.4 Classification Results
15.4.1 Statistical Significance Level and its Use: An Introduction
15.5 Graphical Interpretation of Classification Results
15.5.1 The Coomans Plot
15.5.3 Si/S0 vs. Hi
15.5.4 Model Distance
15.5.5 Variable Discrimination Power
15.5.6 Modeling Power
15.6 SIMCA-Exercise – IRIS Classification

16. Introduction to Experimental Design
16.1 Experimental Design
16.2 Screening Designs
16.2.1 Full Factorial Designs
16.2.2 Fractional Factorial Designs
16.2.3 Plackett-Burman Designs
16.3 Analyzing a Screening Design
16.3.1 Significant effects
16.3.2 Using F-Test and P-Values to Determine Significant Effects
16.3.3 Exercise - Willgerodt-Kindler Reaction
16.4 Optimization Designs
16.4.1 Central Composite Designs
16.4.2 Box-Behnken Designs
16.5 Analyzing an Optimization Design
16.5.1 Exercise - Optimization of Enamine Synthesis
16.6 Practical Aspects of Making an Experimental Design
16.7 Extending a Design
16.8 Validation of Designed Data Sets
16.9 Problems in Designed Data Sets
16.9.1 Detect and Interpret Effects
16.9.2 How to Separate Confounded Effects?
16.9.3 Blocking and Repeated Response Measurements
16.9.4 Fold-Over Designs
16.9.5 What Do We Do if We Cannot Keep to the Planned Variable Settings?
16.9.6 A “Random Design”
16.9.7 Modeling Uncoded Data
16.10 Exercise - Designed Data with Non-Stipulated Values (Lacotid)
16.11 Experimental Design Procedure in The Unscrambler

17. Complex Experimental Design Problems
17.1 Introduction to Complex Experimental Design Problems
17.1.1 Constraints Between the Levels of Several Design Variables
17.1.2 A Special Case: Mixture Situations
17.1.3 Alternative Solutions
17.2 The Mixture Situation
17.2.1 An Example of Mixture Design
17.2.2 Screening Designs for Mixtures
17.2.3 Optimization Designs for Mixtures
17.2.4 Designs that Cover a Mixture Region Evenly
17.3 How To Deal With Constraints
17.3.1 Introduction to the D-Optimal Principle
17.3.2 Non-Mixture D-Optimal Designs
17.3.3 Mixture D-Optimal Designs
17.3.4 Advanced Topics
17.4 How To Analyze Results From Constrained Experiments
17.4.1 Use of PLS Regression For Constrained Designs
17.4.2 Relevant Regression Models
17.4.3 The Mixture Response Surface Plot
17.5 Exercise - Build a Mixture Design - Wines

18. Comparison of Methods for Multivariate Data Analysis - And their Validation
18.1 Comparison of Selected Multivariate Methods
18.1.1 Principal Component Analysis (PCA)
18.1.2 Factor Analysis (FA)
18.1.3 Cluster Analysis (CA)
18.1.4 Linear Discriminant Analysis (LDA)
18.1.5 Comparison: Projection Dimensionality in Multivariate Data Analysis
18.1.6 Multiple Linear Regression (MLR)
18.1.7 Principal Component Regression (PCR)
18.1.8 Partial Least Squares Regression (PLS-R)
18.1.9 Increasing Projection Dimensionality in Regression Modeling
18.2 Choosing Multivariate Methods Is Not Optional!
18.2.1 Problem Formulation
18.3 Unsupervised Methods
18.5 A Final Discussion about Validation
18.5.1 Test Set Validation
18.5.2 Cross Validation
18.5.3 Leverage Corrected Validation
18.5.4 Selecting a Validation Approach in Practice
18.6 Summary of Basic Rules for Success
18.7 From Here – You Are on Your Own. Good Luck!

19. Literature

20. Appendix: Algorithms
20.1 PCA
20.2 PCR
20.3 PLS1
20.4 PLS2

21. Appendix: Software Installation and User Interface
21.1 Welcome to The Unscrambler
21.2 How to Install and Configure The Unscrambler
21.3 Problems You Can Solve with The Unscrambler
21.4 The Unscrambler Workplace
21.4.2 The Editor
21.4.3 The Viewer
21.4.4 Dockable Views
21.4.5 Dialogs
21.4.6 The Help System
21.4.7 Tooltips
21.5 Using The Unscrambler Efficiently
21.5.1 Analyses
21.5.2 Some Tips to Make Your Work Easier

Glossary of Terms