Introduction
1.6 Multivariate analysis
1.6.2 The Unscrambler
The Unscrambler (CAMO 1994) is a commercial software package that was used in this work to perform multivariate analysis on spectra obtained using energy dispersive diffraction techniques on bone phantoms. The package contains a number of methods for analysing sets of data to find internal relationships within them. In this study the Unscrambler was used to do two things.
1. Establish the regression relationships between two sets of data. The program achieves this by using two sets of known data, the X-variables and the Y-variables, which together form a training or calibration set. The X data consist of a number of objects, each containing a number of variables. In this study an object is a measured diffraction spectrum and the variables are the momentum transfer values in that spectrum. The Y data are quantities related to a given spectrum and must therefore contain the same number of objects, each object having as many variables as are chosen to be modelled. A single-variable model may be the bone mineral content represented in the spectrum; a two-variable model could be the bone content and marrow content. A three-variable model could add cortical thickness, and a four-variable model could further add the thickness of the soft tissue surrounding the bone. However many variables are modelled, the calibration process is
X-variables + Y-variables => model
2. Predict unknown values of the Y-variables from new X-variables and the previously created model, i.e. the prediction process is
X-variables + model => Y-variables
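The two steps above can be illustrated with a minimal sketch in Python. This is not the Unscrambler's own code: it is a textbook-style PLS1 calibration written with NumPy, and the function names, the synthetic "bone" and "marrow" profiles and the channel axis are all hypothetical choices made for illustration. The spectra are noise-free mixtures of two profiles, so two components are assumed to be sufficient.

```python
import numpy as np

def pls1_calibrate(X, y, n_pcs):
    """X-variables + Y-variable => model (minimal PLS1 sketch, one Y-variable)."""
    x_centre, y_centre = X.mean(axis=0), y.mean()
    E, f = X - x_centre, y - y_centre      # centred data, deflated below
    W, P, q = [], [], []
    for _ in range(n_pcs):
        w = E.T @ f                        # X-weights, guided by y
        w = w / np.linalg.norm(w)
        t = E @ w                          # scores for this component
        p = E.T @ t / (t @ t)              # X-loadings
        q_a = (f @ t) / (t @ t)            # Y-loading
        E = E - np.outer(t, p)             # remove the modelled part of X
        f = f - q_a * t                    # remove the modelled part of y
        W.append(w); P.append(p); q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)    # regression vector in X-space
    return {"x_centre": x_centre, "y_centre": y_centre, "b": b}

def pls1_predict(model, X_new):
    """X-variables + model => Y-variable."""
    return (X_new - model["x_centre"]) @ model["b"] + model["y_centre"]

# Hypothetical synthetic data: each object (row) is a spectrum sampled at
# 30 channels, built from two overlapping profiles scaled by their content.
rng = np.random.default_rng(0)
channels = np.linspace(0.5, 3.0, 30)
bone_profile = np.exp(-((channels - 1.6) / 0.4) ** 2)
marrow_profile = np.exp(-((channels - 1.9) / 0.4) ** 2)

def make_spectra(bone, marrow):
    return np.outer(bone, bone_profile) + np.outer(marrow, marrow_profile)

# Step 1: calibration set with known bone content as the single Y-variable.
bone_cal = rng.uniform(0.2, 1.0, 20)
marrow_cal = rng.uniform(0.2, 1.0, 20)
model = pls1_calibrate(make_spectra(bone_cal, marrow_cal), bone_cal, n_pcs=2)

# Step 2: predict the bone content of new spectra from the stored model.
bone_new = np.array([0.3, 0.6, 0.9])
marrow_new = np.array([0.8, 0.5, 0.4])
y_pred = pls1_predict(model, make_spectra(bone_new, marrow_new))
print(y_pred)
```

Real spectra contain noise, so the number of components would normally be chosen by validation rather than fixed in advance as it is here.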
The modelling method used to perform the above is PLSR, as described in section 1.6.1.1. The PLSR method performs a simultaneous and interdependent principal component analysis of both the X and Y matrices in such a way that the information in the Y-matrix is used as a guide for the optimal data reduction of the X-matrix. This method was chosen because it handles several co-varying Y-variables better than principal component regression or multiple linear regression. The Unscrambler has two PLS algorithms: PLS1, which handles only one Y-variable at a time, and PLS2, which handles several Y-variables simultaneously.
Once the principal components are found they are stored in a T matrix, on which the regression of Y is performed. The T-variables are called scores and express the relation between objects. Each column in the T matrix corresponds to one principal component (PC) and each row corresponds to one object in the X-data. The scores indicate which objects are responsible for most of the variation in the data set; equivalently, a score is a measure of how much of a particular PC is present in a particular object.
The loadings express the relationship between the individual variables and the principal components, i.e. they show which variables dominate the model. Loadings are the regression coefficients of each variable on each PC and are stored in the matrices P and Q, which the program computes from the regression of X on T and of Y on T respectively. The Q matrix has one row per PC and each row has one element per Y-variable. When the model is used for prediction, the values of the predicted Y-variables (Ypred) are computed from
Ypred = TQ + Ycentre
where Ycentre is the mean value of the known Y-variables in the calibration set.
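The roles of T, P, Q and Ycentre in the prediction formula can be made concrete with a short NumPy sketch. As a simplifying assumption, the scores here are ordinary principal component scores obtained from a singular value decomposition of the centred X data, used as a stand-in for the PLS scores the Unscrambler would compute; the data are random numbers and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))   # 8 objects, 5 X-variables (made-up numbers)
Y = rng.normal(size=(8, 2))   # 8 objects, 2 Y-variables (made-up numbers)

# Centre both data sets on the calibration means.
x_centre, y_centre = X.mean(axis=0), Y.mean(axis=0)
Xc, Yc = X - x_centre, Y - y_centre

# Scores T: one column per PC, one row per object.
# Assumption: PCA scores from an SVD stand in for PLS scores.
n_pcs = 3
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :n_pcs] * s[:n_pcs]

# Loadings: least-squares regression of X on T and of Y on T.
P = np.linalg.lstsq(T, Xc, rcond=None)[0]   # one row per PC, one column per X-variable
Q = np.linalg.lstsq(T, Yc, rcond=None)[0]   # one row per PC, one column per Y-variable

# Prediction: Ypred = TQ + Ycentre
Y_pred = T @ Q + y_centre
print(Q.shape, Y_pred.shape)
```

Because Q comes from a least-squares fit of the centred Y data on T, the part of Y that is not reproduced by TQ is uncorrelated with every score column.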
The X and Y residuals (or residual variance) are the differences between the measured and modelled X and Y data. They represent the data that cannot be included in the model, i.e. how much of the variation in the data is unexplained. The residuals are stored in the E and F matrices respectively and serve as a valuable validation tool in the modelling procedure. The model is validated to obtain a measure of how good it is and how accurate future predictions made from it will be. In the Unscrambler the validation process is carried out during calibration, and the procedure used in this study is cross validation. Cross validation can be used when the
calibration set is large and representative of the future predictions to be made. A series of calibrations is made, each with a different subset of the calibration objects held out as validation objects, which in effect allows the calibration model to be tested against real test objects. During the calibration process the residuals are computed for each PC, for both the calibration and the validation objects. The optimal number of PCs to use in the model, and hence in future predictions, is the number that gives the smallest residuals.
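A minimal sketch of this procedure is given below, assuming a segmented cross validation with five randomly chosen segments and a simple PLS1 calibration written with NumPy. Neither the segmentation nor the code is taken from the Unscrambler; the synthetic spectra, noise level and function names are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_pcs):
    """Minimal PLS1 calibration; returns the centres and the regression vector."""
    x_centre, y_centre = X.mean(axis=0), y.mean()
    E, f = X - x_centre, y - y_centre
    W, P, q = [], [], []
    for _ in range(n_pcs):
        w = E.T @ f
        w = w / np.linalg.norm(w)
        t = E @ w
        p = E.T @ t / (t @ t)
        q_a = (f @ t) / (t @ t)
        E, f = E - np.outer(t, p), f - q_a * t
        W.append(w); P.append(p); q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return x_centre, y_centre, W @ np.linalg.solve(P.T @ W, q)

def pls1_predict(model, X):
    x_centre, y_centre, b = model
    return (X - x_centre) @ b + y_centre

# Hypothetical calibration set: 20 noisy two-component spectra, 30 channels.
rng = np.random.default_rng(1)
channels = np.linspace(0.5, 3.0, 30)
bone_profile = np.exp(-((channels - 1.6) / 0.4) ** 2)
marrow_profile = np.exp(-((channels - 1.9) / 0.4) ** 2)
bone = rng.uniform(0.2, 1.0, 20)
marrow = rng.uniform(0.2, 1.0, 20)
X = (np.outer(bone, bone_profile) + np.outer(marrow, marrow_profile)
     + rng.normal(0.0, 0.01, (20, 30)))
y = bone

# Each segment is held out once as validation objects while the
# remaining objects are used for calibration.
max_pcs = 4
segments = np.array_split(rng.permutation(len(y)), 5)
val_var = []                                 # validation residual variance per PC count
for n_pcs in range(1, max_pcs + 1):
    squared = []
    for seg in segments:
        keep = np.setdiff1d(np.arange(len(y)), seg)
        model = pls1_fit(X[keep], y[keep], n_pcs)
        squared.extend((pls1_predict(model, X[seg]) - y[seg]) ** 2)
    val_var.append(float(np.mean(squared)))

# The number of PCs giving the smallest validation residual is retained.
best_n_pcs = int(np.argmin(val_var)) + 1
print(val_var, best_n_pcs)
```

In this sketch only the Y residuals of the validation objects are tracked; the calibration residuals and the X residuals could be accumulated in the same loop.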
Another useful measure computed by the Unscrambler is the root mean square error of prediction (RMSEP), which expresses the expected error in a predicted Y value. The RMSEP is defined as the square root of the average of the squared differences between the predicted and measured Y-variables (or the square root of the residual Y variance) and is calculated from the validation objects used in the cross validation procedure during calibration. The Unscrambler reports each predicted value as an absolute value ± a deviation, which is calculated using an empirical formula based on the X and Y residual variances.
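The RMSEP definition above translates directly into a few lines of Python. The values below are invented purely to illustrate the arithmetic and are not results from this study; the empirical deviation formula is not reproduced here because it is not given in the text.

```python
import math

def rmsep(predicted, measured):
    """Square root of the mean squared difference between predicted and measured Y."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical validation objects (arbitrary units).
y_measured = [1.0, 4.0, 4.0, 7.0]
y_predicted = [2.0, 3.0, 5.0, 6.0]

error = rmsep(y_predicted, y_measured)
print(error)  # every prediction is off by exactly 1, so the RMSEP is 1.0
```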