
Chapter 4: Chemometrics: An outline of the development of the

4.4 Multivariate Data Analysis as a Tool to Explore Data

4.4.4 Prediction Modelling Using Partial Least Squares (PLS)

Partial Least Squares (PLS) regression was applied to the data from spectroscopic analysis and the chemical composition data collected from HPLC analysis. PLS is a multivariate method that models the covariance in the system under investigation. The variation within and between variables is described by variance and covariance: variance measures the spread of the values of a single variable, whereas covariance measures the association between two variables, X and Y or two X variables. If large values of variable X1 occur together with large values of X2, the covariance will be positive; conversely, if large values of variable X1 occur together with small values of variable X2, and vice versa, the covariance will be negative.
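The sign behaviour described above can be illustrated with a minimal Python sketch. The function name `sample_covariance` and the numbers are invented for illustration only (they are not data from this study), and the sample (n − 1) form of the covariance is assumed.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance: summed product of deviations from the means, divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1))

# Illustrative values only (hypothetical, not measurements from this thesis).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_rising = [2.0, 4.0, 5.0, 8.0, 11.0]    # large X1 occurs with large X2
x2_falling = [11.0, 8.0, 5.0, 4.0, 2.0]   # large X1 occurs with small X2

cov_pos = sample_covariance(x1, x2_rising)
cov_neg = sample_covariance(x1, x2_falling)
print("rising pair:", cov_pos, "falling pair:", cov_neg)
```

The first pair gives a positive covariance and the second a negative one, matching the description in the text.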

The correlation between two variables (x, y) is calculated by dividing the covariance value by the product of their respective standard deviations (S); this gives a unit-less scaled covariance measure:

$$R = \frac{\operatorname{cov}(x, y)}{S_x S_y}$$

where R is the correlation value. The correlation value always lies between −1 and +1; a correlation of 0 means that there is no linear relationship, whereas a correlation of +1 or −1 means that there is an exactly linear positive or negative relationship. R² is the most common form of expressing correlation [8, 12]. In general, a high R² value denotes a high correlation between two variables, so one variable can be used to estimate the value of the other accurately. In many cases, several variables in the dataset contribute to the property of interest Y. Correlation between variables does not ensure causality, i.e. if any variables are correlated with the property of interest, this does not necessarily mean that they are the cause of the value of the property of interest. Further investigation of the system and knowledge of contributing causes are required to establish cause and effect.
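As a hedged illustration of why the correlation is described as a unit-less scaled covariance, the sketch below computes R from the covariance and the two standard deviations, then rescales one variable (a change of units, for example). The function name `pearson_r` and all values are hypothetical and are not taken from this study.

```python
import numpy as np

def pearson_r(x, y):
    """Correlation as covariance divided by the product of the sample standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
    return float(cov_xy / (x.std(ddof=1) * y.std(ddof=1)))

# Illustrative values only (hypothetical).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 8.0, 11.0])

r = pearson_r(x, y)
r_rescaled = pearson_r(x, 1000.0 * y)      # same data, y expressed in different units
cov_original = np.cov(x, y)[0, 1]
cov_rescaled = np.cov(x, 1000.0 * y)[0, 1]

r_exact = pearson_r(x, 2.0 * x + 1.0)      # an exactly linear positive relationship
print("R:", r, "R after rescaling:", r_rescaled, "R squared:", r ** 2)
```

The covariance changes by the scaling factor, whereas R is unchanged and stays within −1 to +1; the exactly linear pair gives R = +1.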

In this thesis, PLS regression was applied to the spectroscopic data and the DHA results collected from HPLC analysis to determine whether there was a correlation between the Raman spectral data from L. scoparium leaves and the DHA levels in L. scoparium nectar. This method was investigated because DHA, anthocyanins and carotenoids are plant secondary metabolites and it [...] synthesis, anthocyanins etc. may be correlated in L. scoparium. The normalised DHA concentration values obtained from the nectar of the cultivars using HPLC are shown in Figure 4.17.

Figure 4.17. HPLC data showing DHA values for the cultivars normalised to 80° Brix. It should be noted that only five out of the seven initial cultivars were used in this model because only five cultivars from which nectar was collected flowered during this experiment.

[Figure 4.17 plot: DHA (mg/kg) normalised to 80° Brix, by cultivar (Y, P, MG, B, O).]

Figures 4.18-4.20 show the various overview plots from the resulting PLS model. Figure 4.18 is the explained variance plot, which shows how much of the variance in the data is explained by the model for each factor. The factors plotted represent percentages of the variance in the data for each variable, i.e. spectral features and DHA levels. PLS methods use the nonlinear iterative partial least squares (NIPALS) algorithm to calculate the components in PLS and PCA models. PLS factors are similar to PCA components. Total values below 50% indicate that a model is not well explained. The explained variance plot indicates that this model was well explained by the variance in the data and that three or four factors gave the best model, accounting for 78% of the explained variance. The model was well validated, as shown by the closeness of the red validation line to the modelled line. If the validated explained variance line starts to fall away from the modelled line, it indicates that those factors are not well validated for inclusion in the model.
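A minimal sketch of how a cumulative explained-variance curve of this kind can be produced is given below. It assumes a single response variable (PLS1), for which the NIPALS weight vector is obtained in one pass without inner iteration, and it uses synthetic data; the function name `pls1_nipals` and all numbers are hypothetical. It reports calibration (modelled) variance only; a validation line would additionally require cross-validation, which is not shown, and this is not necessarily how the software used in this work implements the calculation.

```python
import numpy as np

def pls1_nipals(X, y, n_factors):
    """Fit PLS1 factor by factor; return cumulative fraction of y variance explained."""
    Xk = np.asarray(X, dtype=float)
    yk = np.asarray(y, dtype=float)
    Xk = Xk - Xk.mean(axis=0)        # mean-centre each X variable
    yk = yk - yk.mean()              # mean-centre the response
    total_ss = float(yk @ yk)
    explained = []
    for _ in range(n_factors):
        w = Xk.T @ yk                # weights: direction of maximum covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                   # scores for this factor
        tt = float(t @ t)
        p = Xk.T @ t / tt            # X loadings
        q = float(yk @ t) / tt       # y loading
        Xk = Xk - np.outer(t, p)     # deflate X
        yk = yk - q * t              # deflate y
        explained.append(1.0 - float(yk @ yk) / total_ss)
    return np.array(explained)

# Synthetic data (assumption): 30 samples, 50 spectral variables, 3 latent sources.
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.05 * rng.normal(size=(30, 50))
y = latent @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=30)

cumulative = pls1_nipals(X, y, n_factors=4)
for k, frac in enumerate(cumulative, start=1):
    print(f"factors 1-{k}: {100 * frac:.1f}% of y variance explained")
```

The calibration curve can only rise or stay level as factors are added, which is why the validation line, rather than the modelled line, is used to judge how many factors to include.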

Figure 4.18. Explained variance plot for the PLS model.

Figure 4.19 shows the PLS score plot of factor 1 against factor 3, the combination of PLS components that best separates the five cultivars. These groupings by cultivar support the finding from the cumulative distribution plot that the spectral data did not have a normal distribution, and suggest that this distribution is due to the differences in leaf components between the cultivars.

Figure 4.19. PLS component plot of factor 3 versus factor 1 separates the P and Y cultivars better than the plot of PC1 vs PC2.

The graph in Figure 4.20 shows the PLS model itself, with the predicted line plotted against the reference line, and an R² value of 0.78, indicating a validated model with a predictive error of 764.77 mg/kg across a DHA range of 2800-8000 mg/kg.
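To make the two summary figures concrete, the sketch below computes an R² value and a prediction error from paired reference and predicted values. It assumes that R² is the coefficient of determination (1 − SS_res/SS_tot) and that the error is a root-mean-square error; the text does not state which definitions the software used, and some packages report the squared correlation instead. The function name `r2_and_rmse` and the values are hypothetical and are not the thesis data.

```python
import numpy as np

def r2_and_rmse(reference, predicted):
    """Return (R squared, RMSE) for predicted values against reference values."""
    reference = np.asarray(reference, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = reference - predicted
    ss_res = float(residuals @ residuals)                         # unexplained variation
    ss_tot = float(np.sum((reference - reference.mean()) ** 2))   # total variation
    r2 = 1.0 - ss_res / ss_tot
    rmse = float(np.sqrt(ss_res / len(reference)))
    return r2, rmse

# Hypothetical DHA values in mg/kg (illustration only).
reference = [2800.0, 4100.0, 5200.0, 6500.0, 8000.0]
predicted = [3300.0, 3900.0, 5600.0, 6000.0, 7700.0]

r2, rmse = r2_and_rmse(reference, predicted)
relative_error = rmse / (max(reference) - min(reference))   # error as a fraction of the range
print(f"R2 = {r2:.3f}, RMSE = {rmse:.1f} mg/kg, {100 * relative_error:.1f}% of range")
```

Expressing the error as a fraction of the measured range, as in the last line, is one way to judge whether an error of a given size is acceptable for the range of values being predicted.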

Figure 4.20. Regression graph of the PLS model. Note the R² value of 0.78 and the inclusion of the first three factors for this model.

In the case of PLS models it is worth noting the weighted regression coefficients, which relate to the loading information and show the relative contribution and influence of the wavenumbers from each factor on the model; they also indicate which component spectra have the most influence on the PLS model. In the case above, the first factor shown in the explained variance plot (Figure 4.18) accounts for approximately 65% of the model and therefore has the largest influence of all the factors, so the coefficients related to that factor have the most importance. As stated, the first four factors gave the best model, so the coefficients of all four factors need to be analysed to investigate which wavenumbers have the most influence in the overall model. Higher values in the weighted regression coefficient plots indicate
