• No results found

2.3. Chemometrics for Data Analysis 36

2.3.2. Data Processing 47

2.3.2.2. Calibration, Validation and Cross-Validation 52

“Calibration concerns how to find the mathematical formula that optimally converts, for example, instrument data X to results Y” (Martens and Martens 1986). The purpose of the calibration model is to relate the measured sample property to the spectral data collected from that sample (Westerhaus et al. 2004). The samples used to build the calibration have to (1) comprise a wide range of analyte values and also spectral responses, (2) have a wide variety of sample types and large constituent range, and (3) represent the target population (Malley and Martin 2003; Westerhaus et al.

2004). The calibration set should represent not only the chemical characteristics of the constituent of interest anticipated in the future samples to be predicted by the calibration, but also the physical characteristics (Brimmer et al. 2001; Malley and Martin 2003). Samples in the calibration set are also called the training set.

Various statistical methods are used to develop calibration models. After pre- treatment procedures are applied to the spectral data, a range of statistical methods can be used to relate pre-processed spectral data and analyte values. Statistical methods vary from relatively straightforward methods, such MLR, to more complex methods such as SMLR, PCR, PLSR, and Artificial Neural Networks (ANN) (Westerhaus et al. 2004).

There are many ways to select the samples for the calibration set. Westerhaus et al. (2004) described seven methods for determining how samples are selected. These are (1) random selection, (2) manual selection, (3) spectral subtraction, (4) stratified selection, (5) discriminant analysis using wavelength selection, (6) discriminant analysis using PCA selection, and (7) correlation matching. In general, the purpose of the sample selection is (a) to reduce the number of samples needed to develop a calibration at a practical level and (b) to obtain a data set that represents the population to be analysed by NIR spectroscopy (Westerhaus et al. 2004). The samples should cover the range of variability anticipated in future samples (Hruschka 2001).

Islam et al. (2003) used a random technique to separate the calibration and validation set. As reported by Brown et al. (2005), random selection may overestimate predictive accuracy and seems not appropriate because different selections led to different results. McCarty et al. (2002) found quite good prediction (with little bias) of

total C, inorganic C, organic C and organic C (acid) using validation set separated from one third samples of calibration set. There were 273 samples collected samples from 14 geographically diverse locations in the central of USA. But when they used an independent validation set composed solely of samples from the Nebraska sampling site, they found higher bias. Experience from previous study by inclusion of just a few samples from new sampling location into the calibration set can readily correct bias for similar location (McCarty and Reeves III 2000). As stated by McCarty et al. (2002), “the robustness of calibration set can depend on the extent of the property domain for the calibration set”. Williams (2001) described a selection procedure for separating calibration and validation sets by sorting samples using reference data and transferring the first three samples into the calibration set; leaving the next two samples at the validation set and then again taking the next three (6-8) sample into the calibration set; leaving the next two (9-10) in the validation set, and then taking again the next three (11-13) samples for the calibration set, and so on. If the number of samples is large, Williams (2001) suggested the ratio of calibration and validation set should be 3:1. Martin et al. (2002) determined the calibration and validation set by sorting the data (reference data) from the lowest to the highest value and dividing them into two sets by putting every other spectrum into set A (the calibration set), and the remaining spectra into set B (the validation set), so the ratio of calibration set and validation set is 1:1. Each set therefore represented the full range of constituent concentrations. This method was also used by Malley et al. (2002) and they obtained useful and successful prediction of various chemical properties of soil and hog manure. These selection methods for the calibration and validation sets require prior knowledge of reference values. Selection can also be based on spectral data. Valdes et al. (2006) used Mahalanobis (H) distance to select samples for the calibration and validation sets. Samples with a standardised H

distance of a minimum 0.6 from their nearest neighbour were chosen for the calibration set, and the rest of as the validation set. So the calibration set was selected on the basis of sample dissimilarities. This method is based on the ideas of Shenk and Westerhaus (1991a; b). Chang and Laird (2001) selected a calibration set from all of those in the data base (including the sample being tested) based on the shortest squared Euclidean distance to the derivative reflectance spectra for the test sample. So, in this case the calibration set was selected on a basis of similarity between samples. Due to lack reports dealing with comparative accuracy of those selection methods, this needs further investigation.

There is no fixed minimum number sample in the calibration sets used by researchers. Shepherd and Walsh (2002) evaluated the size of the calibration set from its predictive performance and found the performance decreased only gradually as the size of the calibration set was reduced from 800 to 200. But it then decreased rapidly when the calibration set contained less than 100-200 samples. Malley and Martin (2003) suggest at least 50 samples are needed for a calibration set, with preferably a minimum of 100 to 150 samples. However, some workers have used less than 50 samples; e.g. 41 samples (Daniel et al. 2003), 25 samples (Russel 2003) and 15 samples (Shibusawa et al. 2001). Thus it seems difficult to make a general statement about the minimum.

When using PLSR to develop calibration models, the optimum number of components (latent variables) should be determined. If the number of components used is more than the optimum number, the error of prediction may increase and the calibration models may be overfitting (Christy and Kvalheim 2007). The optimum number of components is obtained when the model produces the lowest predicted residual error sum of square (PRESS). This can be found by running leave-one-out cross-validation (Miller and Miller 2005; MINITAB Inc. 2003).

To calibrate spectral data with more than one response variable (dependent variable or Y-variable), we may treat the response variables separately or collectively. To calibrate spectral data against one Y-variable value is called PLSR1 (Miller and Miller 2005). “PLSR1 regression relates a single Y-variable to a block of X-variables x1, x2, x3,

x4,…, xK onto a new factors t1, t2,…,tA (A<K) and using these estimated factors as

regressors for the variable y” (Martens and Martens 1986). While, to treat more than one Y-variable collectively is called PLS2. It is usually used when the response variables are correlated with each other (Miller and Miller 2005).

There can be many factors causing error in developing calibration model. In Table 2.4 below, Mark and Workman (2003) summarised the sources of major calibration error. Furthermore, they emphasized the three commonest error sources in any calibration: (1) reference laboratory error (stochastic error source), (2) different aliquot error (non-homogeneity of sample - stochastic error source), and (3) non-representative sampling in the training set or calibration set population.

Table 2.4 The sources of major calibration error (Mark and Workman 2003). Source of error

• Sample non-homogeneity

• Laboratory error

• Physical variation in sample

• Chemical variation in sample with time

• Population sampling error

• Non-Beer’s Law relationship (non-linearity)

• Spectroscopy does not equal wet chemistry

• Instrument noise

• Integrated circuit problem

• Optical polarization

• Sample presentation extremely variable

• Calibration modelling incorrect

• Poor calibration transfer

• Outlier samples still in the calibration set

• Transcription error

2.3.2.2.2. Validation

Comparison of NIR measurements and reference data in the new sample set is called validation (or verification) of the calibration. As stated by Martens and Martens (1986), “validation is a procedure to ensure that the model and its estimated parameters really have predictive ability for future, unknown samples of the same general kind”. Validation provides a basis for calculation of the true measurement error. “It is an important step in the data analysis and guards against over-fitting” (Martens and Martens 1986).

2.3.2.2.3. Cross-Validation

“Cross-validation is a procedure that repeats the modelling several times, each time using only some of the training samples in the model estimation and the rest as test samples from which y is predicted from X” (Martens and Martens 1986). It involves eliminating the first sample, or block of samples, from the calibration model development, and predict the eliminated samples with the model developed; this procedure is repeated for all samples or block of samples (Williams 2001). Cross-

validation can be carried out by dividing the sample into equal “blocks” and eliminating samples one block at a time. However, the most effective way is to eliminate one sample at a time (called one-leave-out cross validation) (Williams 2001). Cross- validation is applied when the sample set is small. It is also applicable to large populations, but the computation may be time consuming. Cross-validation is used for model fitting and validation (Martens and Martens 1986).