
CALIBRATION WITH ERROR IN Y ONLY

Howard Mark


8.3 CALIBRATION WITH ERROR IN Y ONLY

In the simplest and most studied case there is error in only the Y (the dependent) variable (i.e., in the reference laboratory values for the constituent). In spectroscopic practice this situation frequently arises when natural or synthetic products must be analyzed by a reference method in order to obtain values for the calibration, or training, set of samples. Figure 8.4a illustrates the danger of using only two points to determine the calibration line in the presence of error. As calibration line 1 shows, it is very unlikely that the points chosen will actually correspond to the proper slope and intercept of the line that the majority of points follow. If the points chosen are particularly inopportune, a line such as shown by calibration line 2 might even be obtained. In a case such as this, the desired calibration line is one such as illustrated in Figure 8.4b, where a “best fitting” calibration line is determined in such a way that all the data are taken into account. In practice, the best fitting line is determined by the method of least-squares regression.
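The contrast between a two-point line and a best-fitting line can be sketched numerically. The example below is only an illustration under stated assumptions: the data are synthetic, the true slope, intercept, and noise level are invented for the demonstration, and the function names are hypothetical rather than taken from the text.

```python
import numpy as np

def two_point_line(x, y, i, j):
    """Slope and intercept of the line drawn through points i and j only."""
    slope = (y[j] - y[i]) / (x[j] - x[i])
    intercept = y[i] - slope * x[i]
    return slope, intercept

def least_squares_line(x, y):
    """Slope and intercept that minimize the sum of squared errors in y."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

def sum_squared_error(x, y, slope, intercept):
    """Sum of squared differences between y and the line's predictions."""
    return float(np.sum((y - (intercept + slope * x)) ** 2))

# Synthetic data (assumed values): error-free x, normally distributed error in y.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 10)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, x.size)

sse_ls = sum_squared_error(x, y, *least_squares_line(x, y))
sse_tp = sum_squared_error(x, y, *two_point_line(x, y, 3, 4))
print(f"least squares SSE: {sse_ls:.4f}, two-point SSE: {sse_tp:.4f}")
```

Whichever pair of points is chosen, the two-point line cannot have a smaller sum of squared errors than the least-squares line, because the latter is by construction the minimizer over all straight lines.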

The multiwavelength situation corresponding to Figure 8.4b is illustrated in Figure 8.5. Just as the line in Figure 8.4b was fitted to all the data for the single wavelength, the plane in Figure 8.5 is fitted to all the data in the multiwavelength case, splitting the differences among them.

The standard method of calibration under these conditions is the use of least-squares regression analysis. Regression is particularly suitable when the reference values for the analyte are determined by chemical analysis using a reference method. Regression then becomes the preferred method for a number of reasons: only the constituents for which calibrations are desired need be determined by the reference method; since the reference values usually contain the largest errors, making these the Y variables puts the errors in the proper place in the data; and, while simple regression is performed on data with symmetry between the two variables, the data used for multiple regression do not have the same symmetry under these circumstances. For example, in Figure 8.2, the two variables are equivalent and, as Equation (8.7) and Equation (8.8) show, can be related by equivalent mathematical formulas. On the other hand, if reference values for only one or a few constituents are available for the samples, it is not possible to write a set of equations corresponding to Equation (8.7) for the multivariate case; the symmetry of the two-variable system has been broken.

FIGURE 8.4 (a) When there is error in the data, using only two points to determine the line can result in a poorly fitting line as these two lines illustrate. (b) By getting a best fit to all the points, much better results will be obtained. [Axes: absorbance and concentration. Panel (a) shows Line 1 and Line 2; panel (b) shows the calibration line.]

FIGURE 8.5 When an error is present in the data, the calibration plane must be fit by least squares. [Axes: concentration, A1, and A2; the fitted surface is labeled "Calibration plane".]

Regression analysis, both simple regression (meaning only one independent variable) and multiple regression (more than one independent variable), has been developed over many years, and many of its characteristics are known. In particular, the least-squares method is valid only when the following four fundamental assumptions hold:

1. There is no error in the independent (X) variables.

2. The error in the dependent variable (Y) is normally distributed.

3. The error in the dependent variable is constant for all values of the variable.

4. The system of equations is linear in the coefficients (Note: not necessarily in the data; the variables representing the data may be of polynomial or even transcendental form).
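The fourth assumption can be illustrated with a small sketch: a model whose terms are nonlinear functions of the data (here a square and a logarithm) is still linear in its coefficients and can be fitted by ordinary least squares. The data, the coefficient values, and the helper name `fit_linear_in_coefficients` are assumptions made for this illustration only.

```python
import numpy as np

def fit_linear_in_coefficients(columns, y):
    """Least-squares coefficients for y = b0 + b1*columns[0] + b2*columns[1] + ...

    Each entry of `columns` may be any transformation of the raw data;
    the model remains linear in the coefficients b0, b1, ...
    """
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

# Noise-free synthetic data built from a polynomial and a transcendental term.
x = np.linspace(0.1, 2.0, 20)
y = 1.0 + 2.0 * x**2 + 0.5 * np.log(x)

b = fit_linear_in_coefficients([x**2, np.log(x)], y)
print(b)  # coefficients for the constant, x**2, and log(x) terms
```

Because the synthetic data contain no error, the fit should recover the coefficients used to generate them, to within floating-point precision.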

Here we present the derivation of the equations for multiple regression analysis (the "least-squares" fit of a multivariate equation to data). We note that the derivation is subject to these four assumptions being valid. Methods of testing these assumptions, and advice on dealing with cases in which they fail to apply, are well discussed in the statistical literature. The reader is strongly urged to become familiar with methods of inspecting data and to apply them routinely when performing regression calculations.

A more detailed derivation, including the derivation of the auxiliary statistics for a regression, can be found in Applied Regression Analysis, by N. Draper and H. Smith (John Wiley and Sons, 2nd ed., 1981), a book that should be considered required reading for anyone with more than the most casual interest in regression analysis.

A dataset to be used for calibration via the multiple regression least-squares method contains data for some number (n) of readings, each reading presumably corresponding to a specimen, and some number (m) of independent variables, corresponding to the optical data. The dataset also contains the values for the dependent variable: the analyte values from the reference laboratory. We begin by defining the error in an analysis as the difference between the reference laboratory value of the analyte, which we call $Y$, and the instrumental value for the constituent, which we call $\hat{Y}$ (read "Y-hat"):

$$e = Y - \hat{Y} \tag{8.10}$$

Using X to designate the values of the optical data, we note that, for any given sample, $\hat{Y}$ is calculated according to the calibration equation

$$\hat{Y} = b_0 + b_1X_1 + b_2X_2 + \cdots + b_mX_m \tag{8.11}$$

and thus the error e is equal to:

$$e = Y - b_0 - b_1X_1 - b_2X_2 - \cdots \tag{8.12}$$

Equation (8.12) gives the error for a single reading of a single specimen in the set of specimens composing the calibration set. The least-squares principle indicates that we wish to minimize the sum of the squares of the errors for the whole set. The sum of the squares of the errors is

$$\sum e^2 = \sum (Y - b_0 - b_1X_1 - b_2X_2 - \cdots)^2 \tag{8.13}$$

where the subscripts on the variables represent the different wavelengths, and the summations are taken over all n specimens in the set.
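As a concrete illustration of the quantities just defined, the following sketch evaluates the calibration equation, the per-specimen errors, and their sum of squares for a trial set of coefficients. The four specimens, two wavelengths, and trial coefficients are invented for the example, and the function names are hypothetical.

```python
import numpy as np

def predict(b, X):
    """Calibration equation: Y-hat = b0 + b1*X1 + ... + bm*Xm for each row of X."""
    return b[0] + X @ b[1:]

def sum_squared_error(b, X, y):
    """Sum over all specimens of the squared error e = Y - Y-hat."""
    e = y - predict(b, X)
    return float(np.sum(e**2))

# Invented data: rows are specimens, columns are readings at two wavelengths.
X = np.array([[0.10, 0.20],
              [0.30, 0.10],
              [0.50, 0.40],
              [0.70, 0.30]])
y = np.array([1.0, 1.5, 2.9, 3.2])      # reference laboratory values
b_trial = np.array([0.5, 3.0, 2.0])     # trial values of b0, b1, b2

sse = sum_squared_error(b_trial, X, y)
print(round(sse, 6))  # 0.06
```

For these numbers the predictions are 1.2, 1.6, 2.8, and 3.2, so the errors are -0.2, -0.1, 0.1, and 0.0, and the sum of their squares is 0.06. Changing `b_trial` changes this sum, which is the quantity that least squares minimizes.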

In order to minimize the sum of the squares of the errors, we do the usual mathematical exercise of taking the derivative and setting it equal to zero. For a given set of data, the error, and the sum of the squared errors, will vary as the coefficients change; therefore, the derivatives are taken with respect to each coefficient.

Taking the specified derivatives gives rise to the set of equations:

$$\sum 2(Y - b_0 - b_1X_1 - b_2X_2 - \cdots) = 0 \tag{8.15a}$$

$$\sum 2X_1(Y - b_0 - b_1X_1 - b_2X_2 - \cdots) = 0 \tag{8.15b}$$

$$\sum 2X_2(Y - b_0 - b_1X_1 - b_2X_2 - \cdots) = 0 \tag{8.15c}$$

The next step is to divide both sides of Equation (8.15) by two, and simplify by multiplying out the expressions:

$$\sum (Y - b_0 - b_1X_1 - b_2X_2 - \cdots) = 0 \tag{8.16a}$$

$$\sum (YX_1 - b_0X_1 - b_1X_1X_1 - b_2X_1X_2 - \cdots) = 0 \tag{8.16b}$$

$$\sum (YX_2 - b_0X_2 - b_1X_2X_1 - b_2X_2X_2 - \cdots) = 0 \tag{8.16c}$$
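The conditions expressed by Equation (8.16) can be checked numerically: at the least-squares solution, the errors sum to zero, and so do the errors multiplied by each independent variable. The sketch below assumes synthetic two-wavelength data with invented coefficients and noise level, and it uses NumPy's general least-squares solver rather than the derivation in the text.

```python
import numpy as np

# Synthetic calibration set (assumed values): 30 specimens, 2 wavelengths.
rng = np.random.default_rng(1)
n, m = 30, 2
X = rng.uniform(0.0, 1.0, size=(n, m))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(0.0, 0.05, n)

# Least-squares coefficients b0, b1, b2 (column of ones carries b0).
design = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(design, y, rcond=None)

e = y - design @ b          # per-specimen error, Y minus Y-hat
sum_e = float(np.sum(e))    # sum of errors, as in Eq. (8.16a)
sum_x_e = X.T @ e           # sums of X1*e and X2*e, as in Eq. (8.16b), (8.16c)

print(sum_e, sum_x_e)       # all values are zero to within rounding error
```

Any other choice of coefficients would leave at least one of these sums nonzero, which is another way of saying that the derivative of the sum of squared errors would not vanish there.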

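The next step is to separate the summations and rearrange the equations, noting that summing the constant $b_0$ over the n specimens gives $nb_0$:

$$nb_0 + b_1\sum X_1 + b_2\sum X_2 + \cdots = \sum Y \tag{8.17a}$$

$$b_0\sum X_1 + b_1\sum X_1^2 + b_2\sum X_1X_2 + \cdots = \sum X_1Y \tag{8.17b}$$

$$b_0\sum X_2 + b_1\sum X_1X_2 + b_2\sum X_2^2 + \cdots = \sum X_2Y \tag{8.17c}$$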

The multipliers of the $b_i$ on the left-hand side of Equation (8.17), as well as the terms on the right-hand side of Equation (8.17), are all constants that can be calculated from the measured data. Thus we have arrived at a system of m + 1 equations in m + 1 unknowns, the unknowns being the $b_i$, and the coefficients of the $b_i$ being the summations. Solving these equations thus gives us the values for the $b_i$ that minimize the sum of the squared errors of the original set of equations, represented by Equation (8.13).
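A minimal sketch of solving this system follows. It builds the matrix of sums on the left-hand side of Equation (8.17) and the vector of sums on the right-hand side, then solves for the coefficients. The data and the function name `solve_normal_equations` are assumptions for illustration; the reference values are generated without error so that the known coefficients can be recovered.

```python
import numpy as np

def solve_normal_equations(X, y):
    """Solve the m+1 equations of Eq. (8.17) for b0, b1, ..., bm.

    Returns the coefficients together with the left-hand matrix of sums
    and the right-hand vector of sums, for inspection.
    """
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])  # column of ones carries b0
    lhs = design.T @ design   # entries: n, sum(Xj), sum(Xj*Xk)
    rhs = design.T @ y        # entries: sum(Y), sum(Xj*Y)
    return np.linalg.solve(lhs, rhs), lhs, rhs

# Invented data: 5 specimens, 2 wavelengths, error-free reference values.
X = np.array([[0.10, 0.20],
              [0.30, 0.10],
              [0.50, 0.40],
              [0.70, 0.30],
              [0.90, 0.60]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

b, lhs, rhs = solve_normal_equations(X, y)
print(b)  # approximately [1. 2. 3.]
```

Forming the sums explicitly mirrors the derivation, but it can lose numerical precision when the independent variables are highly correlated; general-purpose routines such as `numpy.linalg.lstsq` avoid forming the matrix of sums and are usually the safer choice in practice.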

It is important to remind ourselves again at this point that the error of the data indicated in Figure 8.4b and Figure 8.5 is only in the Y (dependent) variable. In the vast majority of cases we deal with in near-infrared analysis (NIRA), this situation holds. Regression theory states that the standard error of estimate (SEE), the measure of error in the regression, should be equal to the error of the dependent variable. Indeed, it is because there is so little error in the instrumental measurements in NIRA that multiple regression analysis is an appropriate mathematical tool for calibrating the instruments.
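The statement about the SEE can be illustrated by simulation. The sketch below assumes the usual definition of the SEE as the square root of the sum of squared errors divided by n - m - 1, and it uses synthetic data in which the only error is normally distributed error of known size in Y. The coefficients, noise level, and function name are invented for the example.

```python
import numpy as np

def standard_error_of_estimate(X, y):
    """SEE = sqrt(sum of squared errors / (n - m - 1)) for a least-squares fit."""
    n, m = X.shape
    design = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ b
    return float(np.sqrt(np.sum(residuals**2) / (n - m - 1)))

# Synthetic data: error-free X, constant normally distributed error in Y only.
rng = np.random.default_rng(42)
n, sigma = 2000, 0.10
X = rng.uniform(0.0, 1.0, size=(n, 2))
y = 5.0 + X @ np.array([3.0, -2.0]) + rng.normal(0.0, sigma, n)

see = standard_error_of_estimate(X, y)
print(f"SEE = {see:.4f}, error in Y = {sigma:.4f}")
```

With a large calibration set the SEE should come out close to the standard deviation of the error placed in Y. If substantial error were also added to X, the first assumption would be violated and this agreement would no longer be expected.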