4.3 Mathematical Relation
4.3.5 Least Squares Method
To describe the relation between the dependent variable and the independent variables, the regression parameters in the regression function need to be determined. The method of least squares is used to determine these parameters, and it can be applied to both linear and nonlinear regression models [64, 68].
The method of least squares is widely used in curve fitting and data fitting. The method provides the line of best fit for a given set of data. It achieves this by minimising the sum of squared residuals, where a residual is the difference between an observed value and the fitted value provided by the model.
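The quantity being minimised can be illustrated with a minimal sketch. The function name `sum_squared_residuals`, the data values and the two candidate lines are hypothetical and chosen only for illustration; they are not taken from the measurements in this work.

```python
def sum_squared_residuals(xs, ys, model):
    """Return the sum of squared residuals of `model` over the data.

    A residual is the observed value minus the value fitted by the model.
    """
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying exactly on y = 1 + 2x (illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def exact_line(x):
    # Passes through every data point.
    return 1.0 + 2.0 * x


def offset_line(x):
    # Same slope, but shifted down by 1 at every data point.
    return 2.0 * x


print(sum_squared_residuals(xs, ys, exact_line))   # 0.0
print(sum_squared_residuals(xs, ys, offset_line))  # 4.0
```

The line with the smaller sum is the better fit in the least squares sense; the method of least squares finds the parameters for which this sum is as small as possible.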
A typical simple linear regression model is described by equation 4.1, where the parameters a0 and a1 determine the fit of the model and the accuracy with which the model predicts the dependent variable from the independent variable. Many values of a0 and a1 are possible, and the method of least squares finds the values which provide the best fit, that is, the values for which the sum of the squares of the differences between the value of the dependent variable y given by the model and the values in the data set is a minimum [69]. The least squares error function to be minimised is shown below in equation 4.5.
$$E(a_0, a_1) = \sum_{i=1}^{n} \left[ y_i - (a_0 + a_1 x_i) \right]^2 \tag{4.5}$$
Here a0 and a1 are the regression parameters and n represents the number of data points. The goal is to minimise the total error of the line of best fit for the given data. For a minimum to occur, the partial derivatives of the error function with respect to a0 and a1 are taken and set to zero [63, 66, 69].
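$$\frac{\partial}{\partial a_0} \sum_{i=1}^{n} \left[ y_i - (a_0 + a_1 x_i) \right]^2 = -2 \sum_{i=1}^{n} \left( y_i - a_0 - a_1 x_i \right) = 0 \tag{4.6}$$
and
$$\frac{\partial}{\partial a_1} \sum_{i=1}^{n} \left[ y_i - (a_0 + a_1 x_i) \right]^2 = -2 \sum_{i=1}^{n} \left( y_i - a_0 - a_1 x_i \right) x_i = 0 \tag{4.7}$$
The above equations can be simplified to
$$n a_0 + a_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \tag{4.8}$$
and
$$a_0 \sum_{i=1}^{n} x_i + a_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \tag{4.9}$$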
Solving the above equations for a0 and a1 gives
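$$a_0 = \frac{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i y_i \sum_{i=1}^{n} x_i}{n \left( \sum_{i=1}^{n} x_i^2 \right) - \left( \sum_{i=1}^{n} x_i \right)^2} \tag{4.10}$$
$$a_1 = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \left( \sum_{i=1}^{n} x_i^2 \right) - \left( \sum_{i=1}^{n} x_i \right)^2} \tag{4.11}$$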
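The closed-form solution for a0 and a1 (equations 4.10 and 4.11) can be evaluated directly from the sums over the data. The following is a minimal sketch of that calculation; the function name `fit_line` and the data values are hypothetical and are used only to illustrate the arithmetic.

```python
def fit_line(xs, ys):
    """Return (a0, a1) minimising sum((y - (a0 + a1 * x)) ** 2)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Common denominator n * sum(x^2) - (sum(x))^2 of both expressions.
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; no unique line exists")

    a0 = (sum_xx * sum_y - sum_xy * sum_x) / denom  # intercept
    a1 = (n * sum_xy - sum_x * sum_y) / denom       # slope
    return a0, a1


# Hypothetical data lying exactly on y = 1 + 2x (illustration only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a0, a1 = fit_line(xs, ys)
print(a0, a1)  # 1.0 2.0
```

For these values the sums are Σx = 10, Σy = 24, Σx² = 30 and Σxy = 70, so the denominator is 4·30 − 10² = 20, giving a0 = (720 − 700)/20 = 1 and a1 = (280 − 240)/20 = 2.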
The least squares method can also be applied to polynomial regression, and the general form of the polynomial model is shown in equation 4.12. In the case of the polynomial model the constants a0, a1, a2, a3,…,an are chosen to minimise the least squares error.
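$$P_n(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0 \tag{4.12}$$
$$P_n(x) = \sum_{j=0}^{n} a_j x^j \tag{4.13}$$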
Here n < m - 1, where n is the order of the polynomial and m is the number of data points. The least squares error function to be minimised is given as
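$$E = \sum_{i=1}^{m} \left[ y_i - P_n(x_i) \right]^2 \tag{4.14}$$
$$E = \sum_{i=1}^{m} y_i^2 - 2 \sum_{i=1}^{m} P_n(x_i) \, y_i + \sum_{i=1}^{m} \left( P_n(x_i) \right)^2 \tag{4.15}$$
$$E = \sum_{i=1}^{m} y_i^2 - 2 \sum_{i=1}^{m} \left( \sum_{j=0}^{n} a_j x_i^j \right) y_i + \sum_{i=1}^{m} \left( \sum_{j=0}^{n} a_j x_i^j \right)^2 \tag{4.16}$$
$$E = \sum_{i=1}^{m} y_i^2 - 2 \sum_{j=0}^{n} a_j \left( \sum_{i=1}^{m} y_i x_i^j \right) + \sum_{j=0}^{n} \sum_{k=0}^{n} a_j a_k \left( \sum_{i=1}^{m} x_i^{j+k} \right) \tag{4.17}$$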
As in the previous case of simple linear regression, for the error function to be minimised it is necessary that the partial derivatives $\partial E / \partial a_j = 0$, for each j=0,1,2,…,n.
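$$\frac{\partial E}{\partial a_j} = -2 \sum_{i=1}^{m} y_i x_i^j + 2 \sum_{k=0}^{n} a_k \left( \sum_{i=1}^{m} x_i^{j+k} \right) = 0 \tag{4.18}$$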
This gives n+1 normal equations in the n+1 unknowns aj,
$$\sum_{k=0}^{n} a_k \sum_{i=1}^{m} x_i^{j+k} = \sum_{i=1}^{m} y_i x_i^j \tag{4.19}$$
for each j=0,1,2,…,n. The equations can be expanded as shown below, where m is the number of data points, a0,a1,a2,…,an are the constants in the polynomial, n is the order of the polynomial, x is the independent variable and y is the dependent variable.
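$$\begin{aligned}
a_0 \sum_{i=1}^{m} x_i^0 + a_1 \sum_{i=1}^{m} x_i^1 + a_2 \sum_{i=1}^{m} x_i^2 + \cdots + a_n \sum_{i=1}^{m} x_i^n &= \sum_{i=1}^{m} y_i x_i^0 \\
a_0 \sum_{i=1}^{m} x_i^1 + a_1 \sum_{i=1}^{m} x_i^2 + a_2 \sum_{i=1}^{m} x_i^3 + \cdots + a_n \sum_{i=1}^{m} x_i^{n+1} &= \sum_{i=1}^{m} y_i x_i^1 \\
&\;\;\vdots \\
a_0 \sum_{i=1}^{m} x_i^n + a_1 \sum_{i=1}^{m} x_i^{n+1} + a_2 \sum_{i=1}^{m} x_i^{n+2} + \cdots + a_n \sum_{i=1}^{m} x_i^{2n} &= \sum_{i=1}^{m} y_i x_i^n
\end{aligned} \tag{4.20}$$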
The above equations can be solved in the same way as the equations for the simple regression model and will generate the values of the constants which give the least error and best fit for the given data.
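One possible way of solving the n+1 normal equations numerically is sketched below. This is an illustration under stated assumptions and not the procedure used in this work: the function name `fit_polynomial` and the data values are hypothetical, and Gaussian elimination with partial pivoting is simply one convenient choice of solver.

```python
def fit_polynomial(xs, ys, order):
    """Return [a0, a1, ..., an] solving the least squares normal equations."""
    m = len(xs)
    if order >= m - 1:
        raise ValueError("the order n must satisfy n < m - 1")
    size = order + 1

    # power_sums[p] = sum of x_i ** p for p = 0 .. 2n.
    power_sums = [sum(x ** p for x in xs) for p in range(2 * order + 1)]

    # Augmented matrix: row j is [sum x^(j+0), ..., sum x^(j+n) | sum y * x^j].
    aug = [
        [power_sums[j + k] for k in range(size)]
        + [sum(y * x ** j for x, y in zip(xs, ys))]
        for j in range(size)
    ]

    # Forward elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]

    # Back substitution.
    coeffs = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(aug[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (aug[row][size] - tail) / aug[row][row]
    return coeffs


# Hypothetical data lying exactly on y = 1 + 2x + 3x^2 (illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 6.0, 17.0, 34.0, 57.0]
coeffs = fit_polynomial(xs, ys, 2)
print(coeffs)  # approximately [1.0, 2.0, 3.0]
```

The normal equations become poorly conditioned as the order of the polynomial grows, so for higher orders a library routine based on an orthogonal factorisation may be a safer choice than this direct approach.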
4.3.6 Mathematical Relation
The method of least squares allows the user to apply the regression models to the data and determine the regression coefficients for the model chosen for analysis. To determine the mathematical equation defining the relation between the colour red and pH the method of least squares was applied to the data collected.
The mean values of the colour red collected from the histogram over the pH range 6.1 to 7.9 were used to derive the mathematical relation. The mean values for the colour red were normalised to a scale of 0 to 1. The normalised values of the colour red are shown below in table 4.1.
Table 4.1: The normalised values of the colour red for the different samples used in the regression and the average of the normalised values
The plot of the normalised values of the colour red from the different samples against the pH values is shown in figure 4.7. As observed previously in the plots of the colour red against pH, they all follow a similar trend. The normalised values of the colour red were averaged; the averages are shown in the last column of table 4.1 and are also plotted in figure 4.7, where the trendline shows the average values. The method of least squares was applied to all the curves in the plot to determine the curve which best fit the data. As observed from the plot, the relation between the colour and the pH is not completely linear and, as explained in section 4.3.4, a non-linear relation will not provide a better explanation; hence polynomial regression was chosen to define the relation.
Figure 4.7: A plot of the normalised values of the colour red against the pH. The plot also shows the average values of the colour red.
Polynomial regression using the method of least squares was applied to all the curves in the plot to find the curve which best fits the data. To define the equation in terms of the colour red, a regression of pH on the colour red was performed [64]. The regression analysis produced different equations for the different curves, and the equation which best fit the data was obtained from the curve of the average values, so this curve was used to define the relation. As a part of the polynomial regression, polynomials of order two and higher were applied to the curve. Table 4.2 shows the comparison of the predicted values of the pH for the different orders of the polynomial. Polynomials of order higher than six were not applied as they generally do not provide any improvement in the error function. It can be observed from the table that the fifth order polynomial predicts the pH closest to the actual value and that going to the sixth order does not increase the accuracy of the predictions. The fifth order polynomial model was therefore used to define the relation, and the polynomial equation is shown in equation 4.21.
4.21
Table 4.2: Comparison of the predicted pH values for the different orders of the polynomial equation determined by polynomial regression.
The fifth order polynomial equation was chosen as it provides better resolution and avoids rounding errors in determining the pH. The higher orders did not present any improvement in the resolution or the rounding errors.
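The comparison of polynomial orders can be sketched as follows. The values in `red` and `ph` are synthetic and are NOT the measurements of table 4.1, the function names are hypothetical, and the sum of squared errors is used here as the comparison measure as an assumption; the comparison reported in table 4.2 is based on the predicted pH values themselves.

```python
def fit_polynomial(xs, ys, order):
    """Least squares coefficients [a0, ..., an] via the normal equations."""
    size = order + 1
    # Augmented matrix: row j holds sum(x^(j+k)) for k = 0..n and sum(y * x^j).
    aug = [
        [sum(x ** (j + k) for x in xs) for k in range(size)]
        + [sum(y * x ** j for x, y in zip(xs, ys))]
        for j in range(size)
    ]
    for col in range(size):  # forward elimination with partial pivoting
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    coeffs = [0.0] * size
    for row in range(size - 1, -1, -1):  # back substitution
        tail = sum(aug[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (aug[row][size] - tail) / aug[row][row]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate a0 + a1*x + ... + an*x**n using Horner's rule."""
    result = 0.0
    for a in reversed(coeffs):
        result = result * x + a
    return result


# SYNTHETIC values for illustration only; not the data collected in this work.
red = [0.95, 0.90, 0.82, 0.71, 0.58, 0.45, 0.33, 0.22, 0.13, 0.06]
ph = [6.1, 6.3, 6.5, 6.7, 6.9, 7.1, 7.3, 7.5, 7.7, 7.9]

sse_by_order = {}
for order in range(2, 7):
    coeffs = fit_polynomial(red, ph, order)  # regression of pH on red
    sse_by_order[order] = sum(
        (y - evaluate(coeffs, x)) ** 2 for x, y in zip(red, ph)
    )
    print(f"order {order}: sum of squared errors = {sse_by_order[order]:.3e}")
```

With real measurements, the order would be chosen at the point where raising it further no longer gives a meaningful reduction in the error or improvement in the predicted pH.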