Nonlinear Regression - Regression Analysis 10.1


10.1.2 Nonlinear Regression

A nonlinear regression model, i.e. a model that is nonlinear in the regression parameters 𝜽 = (𝜃₀, 𝜃₁, 𝜃₂, …, 𝜃_𝑘), is used to estimate 𝜽 under an assumed nonlinear relationship between 𝒚 and 𝒙_𝑖. For a sample of 𝑛 observations, the relation can be written in general form as the following set of regression equations (Gallant, 1975, p. 73):

$$y_i = f(\mathbf{x}_i, \boldsymbol{\theta}) + \epsilon_i, \quad i = 1, 2, \ldots, n \tag{10.1.56}$$

The term 𝑦_𝑖 is the value of the dependent variable for the 𝑖-th of 𝑛 observations, 𝑓(𝒙, 𝜽) is the expectation function, 𝒙_𝑖 is the (𝑘 + 1)-dimensional row vector of inputs for the 𝑖-th observation (i.e. including a constant), 𝜽 is a (𝑘 + 1)-dimensional vector of unknown parameters, and 𝜖_𝑖 is the error term for the 𝑖-th observation with the same properties as in linear regression (i.e. 𝝐 ~ 𝑁_𝑛(𝟎, 𝜎_𝝐²𝑰_𝑛)). The general nonlinear model can also be expressed in matrix form as follows (Gallant, p. 73):

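$$\mathbf{y} = f(\mathbf{x}, \boldsymbol{\theta}) + \boldsymbol{\epsilon} \tag{10.1.57}$$

where

$$\mathbf{y}' = (y_1, y_2, \ldots, y_n) \tag{10.1.58}$$

$$f'(\boldsymbol{\theta}) = [f(\mathbf{x}_1, \boldsymbol{\theta}), f(\mathbf{x}_2, \boldsymbol{\theta}), \ldots, f(\mathbf{x}_n, \boldsymbol{\theta})] \tag{10.1.59}$$

$$\boldsymbol{\epsilon}' = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n) \tag{10.1.60}$$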

The likelihood of the general nonlinear model 𝑙(𝜽, 𝜎_𝝐²) can be represented as shown below (Fox, p. 463):

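$$l(\boldsymbol{\theta}, \sigma_{\boldsymbol{\epsilon}}^2) = \frac{1}{(2\pi\sigma_{\boldsymbol{\epsilon}}^2)^{n/2}} \exp\left[-\frac{1}{2\sigma_{\boldsymbol{\epsilon}}^2} SSE_{NL}(\boldsymbol{\theta})\right] \tag{10.1.61}$$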

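The function 𝑆𝑆𝐸_𝑁𝐿(𝜽)⁸⁸ denotes the sum of squared errors function for nonlinear regression and can be written explicitly as follows (Fox, p. 463):

$$SSE_{NL}(\boldsymbol{\theta}) = \sum_{i=1}^{n} [y_i - f(\mathbf{x}_i, \boldsymbol{\theta})]^2 = [\mathbf{y} - f(\boldsymbol{\theta})]'[\mathbf{y} - f(\boldsymbol{\theta})] = \|\mathbf{y} - f(\boldsymbol{\theta})\|^2 \tag{10.1.62}$$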
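As an illustration of Eq. (10.1.62), the sketch below evaluates 𝑆𝑆𝐸_𝑁𝐿(𝜽) in Python for a hypothetical two-parameter expectation function 𝑓(𝑥, 𝜽) = 𝜃₁ exp(𝜃₂𝑥). The model, the data and the names `f` and `sse_nl` are assumptions made for this example only and do not come from the cited sources.

```python
import math

def f(x, theta):
    """Hypothetical expectation function: theta1 * exp(theta2 * x)."""
    theta1, theta2 = theta
    return theta1 * math.exp(theta2 * x)

def sse_nl(theta, xs, ys):
    """Sum of squared errors over all observations, Eq. (10.1.62)."""
    return sum((y - f(x, theta)) ** 2 for x, y in zip(xs, ys))

# Illustrative, noise-free data generated from theta = (2.0, 0.5)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [f(x, (2.0, 0.5)) for x in xs]

print(sse_nl((2.0, 0.5), xs, ys))  # SSE at the generating parameters
print(sse_nl((1.5, 0.6), xs, ys))  # SSE at a different parameter value
```

Since the data here contain no noise, the sum of squares vanishes at the generating parameters and is positive elsewhere; with real data the minimum would generally be positive.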

As in the case of the general linear model, the objective is to maximize 𝑙(𝜽, 𝜎_𝝐²) by minimizing 𝑆𝑆𝐸_𝑁𝐿. Subsequently, 𝑆𝑆𝐸_𝑁𝐿 can be differentiated to derive normal equations as indicated below (Fox, p. 464):

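$$\frac{\partial SSE_{NL}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = -2 \sum_{i=1}^{n} [y_i - f(\mathbf{x}_i, \boldsymbol{\theta})] \frac{\partial f(\mathbf{x}_i, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \tag{10.1.63}$$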

The normal equations are obtained by setting these partial derivatives to 0 and replacing the unknown parameters 𝜽 with the vector of nonlinear least squares estimates 𝜽̂. They can also be represented in matrix form as follows (Fox, p. 464):

$$[\mathbf{F}(\hat{\boldsymbol{\theta}}, \mathbf{x})]'[\mathbf{y} - f(\hat{\boldsymbol{\theta}}, \mathbf{x})] = \mathbf{0} \tag{10.1.64}$$

The term 𝑭(𝜽̂, 𝒙) is the matrix of derivatives whose entry in the 𝑖-th row and 𝑗-th column is (Fox, p. 464):

$$F_{i,j} = \frac{\partial f(\hat{\boldsymbol{\theta}}, \mathbf{x}_i)}{\partial \hat{\theta}_j} \tag{10.1.65}$$

In nonlinear regression models, the derivatives of expectation functions w.r.t. the parameters in 𝜽̂ depend on at least one of the parameters in 𝜽̂. Note that in linear regression the derivatives are not functions of 𝜷’s. Therefore, in nonlinear regression more advanced methods are required for the computation of 𝜽̂. In the following subsection, information on the methods of estimating 𝜽 is given.
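A small worked example may make this distinction concrete. The exponential model below is an assumed illustration and is not taken from the cited sources.

```latex
% Assumed nonlinear expectation function: both derivatives depend on the parameters
f(x, \boldsymbol{\theta}) = \theta_1 e^{\theta_2 x}
\quad\Rightarrow\quad
\frac{\partial f}{\partial \theta_1} = e^{\theta_2 x}, \qquad
\frac{\partial f}{\partial \theta_2} = \theta_1 x \, e^{\theta_2 x}

% Linear counterpart: the derivatives are free of the parameters
f(x, \boldsymbol{\beta}) = \beta_0 + \beta_1 x
\quad\Rightarrow\quad
\frac{\partial f}{\partial \beta_0} = 1, \qquad
\frac{\partial f}{\partial \beta_1} = x
```

In the first case the entries of the derivative matrix change with the parameter values, so the normal equations cannot in general be solved in closed form; in the second case they reduce to the usual linear least squares equations.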

10.1.2.1.1 Methods of Computing Nonlinear Least Squares Estimators

The computation of the nonlinear normal equations starts with the linearization of the nonlinear function and continues with the application of the least-squares method to the linearized relation. The linearization of the expectation function is achieved by using the Taylor series expansion of 𝑓(𝒙_𝑖, 𝜽) about the point 𝜽₀′ = [𝜃_{1,0}, 𝜃_{2,0}, …, 𝜃_{𝑝,0}], where 𝑝 = 𝑘 + 1 is the number of parameters, without the second and higher order terms of the series, as shown below (Draper & Smith, 1981, p. 462):

$$f(\mathbf{x}_i, \boldsymbol{\theta}) \approx f(\mathbf{x}_i, \boldsymbol{\theta}_0) + \sum_{j=1}^{k+1} \left[\frac{\partial f(\mathbf{x}_i, \boldsymbol{\theta})}{\partial \theta_j}\right]_{\boldsymbol{\theta} = \boldsymbol{\theta}_0} (\theta_j - \theta_{j,0}), \quad i = 1, 2, \ldots, n \tag{10.1.66}$$

The zero subscript in 𝜽₀ in Eq. (10.1.66) denotes the initial (zeroth) iteration, i.e. the chosen starting value of 𝜽.

88 The subscript "𝑁𝐿" in 𝑆𝑆𝐸_𝑁𝐿 stands for nonlinear regression, to distinguish it from the 𝑆𝑆𝐸 mentioned in Subsection 10.1.1 for linear regression.

The common methods of computing nonlinear least squares estimators are stated to be Hartley's modified Gauss-Newton method and Marquardt's algorithm (Gallant, p. 76). This subsection outlines how both methods combine linearization with an iterative process suitable for routine computer calculation.

Hartley's modified Gauss-Newton method

The Gauss-Newton method is based on substituting the first-order Taylor series expansion of 𝑓(𝜽) about a trial (𝑇) parameter value 𝜽_𝑇 into the formula for 𝑆𝑆𝐸_𝑁𝐿(𝜽) (Gallant, p. 76):

$$SSE_{NL}(\boldsymbol{\theta}_T) = \|\mathbf{y} - f(\boldsymbol{\theta}_T) - F(\boldsymbol{\theta}_T)(\boldsymbol{\theta} - \boldsymbol{\theta}_T)\|^2 \tag{10.1.67}$$

The approximating sum of squares in Eq. (10.1.67) can be minimized by linear least squares. To do so, the terms of the general nonlinear regression model are replaced by the corresponding terms evaluated at 𝜽₀, given below (Draper & Smith, p. 462):

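$$f_i^0 = f(\mathbf{x}_i, \boldsymbol{\theta}_0) \tag{10.1.68}$$

$$b_j^0 = \theta_j - \theta_{j,0} \tag{10.1.69}$$

$$F_{i,j}^0 = \left[\frac{\partial f(\mathbf{x}_i, \boldsymbol{\theta})}{\partial \theta_j}\right]_{\boldsymbol{\theta} = \boldsymbol{\theta}_0} \tag{10.1.70}$$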

Subsequently, the substitution yields an approximating linear regression model, as represented below (Draper & Smith, p. 463):

$$y_i - f_i^0 = \sum_{j=1}^{p} b_j^0 F_{i,j}^0 + \epsilon_i, \quad i = 1, 2, \ldots, n \tag{10.1.71}$$

or in vector form, with 𝒚₀ = 𝒚 − 𝒇₀, as

$$\mathbf{y}_0 = \mathbf{F}_0 \mathbf{b}_0 + \boldsymbol{\epsilon} \tag{10.1.72}$$

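Hence, the estimate of 𝒃₀, i.e. 𝒃̂₀, can be computed by the least squares method as follows (Draper & Smith, p. 463):

$$\hat{\mathbf{b}}_0 = (\mathbf{F}_0'\mathbf{F}_0)^{-1}\mathbf{F}_0'\mathbf{y}_0 = (\mathbf{F}_0'\mathbf{F}_0)^{-1}\mathbf{F}_0'(\mathbf{y} - \mathbf{f}_0) \tag{10.1.73}$$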
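To make Eq. (10.1.73) concrete, the following Python sketch computes a single Gauss-Newton increment for a hypothetical two-parameter model 𝑓(𝑥, 𝜽) = 𝜃₁ exp(𝜃₂𝑥). The model, the noise-free data, the starting value and all function names are assumptions chosen for illustration; the 2×2 system 𝑭₀′𝑭₀ is inverted by hand so that no external library is needed.

```python
import math

def f(x, theta):
    """Hypothetical expectation function: theta1 * exp(theta2 * x)."""
    t1, t2 = theta
    return t1 * math.exp(t2 * x)

def sse_nl(theta, xs, ys):
    """Sum of squared errors, Eq. (10.1.62)."""
    return sum((y - f(x, theta)) ** 2 for x, y in zip(xs, ys))

def gauss_newton_increment(theta0, xs, ys):
    """b0_hat = (F0'F0)^{-1} F0'(y - f0), Eq. (10.1.73), for two parameters."""
    t1, t2 = theta0
    a11 = a12 = a22 = g1 = g2 = 0.0
    for x, y in zip(xs, ys):
        e = math.exp(t2 * x)
        d1, d2 = e, t1 * x * e      # row i of F0, Eq. (10.1.70)
        r = y - t1 * e              # y_i - f_i^0
        a11 += d1 * d1              # accumulate F0'F0
        a12 += d1 * d2
        a22 += d2 * d2
        g1 += d1 * r                # accumulate F0'(y - f0)
        g2 += d2 * r
    det = a11 * a22 - a12 * a12     # determinant of the 2x2 matrix F0'F0
    return ((a22 * g1 - a12 * g2) / det, (a11 * g2 - a12 * g1) / det)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [f(x, (2.0, 0.5)) for x in xs]  # noise-free data from theta = (2.0, 0.5)

theta0 = (1.8, 0.55)                 # assumed starting value
b0 = gauss_newton_increment(theta0, xs, ys)
theta1 = (theta0[0] + b0[0], theta0[1] + b0[1])
print(theta1, sse_nl(theta0, xs, ys), sse_nl(theta1, xs, ys))
```

For this starting value the full step moves the estimate closer to the generating parameters and lowers the sum of squares; in general a full step is not guaranteed to do so, which motivates a step length between 0 and 1.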

The value of the parameter 𝜽_𝑀 minimizing the approximating sum of squares after 𝑇 iterations can be expressed as given in Eqs. (10.1.74) and (10.1.75) (Gallant, p. 76):

$$\boldsymbol{\theta}_M = \boldsymbol{\theta}_T + \hat{\mathbf{b}}_T \tag{10.1.74}$$

$$\boldsymbol{\theta}_M = \boldsymbol{\theta}_T + [F'(\boldsymbol{\theta}_T)F(\boldsymbol{\theta}_T)]^{-1}F'(\boldsymbol{\theta}_T)[\mathbf{y} - f(\boldsymbol{\theta}_T)] \tag{10.1.75}$$

The iterative solution process for the approximating sum of squares proposed by Hartley proceeds as follows (Gallant, p. 76):

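1. 0th iteration: Choose a starting estimate 𝜽₀ and compute

$$\mathbf{D}_0 = [F'(\boldsymbol{\theta}_0)F(\boldsymbol{\theta}_0)]^{-1}F'(\boldsymbol{\theta}_0)[\mathbf{y} - f(\boldsymbol{\theta}_0)] \tag{10.1.76}$$

Then, find a 𝜆₀ between 0 and 1 such that

$$SSE_{NL}(\boldsymbol{\theta}_0 + \lambda_0\mathbf{D}_0) \le SSE_{NL}(\boldsymbol{\theta}_0) \tag{10.1.77}$$

2. 1st iteration: Let 𝜽₁ = 𝜽₀ + 𝜆₀𝑫₀ and compute

$$\mathbf{D}_1 = [F'(\boldsymbol{\theta}_1)F(\boldsymbol{\theta}_1)]^{-1}F'(\boldsymbol{\theta}_1)[\mathbf{y} - f(\boldsymbol{\theta}_1)] \tag{10.1.78}$$

Then, find a 𝜆₁ between 0 and 1 such that

$$SSE_{NL}(\boldsymbol{\theta}_1 + \lambda_1\mathbf{D}_1) \le SSE_{NL}(\boldsymbol{\theta}_1) \tag{10.1.79}$$

3. 2nd iteration: Let 𝜽₂ = 𝜽₁ + 𝜆₁𝑫₁

⋮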

A practical method for choosing the step length 𝜆_𝑙 at each iteration 𝑙 is to pick the largest number in the sequence 𝑎_𝑞 = (0.8)^𝑞, 𝑞 = 0, 1, 2, …, for which 𝑆𝑆𝐸_𝑁𝐿(𝜽_𝑙 + 𝑎_𝑞𝑫_𝑙) < 𝑆𝑆𝐸_𝑁𝐿(𝜽_𝑙) (Gallant, p. 76). See Gallant (p. 76) for other methods of choosing 𝜆_𝑙. The iterative solution process can be continued until it is terminated by a stopping rule such as

$$\|\boldsymbol{\theta}_l - \boldsymbol{\theta}_{l+1}\| < \varepsilon(\|\boldsymbol{\theta}_l\| + \tau) \tag{10.1.80}$$

and simultaneously

$$|SSE_{NL}(\boldsymbol{\theta}_l) - SSE_{NL}(\boldsymbol{\theta}_{l+1})| < \varepsilon(SSE_{NL}(\boldsymbol{\theta}_l) + \tau) \tag{10.1.81}$$

where 𝜀 > 0 and 𝜏 > 0 are preset tolerance limits, e.g. 𝜀 = 10⁻⁵ and 𝜏 = 10⁻³ (Gallant, p. 76).
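The iteration described above can be sketched in Python as follows, again for the hypothetical model 𝑓(𝑥, 𝜽) = 𝜃₁ exp(𝜃₂𝑥) with noise-free illustrative data. The step length is taken from the sequence (0.8)^𝑞 and the loop stops when Eqs. (10.1.80) and (10.1.81) hold together. The cap of 60 trial step lengths, the iteration limit and all names are assumptions of this sketch, not part of Hartley's or Gallant's description.

```python
import math

def f(x, theta):
    """Hypothetical expectation function: theta1 * exp(theta2 * x)."""
    t1, t2 = theta
    return t1 * math.exp(t2 * x)

def sse_nl(theta, xs, ys):
    """Sum of squared errors, Eq. (10.1.62)."""
    return sum((y - f(x, theta)) ** 2 for x, y in zip(xs, ys))

def direction(theta, xs, ys):
    """D = (F'F)^{-1} F'(y - f(theta)), as in Eq. (10.1.76), two parameters."""
    t1, t2 = theta
    a11 = a12 = a22 = g1 = g2 = 0.0
    for x, y in zip(xs, ys):
        e = math.exp(t2 * x)
        d1, d2 = e, t1 * x * e          # derivatives w.r.t. theta1, theta2
        r = y - t1 * e                  # residual
        a11 += d1 * d1; a12 += d1 * d2; a22 += d2 * d2
        g1 += d1 * r; g2 += d2 * r
    det = a11 * a22 - a12 * a12
    return ((a22 * g1 - a12 * g2) / det, (a11 * g2 - a12 * g1) / det)

def hartley_gauss_newton(theta, xs, ys, eps=1e-5, tau=1e-3, max_iter=100):
    sse = sse_nl(theta, xs, ys)
    for _ in range(max_iter):
        d = direction(theta, xs, ys)
        # Step length: largest a_q = 0.8**q that strictly decreases the SSE
        for q in range(60):
            lam = 0.8 ** q
            trial = (theta[0] + lam * d[0], theta[1] + lam * d[1])
            trial_sse = sse_nl(trial, xs, ys)
            if trial_sse < sse:
                break
        else:
            return theta                # no improving step found; keep estimate
        step = math.hypot(trial[0] - theta[0], trial[1] - theta[1])
        small_step = step < eps * (math.hypot(*theta) + tau)        # Eq. (10.1.80)
        small_change = abs(sse - trial_sse) < eps * (sse + tau)     # Eq. (10.1.81)
        theta, sse = trial, trial_sse
        if small_step and small_change:
            break
    return theta

xs = [0.0, 1.0, 2.0, 3.0]
ys = [f(x, (2.0, 0.5)) for x in xs]     # noise-free data from theta = (2.0, 0.5)
theta_hat = hartley_gauss_newton((1.8, 0.55), xs, ys)
print(theta_hat)
```

Because the strict decrease in the sum of squares is checked at every step, the sequence of estimates never moves uphill, which is the purpose of the step factor 𝜆 in the modified method.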

Marquardt's algorithm

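Marquardt's algorithm is another method that minimizes the approximating sum of squares 𝑆𝑆𝐸_𝑁𝐿(𝜽_𝑇), using the step given below (Gallant, p. 77):

$$\boldsymbol{\theta}_\delta = \boldsymbol{\theta}_T + [F'(\boldsymbol{\theta}_T)F(\boldsymbol{\theta}_T) + \delta\mathbf{I}]^{-1}F'(\boldsymbol{\theta}_T)[\mathbf{y} - f(\boldsymbol{\theta}_T)] \tag{10.1.82}$$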

Marquardt's algorithm rests on the fact that, for all sufficiently large 𝛿, 𝜽_𝛿 is an improvement over 𝜽_𝑇, i.e. 𝑆𝑆𝐸_𝑁𝐿(𝜽_𝛿) is smaller than 𝑆𝑆𝐸_𝑁𝐿(𝜽_𝑇), under appropriate conditions (Gallant, p. 77). The initial value 𝛿₀ is commonly set to some small number, e.g. 10⁻⁸ (Fox, p. 466). If the (𝑙 + 1)-th iteration results in 𝑆𝑆𝐸_𝑁𝐿(𝜽_{𝑙+1}) < 𝑆𝑆𝐸_𝑁𝐿(𝜽_𝑙), then the new value 𝜽_{𝑙+1} is accepted and the next iteration is initiated with 𝛿_{𝑙+2} = 𝛿_{𝑙+1}/10; if, however, 𝑆𝑆𝐸_𝑁𝐿(𝜽_{𝑙+1}) > 𝑆𝑆𝐸_𝑁𝐿(𝜽_𝑙), then 𝛿 is increased by a factor of ten and the step is tried again (Fox, p. 466). The Marquardt procedure resembles Gauss-Newton when 𝛿 is small. Note that Marquardt's algorithm is stated to be more difficult to implement than Gauss-Newton, since both the conditioning factor 𝛿 and the step factor 𝜆 must be manipulated (Bates & Watts, p. 81).

See Bates and Watts (p. 81) for more information.
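A minimal Python sketch of the 𝛿 update rule described above is given below for the hypothetical model 𝑓(𝑥, 𝜽) = 𝜃₁ exp(𝜃₂𝑥). It adds the damped increment to the current estimate, divides 𝛿 by ten after a successful step and multiplies it by ten after a failed one. The stopping criteria (`tol`, the upper bound on 𝛿 and `max_iter`), the data and all names are assumptions of this sketch and are not prescribed by the cited sources; the step factor 𝜆 is left out for brevity.

```python
import math

def f(x, theta):
    """Hypothetical expectation function: theta1 * exp(theta2 * x)."""
    t1, t2 = theta
    return t1 * math.exp(t2 * x)

def sse_nl(theta, xs, ys):
    """Sum of squared errors, Eq. (10.1.62)."""
    return sum((y - f(x, theta)) ** 2 for x, y in zip(xs, ys))

def marquardt_increment(theta, xs, ys, delta):
    """[F'F + delta*I]^{-1} F'(y - f(theta)) for two parameters."""
    t1, t2 = theta
    a11 = a12 = a22 = g1 = g2 = 0.0
    for x, y in zip(xs, ys):
        e = math.exp(t2 * x)
        d1, d2 = e, t1 * x * e
        r = y - t1 * e
        a11 += d1 * d1; a12 += d1 * d2; a22 += d2 * d2
        g1 += d1 * r; g2 += d2 * r
    a11 += delta                        # add delta to the diagonal of F'F
    a22 += delta
    det = a11 * a22 - a12 * a12
    return ((a22 * g1 - a12 * g2) / det, (a11 * g2 - a12 * g1) / det)

def marquardt(theta, xs, ys, delta=1e-8, tol=1e-12, max_iter=200):
    sse = sse_nl(theta, xs, ys)
    for _ in range(max_iter):
        b = marquardt_increment(theta, xs, ys, delta)
        trial = (theta[0] + b[0], theta[1] + b[1])
        trial_sse = sse_nl(trial, xs, ys)
        if trial_sse < sse:             # success: accept and relax the damping
            improvement = sse - trial_sse
            theta, sse = trial, trial_sse
            delta /= 10.0
            if improvement < tol:       # assumed stopping rule
                break
        else:                           # failure: increase the damping, retry
            delta *= 10.0
            if delta > 1e12:            # assumed safeguard against endless retries
                break
    return theta

xs = [0.0, 1.0, 2.0, 3.0]
ys = [f(x, (2.0, 0.5)) for x in xs]     # noise-free data from theta = (2.0, 0.5)
theta_hat = marquardt((1.8, 0.55), xs, ys)
print(theta_hat)
```

With 𝛿 close to zero the increment is practically the Gauss-Newton increment, while a large 𝛿 shortens the step, which is why raising 𝛿 after a failed step eventually yields an improvement under appropriate conditions.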

Gallant (p. 78) notes that neither method is guaranteed to converge to 𝜽_𝑀 from a given starting value. Failure to converge may depend both on the distance of the starting value from the correct answer and on the extent of over-parameterization in the response function relative to the data. If the iterations fail to converge, it is recommended to find better starting values or to use a similar response function with fewer parameters. Further, when convergence is achieved, it is suggested to try several reasonable starting values and check whether the iterations converge to the same answer for each of them.

10.1.2.1.2 Statistical Properties of Nonlinear Least Squares Estimators

Nonlinear regression inference is carried out through the linear approximation of the nonlinearity (discussed in Subsection 10.1.2.1.1), which reduces the problem to the linear case so that linear model inference results can be used by analogy (Bates & Watts, p. 52). Note that the use of the approximation leads to approximate (asymptotic) results rather than exact ones. In particular, the standard error is exact only in the limit of an infinitely large sample. For a finite sample size, the calculated standard error is only an approximation, which improves as the sample size gets larger.

Two nonlinear model inferences that can be considered in analogy with linear model inferences are given in the following. See Gallant (pp. 78-81) for more information on hypothesis testing and confidence intervals in nonlinear regression models, and see Subsection 10.1.1.2 for the analogous linear model inferences.

An approximate 100(1 − 𝛼)% confidence interval for 𝜃_𝑗 with an approximate standard error 𝐴𝑆𝐸(𝜃̂_𝑗) can be expressed by the confidence statement given below (Graybill & Iyer, 1994, p. 610):

$$C\left[\hat{\theta}_j - t_{n-k-1}^{1-\alpha/2}\,ASE(\hat{\theta}_j) \le \theta_j \le \hat{\theta}_j + t_{n-k-1}^{1-\alpha/2}\,ASE(\hat{\theta}_j)\right] \approx 1 - \alpha \tag{10.1.83}$$

An approximate hypothesis test at the 𝛼 level of significance can be written as represented below:

$$H_0: \theta_j = c \tag{10.1.84}$$

$$H_1: \theta_j \ne c \tag{10.1.85}$$

where 𝑐 is any specified number. The test can be performed as follows (Graybill & Iyer, p. 610):

1. Compute 𝑡₀ = (𝜃̂_𝑗 − 𝑐) / 𝐴𝑆𝐸(𝜃̂_𝑗),

2. Reject 𝐻₀ if |𝑡₀| > 𝑡_{𝑛−𝑘−1}^{1−𝛼/2}.
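The arithmetic of the interval and the test can be illustrated with a short Python sketch. All numbers are hypothetical: an estimate of 0.50 with an approximate standard error of 0.02, 8 degrees of freedom and the tabulated 0.975 quantile of the t distribution for 8 degrees of freedom, 2.306. The function names are likewise assumptions of this example, and the critical value is passed in by hand rather than computed.

```python
def approx_confidence_interval(theta_hat_j, ase_j, t_crit):
    """Approximate interval theta_hat_j -/+ t_crit * ASE, as in Eq. (10.1.83)."""
    half_width = t_crit * ase_j
    return theta_hat_j - half_width, theta_hat_j + half_width

def approx_t_test(theta_hat_j, ase_j, c, t_crit):
    """Return (t0, reject) for H0: theta_j = c against H1: theta_j != c."""
    t0 = (theta_hat_j - c) / ase_j
    return t0, abs(t0) > t_crit

# Hypothetical values, for illustration only
theta_hat_j = 0.50   # estimate of theta_j
ase_j = 0.02         # approximate standard error of the estimate
t_crit = 2.306       # tabulated 0.975 quantile of t with 8 degrees of freedom

lower, upper = approx_confidence_interval(theta_hat_j, ase_j, t_crit)
t0, reject = approx_t_test(theta_hat_j, ase_j, c=0.45, t_crit=t_crit)
print(lower, upper, t0, reject)
```

The two results agree with each other, as expected: the hypothesized value 0.45 lies outside the approximate 95% interval, and the test rejects 𝐻₀ at the 5% level.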

10.1.2.2 Nonlinear Regression Diagnostics

Similar to the case of the linear regression, the assumptions underlying a nonlinear regression should also be checked for their validity. The assumptions in nonlinear regression models are listed below (Ritz & Streibig, 2008, p. 55):

1. The mean function is correct,

2. The errors are homoscedastic (i.e. have constant variance),

3. The errors are normally distributed,

4. The errors are not autocorrelated.

It can be inferred that the techniques mentioned previously for linear regression diagnostics can be applied similarly to nonlinear regression. See Chapter 5 (pp. 55-70) and Chapter 6 (pp. 73-91) of Ritz and Streibig for information on the corresponding diagnostic tests and on remedies for model violations in nonlinear regression models, respectively.



In document The development of the turkish power market with special respect to renewable power generation in Turkey (Page 136-143)