
3.3 Hydrodynamic WEC parameter estimation

3.3.2 Nonlinear optimization

For models that are nonlinear in the parameters, $\theta$, the linear regression techniques presented in Section 3.3.1 cannot be used for model parameter identification. The main idea of optimization is to minimise some LF, $J(\theta)$, which is a measure of the error between the model prediction and the recorded data. Given a model nonlinear in the parameters, $J(\theta)$ has a global minimum and may also have many local minima (see Fig. 3.26); in general, no analytical solution exists, so an iterative optimization algorithm is required. Details on nonlinear optimization can be found in [141] [205] [261] [262] [263].

[Figure; labels in the original: $\theta_1$, $\theta_2$, $J(\theta)$, $\hat{\theta}_{local}$, $\hat{\theta}_{global}$]

Figure 3.26: In the case of a model nonlinear in the parameters, the LF, $J(\theta)$, may have a global minimum and many local minima.
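To make the picture concrete, the following minimal sketch scans a one-parameter LF on a grid and lists its minima. The model $\hat{y} = \sin(\theta x)$, the noise-free data, the search interval and all names are illustrative assumptions, not taken from the text; the only point is that a model nonlinear in $\theta$ can yield an LF with several local minima.

```python
import math

# Hypothetical data: noise-free samples of y = sin(THETA_TRUE * x).
THETA_TRUE = 2.0
xs = [0.5 * k for k in range(20)]
ys = [math.sin(THETA_TRUE * x) for x in xs]


def loss(theta):
    """Sum-of-squares LF J(theta) for the assumed model y_hat = sin(theta * x)."""
    return sum((y - math.sin(theta * x)) ** 2 for x, y in zip(xs, ys))


# Evaluate J on a grid over the (assumed) search interval [0, 6].
grid = [0.01 * i for i in range(601)]
values = [loss(t) for t in grid]

# An interior grid point is a local minimum if it is lower than both neighbours.
local_minima = [
    grid[i]
    for i in range(1, len(grid) - 1)
    if values[i] < values[i - 1] and values[i] < values[i + 1]
]
theta_global = min(grid, key=loss)

print("local minima on the grid:", [round(t, 2) for t in local_minima])
print("global minimum on the grid:", round(theta_global, 2))
```

A derivative-based search started near one of the other listed minima would be expected to stop there rather than at the global minimum, which is the situation sketched in Fig. 3.26.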

3.3.2.1 Nonlinear optimization method classification

In the literature, there are many optimization algorithms, and no single algorithm is suitable for all problems [264]. Different criteria exist for classifying optimization methods, depending on the properties that are compared:

• Criterion 1: algorithms can be classified as line search or trust-region methods [265]:

◦ A line search method is an iterative algorithm of the form (see Fig. 3.27):

$$\theta_k = \theta_{k-1} + \delta_{k-1}\,\nu^{\nabla}_{k-1} \tag{3.88}$$

where the new parameter vector, $\theta_k$, is calculated from the previous one, $\theta_{k-1}$, by moving in the direction $\nu^{\nabla}_{k-1}$ with a step size (also denoted the learning rate [266]) $\delta_{k-1}$. At each iteration, the algorithm determines, according to a fixed rule, a direction of movement, and searches for a (relative) minimum of the LF along that line. Once the new point is found, a new direction is determined and the process is repeated. In short, a line search method first decides the search direction, $\nu^{\nabla}_{k-1}$, and then chooses an appropriate step length, $\delta_{k-1}$ [263].

◦ In a trust-region method, at step $k-1$, the LF $J(\theta)$ is approximated, in a trust region around the current point $\theta_{k-1}$, by a parametric function $J^{(a)}_{k-1}(\theta)$ [267]; a quadratic function is a very common choice. Then, inside the trust region, the minimizing algorithm applies the same strategy as a line search method to $J^{(a)}_{k-1}(\theta)$ (instead of to $J(\theta)$). The size of the trust region may change at each step, and the selected size is important to the effectiveness of each step [265] [268]. A trust-region method therefore first chooses a maximum search distance (the trust-region size), then calculates a direction $\nu^{\nabla}_{k-1}$, and finally a step size $\delta_{k-1}$, in order to obtain the best improvement inside the trust region.

• Criterion 2: algorithms can be classified as deterministic or stochastic methods [264]:

◦ A deterministic algorithm works in a mechanically deterministic way, without introducing any random decision; therefore, given the same initial point, the algorithm will reach the same final solution.

◦ A stochastic algorithm contains some randomness, and will therefore probably reach a different solution every time it is run, even if the problem to solve is the same [205] [264].

• Criterion 3: algorithms can be classified as trajectory-based or population-based methods:

◦ A trajectory-based algorithm calculates a single solution point at each step, tracing out a piecewise zig-zag path as the optimization process continues (see Fig. 3.27).

◦ A population-based algorithm calculates, at each iteration, multiple solutions, which interact with each other to generate a new set of solutions [264]. Examples of population-based algorithms are genetic algorithms (GAs) and particle swarm algorithms [261].
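As an illustration of a stochastic, population-based method, the sketch below implements a very small genetic algorithm. Everything in it is an assumption made for illustration: the test LF, the population size, tournament selection, blend crossover, Gaussian mutation and single-individual elitism are one possible set of choices, not the method of any cited reference.

```python
import random


def loss(theta):
    """Assumed test LF: local minimum near theta1 = +1, lower minimum near theta1 = -1."""
    t1, t2 = theta
    return (t1 * t1 - 1.0) ** 2 + 0.3 * t1 + t2 * t2


def genetic_minimise(seed, pop_size=30, generations=60, bound=2.0,
                     mutation_rate=0.2, mutation_scale=0.1):
    rng = random.Random(seed)  # all randomness comes from this seeded generator
    # Floating-point coding: each chromosome is the parameter vector itself.
    population = [[rng.uniform(-bound, bound), rng.uniform(-bound, bound)]
                  for _ in range(pop_size)]
    history = []  # best LF value found in each generation

    def tournament():
        # Fitness-based selection: the better of two random individuals wins.
        a, b = rng.choice(population), rng.choice(population)
        return a if loss(a) < loss(b) else b

    for _ in range(generations):
        best = min(population, key=loss)
        history.append(loss(best))
        new_population = [list(best)]  # elitism: the best individual survives unchanged
        while len(new_population) < pop_size:
            p1, p2 = tournament(), tournament()
            w = rng.random()  # crossover: random blend of the two parents
            child = [w * g1 + (1.0 - w) * g2 for g1, g2 in zip(p1, p2)]
            for i in range(len(child)):  # mutation: occasional Gaussian perturbation
                if rng.random() < mutation_rate:
                    child[i] += rng.gauss(0.0, mutation_scale)
            new_population.append(child)
        population = new_population

    best = min(population, key=loss)
    history.append(loss(best))
    return best, history


best_a, hist_a = genetic_minimise(seed=1)
best_b, hist_b = genetic_minimise(seed=1)  # same seed: identical run
best_c, hist_c = genetic_minimise(seed=2)  # different seed: usually a different result
print("seed 1:", best_a, hist_a[-1])
print("seed 2:", best_c, hist_c[-1])
```

Fixing the seed makes a run repeatable, which is convenient for debugging; without it, successive runs would generally return different solutions, as described under Criterion 2. Because of elitism, the best LF value recorded in `history` never increases from one generation to the next.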

[Figure; labels in the original: $\theta_0$, $\theta_1$, $\theta_2$, $\theta_{k-1}$, $\theta_k$]

Figure 3.27: Trajectory drawn by an iterative line search method, in the case of $\theta \in \mathbb{R}^2$ and of an LF with two minima. The algorithm starts from the initial point $\theta_0$. The closed lines represent the contour lines of the LF.
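The iteration (3.88) can be sketched in a few lines. The example below is a minimal, deterministic, trajectory-based line search under stated assumptions: the LF (with minima at $(\pm 1, 0)$), the steepest descent direction, and the backtracking (Armijo) rule used to choose the step size are all illustrative choices, and the function names are hypothetical.

```python
def loss(theta):
    """Assumed LF with two minima, at (+1, 0) and (-1, 0)."""
    t1, t2 = theta
    return (t1 * t1 - 1.0) ** 2 + t2 * t2


def gradient(theta):
    t1, t2 = theta
    return [4.0 * t1 * (t1 * t1 - 1.0), 2.0 * t2]


def line_search_descent(theta0, tol=1e-8, max_iter=200):
    theta = list(theta0)
    path = [tuple(theta)]  # the trajectory theta_0, theta_1, ...
    for _ in range(max_iter):
        g = gradient(theta)
        g_sq = sum(gi * gi for gi in g)
        if g_sq ** 0.5 < tol:
            break
        nu = [-gi for gi in g]  # search direction: steepest descent
        delta = 1.0             # initial trial step size
        # Backtracking: halve delta until the LF decreases sufficiently (Armijo rule).
        while delta > 1e-12:
            trial = [t + delta * n for t, n in zip(theta, nu)]
            if loss(trial) <= loss(theta) - 1e-4 * delta * g_sq:
                break
            delta *= 0.5
        theta = [t + delta * n for t, n in zip(theta, nu)]  # update of equation (3.88)
        path.append(tuple(theta))
    return theta, path


theta_a, path_a = line_search_descent([0.5, 1.0])
theta_b, path_b = line_search_descent([-0.5, 1.0])
print("start ( 0.5, 1.0) ->", theta_a, "in", len(path_a) - 1, "steps")
print("start (-0.5, 1.0) ->", theta_b, "in", len(path_b) - 1, "steps")
```

The two starting points end in different minima, and repeating a run from the same starting point reproduces the same trajectory, which is the behaviour expected of a deterministic method.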

3.3.2.2 Common nonlinear optimization algorithms

The iterative algorithm of equation (3.88), utilised for line search methods, provides a common foundation for a large number of optimization algorithms. Indeed, the search direction, $\nu^{\nabla}_{k-1}$, can be written as the LF gradient, $\nabla J(\theta_{k-1})$, rotated and scaled by some direction matrix, $R_{k-1}$ [205]:

$$\nu^{\nabla}_{k-1} = -R_{k-1}\,\nabla J(\theta_{k-1}) \tag{3.89}$$

In general, $\nabla J(\theta)$ always indicates the direction of steepest ascent at $\theta$, and $-\nabla J(\theta)$ the direction of steepest descent [266]. Different choices of $R_{k-1}$ lead to different optimization methods, as explained below:

• Newton’s method. In this case, $R_{k-1}$ is chosen as the inverse of the Hessian matrix of the LF, calculated at the point $\theta_{k-1}$ [141] [205]:

$$R_{k-1} = \left[\nabla^2 J(\theta_{k-1})\right]^{-1} \tag{3.90}$$

Compared to the steepest descent method [205], which corresponds to choosing $R_{k-1}$ as the identity matrix in (3.89), Newton’s method converges faster but requires second order derivatives which, if not available analytically, have to be computed using finite difference techniques. Therefore, if the Hessian matrix is not known analytically, Newton’s method becomes computationally expensive, even for medium sized problems. Another important drawback is the required inversion of the Hessian matrix, which strongly limits the size of the problem that can be studied [205].

• Quasi-Newton method. The calculation of the Hessian matrix and its inversion, necessary in Newton’s method, can be replaced by an appropriate approximation involving first order derivatives alone, which gives the family of quasi-Newton methods [141] [205]. The most common algorithms for calculating the approximation of the inverse of the Hessian are the DFP (Davidon, Fletcher, Powell) and the BFGS (Broyden, Fletcher, Goldfarb, Shanno) algorithms. For more details, see [141] [205] [265].


• Conjugate gradient method. In all quasi-Newton methods, the memory requirement and the computational complexity increase quadratically with the number of parameters. Therefore, for large problems, approximating the Hessian matrix is not convenient. As an alternative, given a $q$-dimensional parameter vector, $\theta$, conjugate gradient methods utilise $q$ different search directions $\nu^{\nabla}_i$, each one conjugate with the others, without a direct approximation of the Hessian matrix. Two directions $\nu^{\nabla}_i$ and $\nu^{\nabla}_j$ are conjugate, with respect to the symmetric positive definite matrix $A_c$, if $(\nu^{\nabla}_i)^T A_c\,\nu^{\nabla}_j = 0$ [265]. In conjugate gradient methods, the memory requirement and the computational complexity increase linearly with the number of parameters $q$ [205]. A drawback is the larger number of iterations required for convergence, compared to quasi-Newton methods. The conjugacy of the search directions tends to deteriorate as the algorithm runs; therefore, a typical solution is to restart the algorithm after every $q$ steps, by setting the search direction equal to the negative gradient direction [205]. For more details, see [265] [269] [262] [263].

• Genetic algorithms. Genetic algorithms are stochastic, population-based optimization techniques, based on Darwin’s theory of natural selection. The great success of natural evolution in the development of new species, which are able to adapt to changing environmental conditions, suggests an innovative approach to mathematical optimization problems [270]. In GAs, each possible solution to the optimization problem is represented by an individual (one point in the parameter space) belonging to a population (a set of possible solutions). During each successive generation, a GA selects a sub-population by implementing a fitness-based process, where fitter solutions (as measured by an LF) are typically more likely to be selected. The selected sub-population is combined to originate a new generation, such that the average ‘quality’ of the new population is improved. Each solution is coded in a string (the chromosome), by utilising an encoding technique (e.g. binary or floating point coding [271]), which represents the genetic information of the individual. A new individual is generated by operating on the strings of the selected ‘parent’ solutions. Three main genetic operations are utilised to create the new generation: crossover, mutation and elitism [205] [261] [272] [273].
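The gradient-based choices of the direction matrix in (3.89) can be compared on a small example. The sketch below is illustrative only: the convex two-parameter LF, the fixed step sizes, the starting point and the backtracking rule inside the BFGS routine are assumptions, and the helper names are hypothetical. It runs steepest descent ($R_{k-1}$ equal to the identity), Newton's method ($R_{k-1}$ equal to the inverse Hessian, as in (3.90)) and a BFGS quasi-Newton update of the inverse Hessian approximation.

```python
def loss(th):
    """Assumed convex test LF with its minimum at (0, 0)."""
    return 0.5 * (th[0] ** 2 + 25.0 * th[1] ** 2) + 0.25 * th[0] ** 4

def gradient(th):
    return [th[0] + th[0] ** 3, 25.0 * th[1]]

def hessian(th):
    return [[1.0 + 3.0 * th[0] ** 2, 0.0], [0.0, 25.0]]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def norm(v):
    return (v[0] ** 2 + v[1] ** 2) ** 0.5

def descend(theta0, direction_matrix, delta, tol=1e-8, max_iter=5000):
    """Iterate (3.88) with nu = -R * grad J, as in (3.89), and a fixed step size."""
    theta = list(theta0)
    for k in range(max_iter):
        g = gradient(theta)
        if norm(g) < tol:
            return theta, k
        nu = [-c for c in matvec(direction_matrix(theta), g)]
        theta = [theta[0] + delta * nu[0], theta[1] + delta * nu[1]]
    return theta, max_iter

def bfgs(theta0, tol=1e-8, max_iter=500):
    """Quasi-Newton iteration: R approximates the inverse Hessian from gradients only."""
    theta, g = list(theta0), gradient(theta0)
    R = [[1.0, 0.0], [0.0, 1.0]]  # initial approximation: identity
    for k in range(max_iter):
        if norm(g) < tol:
            return theta, k
        nu = [-c for c in matvec(R, g)]
        slope = g[0] * nu[0] + g[1] * nu[1]
        delta = 1.0
        while delta > 1e-12:  # backtracking until sufficient decrease
            trial = [theta[0] + delta * nu[0], theta[1] + delta * nu[1]]
            if loss(trial) <= loss(theta) + 1e-4 * delta * slope:
                break
            delta *= 0.5
        new_theta = [theta[0] + delta * nu[0], theta[1] + delta * nu[1]]
        new_g = gradient(new_theta)
        s = [new_theta[i] - theta[i] for i in range(2)]
        y = [new_g[i] - g[i] for i in range(2)]
        sy = s[0] * y[0] + s[1] * y[1]
        if sy > 1e-16:  # BFGS update of the inverse Hessian approximation
            Ry = matvec(R, y)
            c = (sy + y[0] * Ry[0] + y[1] * Ry[1]) / (sy * sy)
            R = [[R[i][j] + c * s[i] * s[j] - (Ry[i] * s[j] + s[i] * Ry[j]) / sy
                  for j in range(2)] for i in range(2)]
        theta, g = new_theta, new_g
    return theta, max_iter

start = [2.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
theta_sd, it_sd = descend(start, lambda th: identity, delta=0.07)
theta_nt, it_nt = descend(start, lambda th: inverse_2x2(hessian(th)), delta=1.0)
theta_qn, it_qn = bfgs(start)
print("steepest descent iterations:", it_sd)
print("Newton iterations:          ", it_nt)
print("BFGS iterations:            ", it_qn)
```

On this small problem Newton's method needs far fewer iterations than steepest descent, at the price of forming and inverting the Hessian at every step; the BFGS routine uses gradients only. Conjugate gradient and genetic algorithms are not covered by this sketch.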

3.3.2.3 ANN model identification

The identification of an MLP-ANN, introduced in Section 3.2.3.6, involves tuning the values of the weights and biases of the network, in order to optimize the network performance on the available input/output training data. Utilising the MSE, defined in (3.39), as the measure of the error, it is possible to write, from equations (3.33) and (3.35):

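$$J(\theta_{ann}) = \frac{1}{2}\sum_{k=1}^{N}\left[y(k) - \hat{y}(k,\theta_{ann})\right]^2 = \frac{1}{2}\sum_{k=1}^{N}\left[y(k) - \sum_{i=0}^{n_2} w^{(out)}_{i}\,\Psi_i\!\left(\sum_{j=0}^{n_1} w^{(2)}_{ij}\,\Psi_j\!\left(\sum_{l=0}^{n_v} w^{(1)}_{jl}\,v_l(k)\right)\right)\right]^2, \tag{3.91}$$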

where an MLP-ANN with two nonlinear hidden layers and one linear output layer is utilised. The unknown parameter vector is given by:

$$\theta_{ann} = \left[\, w^{(1)}_{11} \;\ldots\; w^{(1)}_{n_1 n_v} \;\; w^{(2)}_{11} \;\ldots\; w^{(2)}_{n_2 n_1} \;\; w^{(out)}_{1} \;\ldots\; w^{(out)}_{n_2} \,\right]^T. \tag{3.92}$$
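A minimal sketch of how the LF of equation (3.91) might be evaluated is given below. Several points are assumptions made for illustration: the activation $\Psi$ is taken as `tanh`, the index-0 term of each sum is interpreted as a bias (modelled by prepending a constant 1 to the input of every layer), and the function names, data and network sizes are hypothetical.

```python
import math


def mlp_output(v, w1, w2, w_out, psi=math.tanh):
    """Output of a two-hidden-layer MLP for one input vector v = [v_1, ..., v_nv].

    Assumption: index 0 of each sum in (3.91) is a bias term, modelled here by
    prepending a constant 1 to the input of every layer.
    Shapes: w1 is n1 x (nv + 1), w2 is n2 x (n1 + 1), w_out has n2 + 1 entries.
    """
    x0 = [1.0] + list(v)
    h1 = [1.0] + [psi(sum(w * x for w, x in zip(row, x0))) for row in w1]
    h2 = [1.0] + [psi(sum(w * h for w, h in zip(row, h1))) for row in w2]
    return sum(w * h for w, h in zip(w_out, h2))  # linear output layer


def ann_loss(inputs, targets, w1, w2, w_out):
    """LF of (3.91): half the sum of squared prediction errors over the data."""
    return 0.5 * sum((y - mlp_output(v, w1, w2, w_out)) ** 2
                     for v, y in zip(inputs, targets))


# Hypothetical data and network sizes.
inputs = [[0.0, 1.0], [1.0, -1.0], [0.5, 0.5]]
targets = [1.0, -2.0, 0.5]
n_v, n_1, n_2 = 2, 3, 2
w1 = [[0.0] * (n_v + 1) for _ in range(n_1)]
w2 = [[0.0] * (n_1 + 1) for _ in range(n_2)]
w_out = [0.0] * (n_2 + 1)

# With all-zero weights the prediction is 0, so J = 0.5 * (1 + 4 + 0.25).
print(ann_loss(inputs, targets, w1, w2, w_out))  # prints 2.625
```

Any of the iterative algorithms discussed in this section could then be applied to `ann_loss`, viewed as a function of the stacked weights.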

Equation (3.91) shows that $J(\theta_{ann})$ is not a quadratic function of $\theta_{ann}$; therefore, it is not possible to utilise the linear regression techniques introduced in Section 3.3.1. Historically, a common way to train an MLP-ANN is the back-propagation algorithm [238] [245] [266]. It is possible to show that the methodology applied by the back-propagation algorithm is just an equivalent way to calculate the search direction and the learning rate utilised by the recursive derivative-based techniques of (3.88) [205] [266]. In the context of MLP-ANN training, each single iterative line search step is usually denoted an epoch. Any of the recursive trajectory-based algorithms for nonlinear optimization based on (3.88), presented in Section 3.3.2.2, can be used for MLP-ANN training [244] [274] [275]. In particular, when the number of neurons increases, a good choice is the conjugate gradient algorithm, which is computationally efficient and has been shown to perform well for ANN models [244] [245] [274] [276].

An MLP-ANN can be retrained on the same data over successive epochs (batch training mode); in this way, the whole training data set is applied to the network before the weights and biases are updated. An alternative approach is to calculate new parameter values at each time step (incremental training mode); therefore, the utilised data are different at each epoch. For most problems, batch training converges faster than incremental training [244]. In addition, population-based algorithms, such as GAs, can be utilised for MLP-ANN identification [277].
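The difference between the two training modes lies only in when the parameters are updated, which the following sketch tries to show. It is a toy under stated assumptions: a single-weight model $\hat{y} = \tanh(w v)$ stands in for the network, the data are generated by that same model with a hypothetical `W_TRUE`, plain gradient steps with a fixed learning rate are used, and "pass" is used here simply to mean one sweep over the data set.

```python
import math

# Hypothetical training set generated by the one-weight model itself.
W_TRUE = 1.5
vs = [-2.0 + 0.25 * k for k in range(17)]
ys = [math.tanh(W_TRUE * v) for v in vs]


def loss(w):
    return 0.5 * sum((y - math.tanh(w * v)) ** 2 for v, y in zip(vs, ys))


def sample_gradient(w, v, y):
    """Derivative of 0.5 * (y - tanh(w * v))**2 with respect to w."""
    y_hat = math.tanh(w * v)
    return -(y - y_hat) * (1.0 - y_hat ** 2) * v


def train_batch(w, passes, delta=0.2):
    # Batch mode: the whole data set is used before each single update of w.
    for _ in range(passes):
        g = sum(sample_gradient(w, v, y) for v, y in zip(vs, ys)) / len(vs)
        w -= delta * g
    return w


def train_incremental(w, passes, delta=0.2):
    # Incremental mode: w is updated after every individual sample.
    for _ in range(passes):
        for v, y in zip(vs, ys):
            w -= delta * sample_gradient(w, v, y)
    return w


w0 = 0.0
w_batch = train_batch(w0, passes=200)
w_incremental = train_incremental(w0, passes=200)
print("initial loss:    ", loss(w0))
print("batch mode:       w =", w_batch, " loss =", loss(w_batch))
print("incremental mode: w =", w_incremental, " loss =", loss(w_incremental))
```

This toy says nothing general about the relative convergence speed of the two modes; it only makes the two update schedules explicit.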

Section 3.1.1.2 has already shown that, when the model complexity increases, the identified model becomes more flexible and able to show more sophisticated dynamical behaviour but, at the same time, unnecessarily high complexity can render the model less capable of generalising to new data. In Section 3.2.4, it has been explained that $n_a$, $n_b$ and $n_d$ of a DT nonlinear model can be estimated by implementing a systematic trial and error process on several ARX models. In the case of MLP-ANN models, the model complexity is given, in addition to $n_a$, $n_b$ and $n_d$, by $n_1$ and $n_2$, the numbers of neurons in the two hidden layers (in the case where the two hidden layer ANN structure, described by (3.33), is utilised). It is not straightforward to calculate the optimal values of $n_1$ and $n_2$, but it is known that the model performance is poor if the network is not complex enough and that, on the other hand, there is a risk of overfitting if the network is too complex [244].