Parallelization of the Learning Algorithm
Algorithm 9 - One iteration of the learning algorithm
    ... compute $O^{(q-1)}$ -- first part of Algorithm 2
    ... compute $\hat{u}$ (4.59); $w^{(q-1)} = \hat{u}$
    ... compute $o^{(q)}$ -- complete Algorithm 2
    $e := t - o^{(q)}$
    ... compute J -- Algorithm 1
    ... compute r (4.53) and update $\upsilon$ -- see Algorithm 6
    if r > 0
        $v_{old} := v$; $J_{old} := J$; $e_{old} := e$
    end
    ... compute p (4.48)
    $v_{old} := v_{old} + p$
    ... compute $e_p$ (4.50)
b) $\hat{u}$ - (4.59);
c) J - Algorithm 1;
d) p - (4.48);
e) $e_p$ - (4.50).
Among these blocks, the one with the highest complexity is undoubtedly the computation of the LM increment, p. A good overall efficiency for the parallel training algorithm can only be obtained if this block is efficiently parallelized. Since block (b) is similar to block (d), in the sense that both can be formulated as least-squares solutions of linear systems, a good algorithm (in terms of parallelization efficiency) for this type of problem should be sought.
As block (e) has the lowest complexity and is also easily parallelized, our efforts will be concentrated on the parallelization of the recall operation, the Jacobian and, in particular, the least-squares solution of systems of linear equations. For each of these blocks, the possibilities for parallelization will be discussed and a parallel algorithm will then be chosen. Finally, these algorithms will be integrated and, together with the parallelized versions of the remaining steps of Algorithm 9, a parallel version of the complete algorithm will be obtained.
Because of its importance for the overall efficiency of the learning algorithm, the parallelization of the least squares solution of overdetermined systems of linear equations will be discussed first.
5.4 Least squares solution of overdetermined systems
This is a very important problem, which appears in a broad range of disciplines (for instance, control systems, optimization, statistics and signal processing). The basic problem is to derive a vector x such that:

$$ Ax = b \tag{5.3} $$

where the dimensions of A are $m \times n$, with $m \ge n$. Usually this system has no solution. A commonly used alternative is the minimization of:

$$ \rho = \|Ax - b\|_p \tag{5.4} $$

for a suitable norm p.
As the use of the 2-norm makes (5.4) a continuously differentiable function of x, and therefore makes the problem easier to solve than employing the 1-norm or the $\infty$-norm, the usual least-squares formulation is the minimization of (5.5).

$$ \rho = \|Ax - b\|_2^2 \tag{5.5} $$
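The benefit of differentiability can be made concrete by writing out the stationarity condition of (5.5). The following is a standard derivation, given here only as a sketch; it uses nothing beyond the definitions of A, x and b above.

```latex
\begin{aligned}
\rho(x) &= (Ax-b)^T(Ax-b) = x^TA^TAx - 2\,b^TAx + b^Tb, \\
\nabla_x\,\rho(x) &= 2A^TAx - 2A^Tb = 0
\quad\Longrightarrow\quad A^TAx = A^Tb .
\end{aligned}
```

The minimizer is therefore characterized by a linear system in x. The 1-norm and $\infty$-norm objectives are not differentiable everywhere, so no comparable linear condition is available for them.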
In Chapter 4 it has already been mentioned that the use of a pseudo-inverse allows us to express the solution of this last equation in a compact form:

$$ x = A^{+}b \tag{5.6} $$

and, when A is full-column rank, x can be obtained as:

$$ x = (A^TA)^{-1}A^Tb \tag{5.7} $$

Eq. (4.59) is already in the form of (5.6). The computation of p can also be expressed in this form if J and e in (4.48) are replaced by J' and e', where

$$ J' = \begin{bmatrix} J \\ \sqrt{\upsilon}\,I \end{bmatrix} \tag{5.8} $$

$$ e' = \begin{bmatrix} e \\ 0 \end{bmatrix} \tag{5.9} $$

Using these modified equations, the least-squares solution of

$$ J'p = -e' \tag{5.10} $$

is, using (5.6) to (5.9):

$$ p = -{J'}^{+}e' = -({J'}^{T}J')^{-1}{J'}^{T}e' = -(J^TJ+\upsilon I)^{-1}\begin{bmatrix} J^T & \sqrt{\upsilon}\,I \end{bmatrix}\begin{bmatrix} e \\ 0 \end{bmatrix} = -(J^TJ+\upsilon I)^{-1}J^Te \tag{5.11} $$

Eq. (5.5) may be solved by means of (5.7) but, in fact, this is not the recommended solution for the current problem. It was already mentioned that $O^{(q-1)}$ and J are ill-conditioned matrices. If (5.7) is used, this ill-conditioning is aggravated, since $\kappa(A^TA) = (\kappa(A))^2$ [135], and an unnecessary magnification of round-off errors occurs, which can even lead to numerical singularity of $A^TA$.
Fortunately there are methods that can be used to solve (5.5) that do not exacerbate the ill-conditioning of the problem. Two factorizations of the matrix A can be used: QR and singular value decomposition (SVD). The latter has an advantage when A is rank deficient. In this case, there is an infinite number of solutions for (5.5) and SVD is able to compute the one with the smallest norm, in contrast to QR. On the other hand, QR factorization requires fewer computations than SVD.
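The equivalence stated in (5.11) and the conditioning argument can both be checked numerically. The following is a minimal sketch, assuming NumPy is available; the matrix sizes, the random data, the value chosen for υ and all variable names are hypothetical and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and data, chosen only for illustration.
m, n = 50, 4
J = rng.standard_normal((m, n))
e = rng.standard_normal(m)
upsilon = 0.1  # assumed value of the regularization factor

# Augmented system of (5.8) and (5.9): J' = [J; sqrt(upsilon) I], e' = [e; 0].
J_aug = np.vstack([J, np.sqrt(upsilon) * np.eye(n)])
e_aug = np.concatenate([e, np.zeros(n)])

# Least-squares solution of J' p = -e' (5.10), without forming J'^T J'.
p_ls, *_ = np.linalg.lstsq(J_aug, -e_aug, rcond=None)

# Last expression of (5.11), via the regularized normal equations.
p_ne = -np.linalg.solve(J.T @ J + upsilon * np.eye(n), J.T @ e)

# Conditioning: scale the columns of a random matrix so that it is
# moderately ill-conditioned, then compare kappa(A)^2 with kappa(A^T A).
A = rng.standard_normal((m, n)) @ np.diag([1.0, 1e-1, 1e-2, 1e-3])
kappa_A = np.linalg.cond(A)
kappa_AtA = np.linalg.cond(A.T @ A)

print("p agrees:", np.allclose(p_ls, p_ne))
print("kappa(A)^2 =", kappa_A**2, " kappa(A^T A) =", kappa_AtA)
```

In this well-conditioned example the two routes to p agree to rounding error. The second check shows why forming $A^TA$ is avoided: the condition number that governs the amplification of round-off errors is squared.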
As it is assumed that $[\,O^{(q-1)}\ \ \mathbf{1}\,]$ is always full-column rank, and the computation of p is guarded against the loss of full rank, QR decomposition will be used for the solution of least-squares problems.
As its name indicates, this factorization decomposes a matrix A into an orthonormal matrix (Q) and an upper triangular matrix ($R_1$), such that:

$$ A = QR = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix}\begin{bmatrix} R_1 \\ 0 \end{bmatrix} \tag{5.12} $$

where Q is $m \times m$, R is $m \times n$, $Q_1$ is $m \times n$, $Q_2$ is $m \times (m-n)$, $R_1$ is $n \times n$ and 0 is $(m-n) \times n$.
To solve the least-squares problem, (5.12) can be substituted into (5.5) and, taking advantage of the property of orthonormal matrices:

$$ Q^TQ = I \tag{5.13} $$

eq. (5.5) can then be transformed into:

$$ \rho = \|R_1x - Q_1^Tb\|_2^2 + \|Q_2^Tb\|_2^2 \tag{5.14} $$

This last equation is minimized when the first term on the right-hand side vanishes, so the vector x can be given as:

$$ x = R_1^{-1}y \tag{5.15} $$

where

$$ y = Q_1^Tb \tag{5.16} $$

Note that $Q_1$ does not actually have to be computed; only y must be obtained. Usually QR methods for solving least-squares systems divide their operation into two phases: a triangularization phase, where $R_1$ and y are computed, and a solution phase, where the triangular system is solved, usually by back-substitution. For the first phase, there are, however, several algorithms to choose from. These are covered in the following section.
5.4.1 Sequential QR algorithms
There are several algorithms that can be used to compute the triangularization phase, but it is usually recognized [135] that the Householder method, the fast Givens method and the modified Gram-Schmidt method are the most important algorithms for performing this task.
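Whichever of these algorithms performs the triangularization, the two phases have the same shape. The following is a minimal sketch, assuming NumPy is available. The sizes and random data are hypothetical, `back_substitution` is an illustrative helper, and the library routine `numpy.linalg.qr` simply stands in for the triangularization phase. Q is formed explicitly only to make (5.12) to (5.16) visible, which, as noted above, is not necessary in practice.

```python
import numpy as np

def back_substitution(R1, y):
    """Solve the upper triangular system R1 x = y (solution phase)."""
    n = R1.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # Subtract the contribution of the already computed unknowns.
        x[i] = (y[i] - R1[i, i + 1:] @ x[i + 1:]) / R1[i, i]
    return x

rng = np.random.default_rng(1)
m, n = 20, 3                      # hypothetical sizes, m >= n
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Triangularization phase: full factorization A = QR, as in (5.12).
Q, R = np.linalg.qr(A, mode="complete")   # Q is m x m, R is m x n
Q1, Q2, R1 = Q[:, :n], Q[:, n:], R[:n, :]
y = Q1.T @ b                               # (5.16)

# Solution phase: x = R1^{-1} y, (5.15), by back-substitution.
x = back_substitution(R1, y)

# At the minimum, (5.14) reduces to its second term, ||Q2^T b||_2^2.
rho_min = np.sum((A @ x - b) ** 2)
print(rho_min, np.sum((Q2.T @ b) ** 2))
```

The two printed values should coincide up to rounding, which is the statement that the first term of (5.14) vanishes at the solution.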
As the three methods have similarly good numerical properties, it was decided to parallelize the one that involves the smallest sequential computational cost. If a good parallel efficiency can be obtained the problem is then solved; if not, another candidate method must be parallelized.
In terms of sequential computational cost, the modified Gram-Schmidt method is more expensive than the Householder algorithm [135]. Fast Givens approaches were developed as a rearrangement of the Givens method, so that they could be performed with 'Householder speed'. Although a straightforward fast Givens method has the same computational complexity as the Householder method, it is found in practice that monitoring for overflow of terms makes fast Givens methods slower, and more complicated to implement, than the Householder approach [135].
Based on these considerations, the Householder scheme was the algorithm chosen to parallelize.
This algorithm is based on the repeated use of elementary reflectors or Householder matrices:

$$ H = I - hh^T \tag{5.17} $$

The crucial point in favour of elementary reflectors is that they can be used to introduce zeros into a vector. Given a vector x, an elementary reflector can be found such that:

$$ Hx = -\sigma e_1 \tag{5.18} $$

where $e_1$ denotes a vector of zeros, with the exception of the first element, which is unity.
It can be shown [172] that the following algorithm computes both $\sigma$ and the Householder vector (h) that satisfies (5.18):

Algorithm 10 - Computation of $\sigma$ and h
    $\zeta := \|x\|_2$
    $\sigma := \mathrm{sgn}(x_1)\,\zeta$
    $\mu := \zeta + |x_1|$
    $x_1 := \mathrm{sgn}(x_1)\,\mu$
    $\pi := \zeta\mu$
    $h := x/\sqrt{\pi}$

The above-mentioned property of elementary reflectors can be used to triangularize a rectangular matrix by successive applications of Householder matrices.
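Before the triangularization is developed, Algorithm 10 and property (5.18) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `householder_vector`, the test vector and the convention sgn(0) = +1 are assumptions made for this illustration and are not taken from the text.

```python
import numpy as np

def householder_vector(x):
    """Return (sigma, h) following the steps of Algorithm 10.

    Assumes x is a non-zero 1-D array. sgn(0) is taken as +1 here,
    which is an assumption of this sketch.
    """
    x = np.array(x, dtype=float)          # work on a copy of x
    zeta = np.linalg.norm(x)              # zeta := ||x||_2
    sign = 1.0 if x[0] >= 0 else -1.0     # sgn(x1)
    sigma = sign * zeta                   # sigma := sgn(x1) zeta
    mu = zeta + abs(x[0])                 # mu := zeta + |x1|
    x[0] = sign * mu                      # x1 := sgn(x1) mu
    pi = zeta * mu                        # pi := zeta mu
    h = x / np.sqrt(pi)                   # h := x / sqrt(pi)
    return sigma, h

x = np.array([3.0, 1.0, 5.0, 1.0])        # ||x||_2 = 6
sigma, h = householder_vector(x)
H = np.eye(len(x)) - np.outer(h, h)       # (5.17), formed only for the check
Hx = H @ x
print(sigma, Hx)
```

For this vector, $\sigma = 6$ and Hx equals $(-6, 0, 0, 0)$ up to rounding, as required by (5.18). The scaling by $\sqrt{\pi}$ gives $h^Th = 2$, which is what makes $I - hh^T$ orthogonal.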
Applying one Householder matrix for each column gives:

$$ H^{[n]}H^{[n-1]}\cdots H^{[2]}H^{[1]}A = A^{[n+1]} = \begin{bmatrix} R_1 \\ 0 \end{bmatrix} \tag{5.19} $$

where A is assumed to have n columns. To see this, let us suppose that, at the end of iteration k, the matrix $A^{[k]}$ has the following form:

$$ A^{[k]} = \begin{bmatrix} R & r & B \\ 0 & x & C \end{bmatrix} \tag{5.20} $$

where R is an upper triangular matrix of size $k \times k$. Denoting $H^{[k+1]}$ as:

$$ H^{[k+1]} = \begin{bmatrix} I & 0 \\ 0 & H \end{bmatrix} \tag{5.21} $$

where H satisfies (5.18), it is easy to see that:

$$ A^{[k+1]} = H^{[k+1]}A^{[k]} = \begin{bmatrix} R & r & B \\ 0 & -\sigma e_1 & HC \end{bmatrix} \tag{5.22} $$

so that, at each iteration, another column of A is put in a triangular form.
Comparing (5.12) with (5.19) it is clear that the orthogonal matrix Q can be obtained from the Householder matrices by:

$$ Q = \left(H^{[n]}H^{[n-1]}\cdots H^{[2]}H^{[1]}\right)^T \tag{5.23} $$

When the Householder technique is employed for solving least-squares problems (5.5), there is no need to actually compute the matrices $H^{[k]}$. The following algorithm can be used instead: