Case-study: Adaptive QR Matrix Decomposition


4.5.1 Case-study: Adaptive QR Matrix Decomposition

The objective of this case is to model embedded-system mappings that can be described as follows: "Create accurate one-to-one mappings of Kahn PNs onto a multiprocessor with compile-time pipelining of symbolic instructions, where the application process networks are pre-created and cannot be changed."

The restriction on changing application process networks implies that the mapping transformations can happen only at the level of symbolic programs (i.e., we must apply the Transforming Step). Thus, in this sub-section we give (1) a description of the mapping case we conducted and (2) experimental results to support our claims about the accuracy, efficiency, and exploration power of the mapping approach presented in this thesis.

The case is based on an algorithm commonly used to solve an over-specified set of linear equations in a least-squares sense. This algorithm is known as adaptive QR matrix decomposition [79]. In signal processing practice, it is used for calculating weights in an adaptive beam-forming system [80]. We performed system-level exploration of different mappings of the QR algorithm onto an FPGA platform as described in [52]. To understand this case, a specification of the algorithm is needed, e.g., in the form of a sequential algorithm in Matlab; see Figure 4.2. The r(m, n) are entries of an upper triangular matrix R of size N×N that is updated at each k-step. The x(k, p) are entries of a vector of size N taken from a source consisting of N sensing devices, called antenna data in the remainder of this section. θ(p) is a vector of size N that represents the orthogonal matrix Q of size N×N in the decomposition X = QR, where X is the stack of all vectors of size N collecting the x(k, p) entries. For the sake of simplicity we have assumed that X, Q, and R are real-valued.

    for k=1:1:K,
      for j=1:1:N,
        [r(j,j),x(k,j),θ(j)]=Vectorize(r(j,j),x(k,j));
        for i=j+1:1:N,
          [r(j,i),x(k,i),θ(j)]=Rotate(r(j,i),x(k,i),θ(j));
        end
      end
    end

Figure 4.2: A QR matrix decomposition Matlab code sample.
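The loop nest of Figure 4.2 can be made executable with a small sketch. The source does not define Vectorize and Rotate, so the version below assumes the usual Givens-rotation reading: vectorize computes the angle that zeroes x against r, and rotate applies that angle to another (r, x) pair. The function names, the zero initialization of R, and the use of 0-based indices are assumptions made for illustration only.

```python
import math


def vectorize(r, x):
    """Assumed semantics: find the Givens angle that rotates (r, x) onto (hypot, 0)."""
    theta = math.atan2(x, r)
    return math.hypot(r, x), 0.0, theta


def rotate(r, x, theta):
    """Assumed semantics: apply the Givens rotation with angle theta to (r, x)."""
    c, s = math.cos(theta), math.sin(theta)
    return c * r + s * x, -s * r + c * x, theta


def qr_update(rows, n):
    """Run the loop nest of Figure 4.2 over all input vectors.

    rows: K input vectors of length n (the x(k, p) entries).
    Returns the n-by-n upper triangular matrix R, starting from R = 0
    (the initial value of R is an assumption of this sketch).
    """
    r = [[0.0] * n for _ in range(n)]
    for row in rows:                      # for k = 1:K
        x = list(row)
        for j in range(n):                # for j = 1:N
            r[j][j], x[j], theta = vectorize(r[j][j], x[j])
            for i in range(j + 1, n):     # for i = j+1:N
                r[j][i], x[i], theta = rotate(r[j][i], x[i], theta)
    return r


def gram(m, n):
    """Return M^T M for a matrix M given as a list of rows of length n."""
    return [[sum(row[a] * row[b] for row in m) for b in range(n)] for a in range(n)]
```

Because every step is an orthogonal rotation, the result should satisfy R^T R = X^T X, which is a convenient sanity check for a real-valued X; for example, `gram(qr_update(X, n), n)` can be compared against `gram(X, n)`.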

Description of The Case

We modeled three different mappings of the adaptive QR algorithm onto an FPGA platform, each based on a different process network: four communicating processes for the first mapping (Figure 4.3, part 1), eight for the second (Figure 4.3, part 2), and twelve for the third (Figure 4.3, part 3). All the networks were derived automatically from the sequential algorithm in Figure 4.2, using the COMPAAN tool-set [53].

We represented the networks using symbolic programs and control traces. We modeled the FPGA platform using components from the repository of the architecture model components depicted in Figure 3.1. In the experiments, we use the following architecture plus mapping specifications (for each mapping there is one architecture plus mapping specification):

1. Binding: The number of processor components in the architecture is equal to the number of processes in the QR process network. In other words, each application process is mapped onto a single processing unit in a one-to-one fashion (see one-to-one in Chapter 3, Section 3.4.2).

2. Binding: The number of FIFO components in the architecture is equal to the number of channels in the QR process network. Each application FIFO channel is mapped onto a single FIFO component in a one-to-one fashion.

3. Binding: There is no resource sharing, neither for computation (an operating system is not needed since there are no different threads on any processing unit) nor for communication (a bus is not needed since all buffers are dedicated).

4. Transforming: The number of simultaneous read and write operations in the architecture is explicitly shown in the symbolic programs (see the example in Chapter 2, Figure 2.13).

5. Matching: Each operation (read, write, execute) takes a single processing unit cycle when executed in the architecture.

6. Matching: From the architecture network point of view, read and write operations cause additional delays: a cycle for switching and a cycle for a FIFO buffer access. The FIFO buffer access cycle appears only when blocking on the FIFO takes place.

7. Matching: FIFO buffers in the architecture are sized so as to provide enough space (in this case study, the FIFO buffer sizes are always 256 tokens for the three mappings).

Figure 4.3: The three application QR process networks: part 1 (nodes ND1 to ND4), part 2 (nodes ND1 to ND8), and part 3 (nodes ND1 to ND12), with nodes labeled Antenna Data, Parameter Data, Vectorize, and Rotate. (The derivation of these process networks is not the subject of this thesis; for more information, please refer to [59].)

Based on the above mapping specification, nine simulation programs for the nine QR-on-FPGA mappings have been synthesized: (1) there are three application process networks, as shown in Figure 4.3; (2) there are two different SP representations for each application network: the first contains totally ordered symbolic instructions, and the second contains partially ordered symbolic instructions; (3) there are three different architecture and mapping specifications for each application-architecture mapping candidate. The first specifies a multiprocessor that cannot simultaneously execute multiple symbolic instructions of the same type (read, execute, write). The second specifies a multiprocessor that can simultaneously execute multiple symbolic instructions of the same type. The third specifies a multiprocessor that can both simultaneously execute multiple symbolic instructions of the same type and pipeline communications to FIFO components. These different mapping specifications are the result of the Calibration: we needed three iterations to determine the correct matching and transforming parameters before the synthesized simulation programs provided us with a relative error of about +1.5%. We describe these results below.
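As a rough illustration of how these specifications combine, the following sketch enumerates the nine mappings and encodes the per-operation cycle costs of the Matching items. It is one possible reading of the text, not the thesis tooling: the class name, field names, and function names are hypothetical, and the pairing of SP ordering with architecture capability follows the three SP-TLM cases discussed in the Results.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MappingCase:
    """Hypothetical record for one architecture plus mapping specification."""
    label: str
    partially_ordered_sp: bool   # False means totally ordered symbolic instructions
    parallel_same_type: bool     # can execute several instructions of one type at once
    pipelined_fifo: bool         # can pipeline communications to FIFO components


# The three cases as described in the text (labels taken from the Results).
CASES = [
    MappingCase("SP-TLM-1", False, False, False),
    MappingCase("SP-TLM-2", True, True, False),
    MappingCase("SP-TLM-3", True, True, True),
]

NETWORK_SIZES = [4, 8, 12]   # processes per QR process network (Figure 4.3)
FIFO_TOKENS = 256            # FIFO buffer size used for all mappings


def op_cycles(op, blocked=False):
    """Cycle cost of one operation under the Matching items (assumed reading).

    Every operation costs one processing-unit cycle. A read or write adds one
    switching cycle, plus one FIFO access cycle only if it blocks on the FIFO.
    """
    if op not in ("read", "write", "execute"):
        raise ValueError(f"unknown operation: {op}")
    cycles = 1
    if op in ("read", "write"):
        cycles += 1              # switching
        if blocked:
            cycles += 1          # FIFO buffer access when blocking occurs
    return cycles


def enumerate_mappings():
    """Return all (network size, mapping case) pairs: 3 networks x 3 cases."""
    return list(product(NETWORK_SIZES, CASES))
```

With one-to-one binding, a mapping of a network with n processes uses n processor components, so `enumerate_mappings()` yields the nine simulated configurations.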

Results

Figure 4.4: The simulation results of the adaptive QR matrix decomposition case-study: cycle count versus number of processors for the mapping cases "SP-TLM-1", "SP-TLM-2", and "SP-TLM-3".

Number of processors   FPGA cycle count   TLM cycle count   Relative error
4                      29281              29458             0.6%
8                      9771               9884              1.2%
12                     6111               6202              1.5%

QR onto the FPGA   QR with the TLM model
10 hours           10 seconds

Table 4.2: Required mapping & simulation times: QR on FPGA vs. QR on TLM. We consider mapping equal to compilation, and we consider simulation equal to execution. The necessary preparations and adaptations of the application and architecture models have not been taken into account.

We ran the executables of our nine mappings; the results are shown in Figure 4.4. There are three SP-TLM labels, which refer to three gradually differing mapping cases: "SP-TLM-1" refers to mappings of totally ordered SPs onto multiprocessors which cannot execute simultaneously multiple symbolic instructions of the same type; "SP-TLM-2" refers to mappings of partially ordered SPs onto multiprocessors which can execute simultaneously multiple symbolic instructions of the same type; and "SP-TLM-3" refers to mappings of partially ordered SPs onto multiprocessors which can both execute simultaneously multiple symbolic instructions of the same type and pipeline communications to FIFO components. To quantify the results, we compare the mapping case "SP-TLM-3" with actual FPGA mappings of the adaptive QR matrix decomposition in the cycle-count table above and in Table 4.2. The cycle-count table shows the number of cycles needed to complete the executions of the different QR networks on the FPGA platform [52] vs. the number of cycles needed to complete the executions of the same QR networks on the TLM-based model of this platform. As can be seen, the TLM architecture model is able to predict the performance of the real FPGA platform executing the adaptive QR algorithm with a relative error of about +1.5% (0.6%, 1.2%, and 1.5% for 4, 8, and 12 processors, respectively). For the case in hand, a larger error would have revealed a major flaw in the method. Table 4.2 shows that mapping and simulation take about 10 seconds with the TLM model, compared with about 10 hours for the FPGA.
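The reported figures can be rechecked with a few lines of arithmetic. The sketch below assumes the relative error is defined as (TLM cycles − FPGA cycles) / FPGA cycles, which reproduces the percentages in the cycle-count table, and it derives the ratio between the two times of Table 4.2. The function and variable names are illustrative only.

```python
# (FPGA cycle count, TLM cycle count) per number of processors, from the table.
RESULTS = {4: (29281, 29458), 8: (9771, 9884), 12: (6111, 6202)}


def relative_error_percent(fpga_cycles, tlm_cycles):
    """Signed relative error of the TLM prediction with respect to the FPGA, in percent."""
    return 100.0 * (tlm_cycles - fpga_cycles) / fpga_cycles


def time_ratio(fpga_seconds, tlm_seconds):
    """How many times faster the TLM flow is than the FPGA flow."""
    return fpga_seconds / tlm_seconds


if __name__ == "__main__":
    for processors, (fpga, tlm) in RESULTS.items():
        print(processors, round(relative_error_percent(fpga, tlm), 1))
    # Table 4.2: 10 hours on the FPGA flow vs. 10 seconds with the TLM model.
    print(time_ratio(10 * 3600, 10))
```

All three errors are positive, i.e., the TLM model slightly overestimates the cycle count, and the times in Table 4.2 differ by a factor of 3600.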

4.5.2 Case-study: Mapping 2D-IDCT Specification to IP-primitives
