9.3 Linear Solver

9.3.3 Joining sparse and non-sparse vectors into single vector operations

[Figure: design-space overview. Context labels: FPGA; Volumetric Graph-Based SLAM; Conjugate gradient; Combination of sparse and non-sparse. Current node: Joining sparse and non-sparse vectors into single vector operations. Options: memory duplicates for parallel lookup; sequential lookup (storing singular values); large NSV indexed by multiplexer.]

The linear solver contains sparse vectors with a fixed size to represent the information matrix and ordinary vectors to represent everything else. The conjugate gradient algorithm contains only one operation that involves both sparse and non-sparse vectors: the matrix-vector multiplication. This operation is present in the initialization phase of the algorithm ($A\vec{x}$) and in the iterative part of the algorithm ($A\vec{p}$). Effectively, the matrix-vector multiplication consists of as many dot products as the system size. A dot product between ordinary vectors is trivial, because it is just a pairwise multiplication of two vectors followed by the sum of the resulting vector.
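As a minimal sketch of the statement above, the matrix-vector multiplication can be written as one dot product per matrix row. The function names `dot` and `mat_vec` and the example matrix are illustrative assumptions and do not come from the implementation described in this chapter; ordinary Python lists stand in for the hardware vectors.

```python
def dot(u, v):
    """Dot product of two ordinary (dense) vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    # Pairwise multiplication, then the sum of the resulting vector.
    return sum(a * b for a, b in zip(u, v))


def mat_vec(rows, x):
    """Matrix-vector product: one dot product per row of the matrix."""
    return [dot(row, x) for row in rows]


# Hypothetical 3x3 system, chosen only for illustration.
A = [[2, 0, 1],
     [0, 3, 0],
     [1, 0, 4]]
x = [1, 2, 3]
print(mat_vec(A, x))  # [5, 6, 13]
```

For a system of size three, three dot products are computed, one per row.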

9.3.3.1 Coupling sparse vectors

When the dot product is performed on two sparse vectors, the operation is functionally the same. However, if two sparse vectors are multiplied, it is not possible to do the multiplications without first finding the corresponding indices in the vectors. An example of a dot product of two sparse vectors is shown in equation 9.1, where each item is written as an (index, value) pair. Only the non-zero parts that have corresponding indices will be used in the final answer. Therefore, the dot product can be rewritten into vectors that only contain the items with matching indices.

      (0,5) (2,5) (4,5) (6,5) (8,5)       •       (0,5) (1,5) (2,5) (3,5) (4,5)       '   (0,5) (2,5) (4,5)  •   (0,5) (2,5) (4,5)   (9.1)

The introduction of sparse matrices and doing calculations with them results in fewer computations, but it creates a certain amount of overhead that needs to be taken into account before working with them. In this case, the overhead is the coupling of the indices of the sparse matrices. In the particular case of a dot product, there is an advantage. Because a dot product multiplies the items that share an index, every index of one sparse vector (whose item is always non-zero) that is missing in the other sparse vector yields a zero product and does not alter the final outcome of the dot product. Therefore, it is only necessary to look up the indices of one vector in the other vector and multiply the matching items in order to find the final solution of the dot product.

Figure 9.5 shows an abstract structure of the sparse vector coupling. Each index of $\vec{v}_1$ is looked up in the indices of $\vec{v}_2$ in the green block. If the index is present in $\vec{v}_2$, the corresponding value is returned; otherwise a zero is produced. Producing a zero means that the multiplication is still done, but the resulting value is always zero, instead of the multiplication being skipped as in equation 9.1. In a static hardware structure it is not beneficial to skip the multiplication, because the possibility of doing the multiplication means that the hardware is already available. The figure shows only one lookup block; in the fully parallel solution, each index goes to its own lookup element.

Figure 9.5: Value lookup of sparse vector index to perform a dot product
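The lookup described in this section can be sketched in software as follows. This is a behavioural illustration only, not the hardware design: the sparse vectors are modelled as lists of (index, value) pairs, and the names `lookup` and `sparse_dot` are assumptions introduced for the example. The input values are those of equation 9.1.

```python
def lookup(index, sparse_vec):
    """Return the value stored at `index` in a sparse vector, or 0 if absent.

    A sparse vector is modelled here as a list of (index, value) pairs.
    """
    for i, value in sparse_vec:
        if i == index:
            return value
    return 0


def sparse_dot(v1, v2):
    """Dot product of two sparse vectors, driven by the indices of v1 only."""
    # Every item of v1 is multiplied; a missing partner in v2 contributes 0.
    return sum(value * lookup(index, v2) for index, value in v1)


v1 = [(0, 5), (2, 5), (4, 5), (6, 5), (8, 5)]
v2 = [(0, 5), (1, 5), (2, 5), (3, 5), (4, 5)]
print(sparse_dot(v1, v2))  # 75
```

Only the indices 0, 2 and 4 occur in both vectors, so three products of 5 times 5 contribute to the result. The indices 6 and 8 are still multiplied, but by the zero that the lookup produces.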

9.3.3.2 Coupling a sparse vector to a non-sparse vector

The previous section discussed the coupling of two sparse vectors. In the conjugate gradient algorithm, the dot products are performed between a sparse vector and a non-sparse vector. The problem with these different vectors is that their items are stored differently. The non-sparse vector is a plain vector with as many items as the system size. Effectively, each index of the sparse vector needs to be looked up in the non-sparse vector. This lookup can be done with large multiplexers, but large multiplexers use more area on the FPGA. Another method is to use memories, in which it is easier to find the value without large multiplexers. However, memories are sequential: an address is provided, and that address is read or written in the next clock cycle. A fully parallel implementation does not span multiple clock cycles, since the complete implementation is a single combinatorial path.

If synchronous memories are used, only one block can be fetched from a memory at a time, where each block can contain one or multiple items. If a memory contains only one item per block, the address can be provided at the read-address input of the memory and the correct value will be available at the read-data output in the next clock cycle. If the block contains multiple items, the required element needs to be extracted from its block with a multiplexer the size of the block.
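A block-based read can be sketched as a block fetch followed by a selection within the block. The mapping from item index to block address and offset by integer division is an assumption made for this illustration, since the text does not specify it. The names `to_blocks` and `read_item` are likewise hypothetical.

```python
def to_blocks(vector, block_size):
    """Split a dense vector into memory blocks of `block_size` items."""
    if len(vector) % block_size != 0:
        raise ValueError("vector length must be a multiple of block_size")
    return [vector[i:i + block_size] for i in range(0, len(vector), block_size)]


def read_item(blocks, index, block_size):
    """Fetch one item: read the block at its address, then select within it."""
    # Assumed mapping: address = index // block_size, offset = index % block_size.
    address, offset = divmod(index, block_size)
    block = blocks[address]   # models one synchronous memory read (one block)
    return block[offset]      # models the multiplexer with the size of the block


dense = [10, 11, 12, 13, 14, 15, 16, 17]
blocks = to_blocks(dense, block_size=4)
print(blocks)                   # [[10, 11, 12, 13], [14, 15, 16, 17]]
print(read_item(blocks, 6, 4))  # 16
```

With one item per block (`block_size=1`), the offset is always zero and no selection within the block is needed.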

The disadvantage of using memories for the storage and lookup of vectors is that they are synchronous, which means they depend on a clock input and present only one block of data at a time. If the sparse vector has a fixed number of items that need to be coupled to values from the non-sparse vector, and these values are all in different blocks, each block needs to be fetched and the correct data extracted and stored before the next block can be fetched in the next clock cycle. As a result, the complete sparse-matrix non-sparse-vector multiplication, for example, will take five times as many clock cycles.

Instead of extracting the data with a large multiplexer or fetching it from one memory over time, it is also possible to use multiple memories, each responsible for fetching one item from the non-sparse vector corresponding to an index of the sparse vector. The available block RAM on the FPGA can be used as multiple small memories that are accessible in parallel. Each of these small memories contains a copy of the complete non-sparse vector from which values need to be coupled to a sparse vector's index.

Figure 9.6: Structure of finding and fetching the correct non-sparse vector block and the correct value within the found block. The architecture that is connected to the first element of the sparse vectors appears for each element.

Figure 9.7 shows the structure of the sparse-vector to non-sparse-vector coupling with multiple memory copies holding the non-sparse vector. Because the memory holds blocks instead of separate values, as discussed in the next section, the blocks still need to be multiplexed to find the actual corresponding value.
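The difference between a single memory and memory duplicates can be illustrated with a simplified behavioural model. The cycle counts below are assumptions of this sketch: one cycle per block read, with every item of the sparse vector lying in a different block in the sequential case. All function names and example values are hypothetical.

```python
def split_blocks(dense, block_size):
    """Store a dense vector as a list of blocks of `block_size` items."""
    return [dense[i:i + block_size] for i in range(0, len(dense), block_size)]


def select(memory, index, block_size):
    """Read the block that holds `index`, then multiplex the item out of it."""
    address, offset = divmod(index, block_size)
    return memory[address][offset]


def couple_sequential(sparse_vec, dense, block_size):
    """One memory: items are fetched one after another (assumed 1 cycle each)."""
    memory = split_blocks(dense, block_size)
    pairs = [(value, select(memory, index, block_size))
             for index, value in sparse_vec]
    return pairs, len(sparse_vec)


def couple_parallel(sparse_vec, dense, block_size):
    """One memory copy per sparse item: all reads happen in the same cycle."""
    copies = [split_blocks(dense, block_size) for _ in sparse_vec]
    pairs = [(value, select(memory, index, block_size))
             for memory, (index, value) in zip(copies, sparse_vec)]
    return pairs, 1


sparse_row = [(0, 2), (5, 3), (7, 4)]     # (index, value) pairs
dense = [1, 1, 1, 1, 1, 10, 1, 100]       # non-sparse vector
pairs, cycles = couple_parallel(sparse_row, dense, block_size=2)
print(sum(a * b for a, b in pairs), cycles)  # 432 1
```

Both functions return the same value pairs. In this model the sequential variant needs three read cycles for the three items, whereas the duplicated memories need one, at the cost of storing the non-sparse vector three times.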

9.3.3.3 Conclusions on sparse to non-sparse vector coupling

To realize vector operations between sparse and non-sparse vectors, memory duplicates of the non-sparse vectors are used. A parallel lookup of the multiple items in the vector would consume a lot of multiplexers, and a sequential lookup would require additional clock cycles, resulting in a much slower overall computation. The indices of the sparse vectors are used to fetch the correct vector parts from the memory. Once the vector parts have been fetched, the correct values are extracted from them.

9.3.4 Implementation structure

In document Analysis, optimization, and design of a SLAM solution for an implementation on reconfigurable hardware (FPGA) using CλaSH (Page 80-82)