
To the best of our knowledge, no sparse kernel or sparse library had been ported to the Cell processor (described in Section 4.3.1) when we developed this work. Nevertheless, some studies in the area [56] explored different possibilities for efficient methods on multi-core architectures.

8.2.1 Sparse Matrix-Vector Multiplication (SpMV)

The most demanding part of the solvers developed in Section 7.5 is the SpMV. Therefore, we decided to start the port to Cell with this routine, which is shown in Listing 8.1. As explained in Section 7.5, we worked with the Compressed Sparse Row (CSR) format, a common way to store sparse data.

Listing 8.1: The SpMV routine.

    for (i = 0; i < n; i++) {
        c[i] = 0;
        for (j = ia[i]; j < ia[i+1]; j++) {
            c[i] = c[i] + aa[j] * b[ja[j]];
        }
    }
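To make the CSR layout concrete, the following is a minimal, self-contained sketch of the same loop applied to a small 3x3 matrix. The matrix values, the `ex_` array names and the function name `spmv_csr` are illustrative assumptions and do not come from our solver code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example matrix, stored in CSR:
 *   | 2 0 1 |
 *   | 0 3 0 |
 *   | 4 0 5 |
 * ex_ia[i] is the offset of the first nonzero of row i (n + 1 entries),
 * ex_ja[j] is the column of nonzero j, ex_aa[j] is its value. */
static const size_t ex_ia[] = {0, 2, 3, 5};
static const size_t ex_ja[] = {0, 2, 1, 0, 2};
static const double ex_aa[] = {2.0, 1.0, 3.0, 4.0, 5.0};

/* c = A * b for an n-row CSR matrix (same loop structure as Listing 8.1). */
static void spmv_csr(size_t n, const size_t *ia, const size_t *ja,
                     const double *aa, const double *b, double *c)
{
    assert(ia[0] == 0); /* CSR offsets are expected to start at zero */
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = ia[i]; j < ia[i + 1]; j++)
            sum += aa[j] * b[ja[j]]; /* double indexing into b */
        c[i] = sum;
    }
}
```

With b = (1, 2, 3), the three row sums are 2*1 + 1*3, 3*2 and 4*1 + 5*3.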

In our implementation, the Power Processing Element (PPE) acted as the execution coordinator and was intended to take as little part as possible in the computation. The Synergistic Processing Elements (SPEs) were responsible for catching coordination messages and launching computation routines while informing the PPE of what they were doing.

The first step was to obtain the number of nonzero elements of the sparse matrix and the number of SPE threads that Cell was able to launch. Next, the PPE split the sparse matrix data as equally as possible among the available SPE threads.
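One possible way to balance the nonzeros is sketched below. It is not our actual splitting code: the function name `split_rows_by_nnz`, the `first_row` array and the choice of splitting only at row boundaries are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Split n CSR rows into p contiguous chunks with roughly equal nonzero
 * counts. first_row must hold p + 1 entries; chunk t covers rows
 * first_row[t] .. first_row[t + 1] - 1. */
static void split_rows_by_nnz(size_t n, const size_t *ia, size_t p,
                              size_t *first_row)
{
    assert(p > 0);
    size_t nnz = ia[n]; /* total number of nonzeros */
    size_t row = 0;
    first_row[0] = 0;
    for (size_t t = 1; t < p; t++) {
        /* Ideal number of nonzeros held by chunks 0 .. t - 1. */
        size_t target = (nnz * t) / p;
        /* Advance while the next whole row still fits under the target. */
        while (row < n && ia[row + 1] <= target)
            row++;
        first_row[t] = row;
    }
    first_row[p] = n;
}
```

Because whole rows are never divided, the balance is only approximate; a matrix with a few very long rows would need a finer-grained split.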

Data was transferred between main memory and each Local Store (LS) by Direct Memory Access (DMA) instructions. The Cell architecture imposed multiple alignment and size constraints on data transfers: SPEs could transfer only from and to 16-byte boundaries. We found that it was more efficient to split the matrix elements only where a new matrix row began at an aligned memory position. This condition made the code less complex overall, although it required extra code to check the alignments and ensure that transfers would not corrupt any memory position.
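The alignment condition on split points can be illustrated with a short sketch. It assumes 8-byte double values and a matrix value array whose first element sits on a 16-byte boundary; the names `offset_is_aligned` and `next_aligned_row` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DMA_ALIGN 16u /* required boundary for the transfers, in bytes */

/* Nonzero number idx starts on a 16-byte boundary, assuming the value
 * array itself starts on one (an assumption of this sketch). */
static int offset_is_aligned(size_t idx)
{
    return (idx * sizeof(double)) % DMA_ALIGN == 0;
}

/* Smallest row r >= row whose first nonzero is 16-byte aligned,
 * or n if no such row exists. */
static size_t next_aligned_row(size_t n, const size_t *ia, size_t row)
{
    assert(row <= n);
    while (row < n && !offset_is_aligned(ia[row]))
        row++;
    return row;
}
```

A candidate split point produced by a load-balancing step could be passed through `next_aligned_row` so that every chunk starts both at a row boundary and at an aligned address.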

Access to the right-hand vector (b array) with double indexing (b[ja[j]]) had direct implications for the splitting method, because DMA transfers from main memory were easier to control when data was in consecutive positions: only the first position and the size were required. We developed two different approaches to the double-indexing issue, depending on which element prepared the entries of the right-hand vector (b array).

PPE Gatherer

The PPE was responsible for preparing a working array with the elements of the right-hand vector (b array) arranged so that they matched the nonzero matrix elements for the multiplication. The working array was split in the same way as the matrix. The DMA transfers of this working array supplied the elements of the right-hand vector in one-to-one correspondence with the matrix elements, and as a result SPEs could start computing as soon as the transfers were completed.
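The idea can be sketched in plain C, with the SPE side modeled as an ordinary function rather than real SPE code. The function names `gather_rhs` and `spmv_gathered` and the working array `w` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* PPE side: build the working array so that w[j] pairs with aa[j].
 * This is the scalar loop that has to run before every SpMV call. */
static void gather_rhs(size_t nnz, const size_t *ja, const double *b,
                       double *w)
{
    for (size_t j = 0; j < nnz; j++)
        w[j] = b[ja[j]];
}

/* SPE side (modeled): rows [row_begin, row_end) of the product, reading
 * aa and w as two contiguous streams with no indirect access. */
static void spmv_gathered(const size_t *ia, const double *aa,
                          const double *w, double *c,
                          size_t row_begin, size_t row_end)
{
    assert(row_begin <= row_end);
    for (size_t i = row_begin; i < row_end; i++) {
        double sum = 0.0;
        for (size_t j = ia[i]; j < ia[i + 1]; j++)
            sum += aa[j] * w[j];
        c[i] = sum;
    }
}
```

The working array has one entry per nonzero, so it is as long as the value array and must be rebuilt whenever the right-hand vector changes.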

Figure 8.2 shows the full execution time of three calls to the SpMV routine. The top line is the PPE thread, and the lines below correspond to the eight SPE threads launched after the matrix has been loaded. Light blue means idle, so the trace shows that the SPEs suffer from starvation and the expected performance is therefore poor. In summary, the major bottleneck in this implementation is the PPE.

Figure 8.2: Trace view for three calls to the SpMV routine using the "PPE gatherer".

SPE Gatherer

The goal was to remove the PPE scalar loop that prepared the working array before every SpMV call. The solution kept the right-hand vector as it was in memory, and each SPE was in charge of transferring its assigned chunk element by element. We considered several methods to reach the best efficiency with multiple small DMA transfers, and finally decided to implement a variation of the method described in [84]. This time, SPEs got the elements of the right-hand vector through a DMA list. A new data structure was required in main memory to hold the addresses of the right-hand vector elements and their memory alignments (0 or 8). This structure was needed because of an SPE memory constraint: when fewer than 16 bytes were moved in a DMA transfer, the data kept the same alignment in the LS as in main memory. The structure was computed once by the PPE, and SPEs were free to check it before each transfer.

The procedure was as follows. First, SPEs transferred the addresses and alignments. Second, each SPE configured the DMA list with the addresses and launched it to get the element values. Since each element had its own main memory alignment, it was impossible to receive them in consecutive positions. In our implementation, SPEs allocated twice the amount of memory that the transfer represented, so that each element occupied bytes 0-7 or 8-15 of its own 16-byte slot without conflicts. Next, a compression step used the alignments to remove the gaps between elements.
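The slot layout and the compression step can be modeled in portable C without the Cell SDK. The sketch below is only a model: the DMA list is replaced by a plain loop, element indices stand in for main-memory addresses, the right-hand vector is assumed to start on a 16-byte boundary with 8-byte doubles, and the names `rhs_ref`, `build_rhs_refs`, `dma_list_model` and `compress_slots` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define QUADWORD 16u /* slot size in bytes */

typedef struct {
    size_t   index; /* stand-in for the main-memory address of b[index] */
    unsigned align; /* 0 or 8: byte offset inside its 16-byte quadword */
} rhs_ref;

/* PPE side, computed once: record where each needed element lives. */
static void build_rhs_refs(size_t count, const size_t *ja, rhs_ref *refs)
{
    for (size_t k = 0; k < count; k++) {
        refs[k].index = ja[k];
        refs[k].align = (unsigned)((ja[k] * sizeof(double)) % QUADWORD);
    }
}

/* Model of the DMA list: element k lands in slot k (two doubles wide)
 * at the same byte offset it has in main memory. */
static void dma_list_model(size_t count, const rhs_ref *refs,
                           const double *b, double *slots)
{
    for (size_t k = 0; k < count; k++) {
        assert(refs[k].align == 0 || refs[k].align == 8);
        slots[2 * k + refs[k].align / 8] = b[refs[k].index];
    }
}

/* Compression: use the alignments to drop the unused half of each slot. */
static void compress_slots(size_t count, const rhs_ref *refs,
                           const double *slots, double *packed)
{
    for (size_t k = 0; k < count; k++)
        packed[k] = slots[2 * k + refs[k].align / 8];
}
```

After compression, `packed` plays the same role as the working array of the PPE gatherer, but it is produced on the SPE from the unmodified right-hand vector.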

Figure 8.3 contains three iterations of this implementation and confirms that the PPE thread no longer spends time gathering, while the SPEs now perform both the gather step and the SpMV. The white color shows that, thanks to this new approach, the SPEs are working most of the time.

Figure 8.3: Trace view for three calls to the SpMV routine using the "SPE gatherer".

8.2.2 Evaluation

We integrated our implementation of the SpMV routine into a conjugate gradient solver to test its performance.

Figure 8.4 and Figure 8.5 present the time and speedup of this solver with two different matrix sizes (0.47M and 3.5M nonzero elements) on one Cell processor of MariCel. Both plots show that the implementation with the SPE gatherer performs better, although it significantly increases the complexity of the SPE and PPE codes.

Figure 8.6 shows the computing time per iteration with a matrix of 4.5M nonzero elements. The comparison shows that the Cell processor (MariCel) is unable to reach the performance of the PowerPC processor (MareNostrum II).

To sum up, although we made real efforts to exploit all the resources that the Cell processor offered, good performance was not reached. Our humble impression is that this architecture was conceived for dense computing, such as video games, whereas scientific computing requires good performance with sparse data. The constraints of the Cell processor prevented good performance with sparse data and also made the programming a tedious task.

[Plot: time (s) versus number of SPE threads (2 to 8). Series: SPE gather and PPE gather, each for nnz=0.47M and nnz=3.57M.]

Figure 8.4: Time for the two gather implementations on MariCel.

[Plot: speedup versus number of SPE threads (1 to 8). Series: PPE gather and SPE gather, each for nnz=0.47M and nnz=3.57M.]

Figure 8.5: Scalability for the two gather implementations on MariCel.

[Plot for Figure 8.6: time (ms), axis from 0 to 120, by system. Bars: PPC970, CELL gather PPE, CELL gather SPE.]