Complexity - Hardware Acceleration Technologies in Computer Algebra: Challenges and Impact

In Remark 1, we describe one important observation that helps us to compute the memory requirement for our proposed sorting algorithm given in Proposition 14.

Remark 1. Our proposed algorithm creates the list of lists of integers successively. Once it creates $A_{k+1}$ from $A_k$, it does not need to store $A_k$ anymore.

Proposition 14. We need O(n+nm/p) words of storage for our proposed sorting algorithm.

Proof ⊲ The expected number of words required to store $a_i.v$, for $i \in \{0, \ldots, n-1\}$, is $O(m/p)$. Accordingly, we can store a list of lists of integers in $O(n + nm/p)$ words of storage.

Proposition 15 describes the number of times $\ell$ we need to call Algorithm 5 to be sure that all lists in $A_\ell$ have $O(1)$ integers. In Proposition 16, we describe the cost to order each list in $A_\ell$ to find the sorted order of the integers.

Proposition 15. It is expected that each list of $A_\ell$ has $O(1)$ integers, where $\ell = O(\log_p(n))$.

Proof ⊲ It is expected that any list $A_k^h$ in $A_k$ has at most $O(n/p^k)$ integers. So after calling Algorithm 5 $O(\log_p(n))$ times, we expect that all lists in the list of lists of integers have $O(1)$ integers.
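The bound on $\ell$ can be made explicit with one short step. The constant $c$ below is introduced only for illustration: it stands for the constant hidden in the expected bound $O(n/p^k)$ on the size of a list of $A_k$.

```latex
\[
  c\,\frac{n}{p^{k}} \le c
  \;\iff\; p^{k} \ge n
  \;\iff\; k \ge \log_p(n).
\]
```

Under this assumption, $\ell = \lceil \log_p(n) \rceil$ calls are enough for the expected size of every list to be at most $c = O(1)$.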

Proposition 16. Let $A_\ell$ be a list of lists where each of the lists has $O(1)$ integers, as described in Proposition 15. We need $O(1)$ comparison operations to order each list in $A_\ell$. Thus, we need $O\left(\frac{n p^2}{2(p-1)}\right)$ and $O\left(\frac{n p}{2(p-1)}\right)$ bit operations to sort $n$ non-negative dense and sparse integers, respectively.

Proof ⊲ It follows from Corollary 1 and Corollary 2.

Proposition 17. The time complexity of our proposed sorting algorithm to sort $n$ non-negative dense integers is $O\left(n \log n + p\,\frac{\log n\,(\log n + 1)}{2} + \frac{n p^2}{2(p-1)}\right)$.

Proof ⊲ Following from Proposition 11, the number of bit operations required for computing $A_{\log_p(n)}$ is $O\left(\sum_{j=1}^{\log_p(n)} (n + jp)\right)$. For dense integers, where $p$ is small, we can say $\log_p(n) \approx \log n$. Finally, from Proposition 16, we can compute the cost to sort the integers in all lists of $A_{\log_p(n)}$.
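As a worked step, the sum in this proof can be evaluated in closed form. The abbreviation $L = \log_p(n)$ is introduced here only for readability.

```latex
\[
  \sum_{j=1}^{L} (n + jp)
  \;=\; nL + p \sum_{j=1}^{L} j
  \;=\; nL + p\,\frac{L(L+1)}{2},
  \qquad L = \log_p(n).
\]
```

Substituting $L \approx \log n$ gives $n \log n + p\,\frac{\log n\,(\log n + 1)}{2}$, which accounts for the first two terms of the bound in Proposition 17. The remaining term is the dense-integer cost from Proposition 16.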

It should be noted that we do not consider the cost of creating $[a_0.v, \ldots, a_{n-1}.v]$, which is required for dense integers.

Our proposed algorithm may not be suitable for dense integers for two main reasons. First, we need to compute $[a_0.v, \ldots, a_{n-1}.v]$ in advance, which is expensive. Second, the cost of a comparison-based sorting algorithm for dense integers, given in Corollary 1, might already be good enough for practical purposes. Still, our proposed algorithm has the following properties, which might be useful in practice.

• Our sorting algorithm is stable.

• We can call Algorithm 5 a number of times to make each list of integers small. Then each of the lists can be sorted independently (or in parallel) by any comparison-based sorting algorithm. Thus, Algorithm 5 can be used as a preprocessing step of the sorting algorithm.
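The preprocessing idea in the second property can be sketched as follows. Algorithm 5 is not reproduced in this section, so the sketch substitutes a plain stable partition on the most significant base-p digits for it; the function names, the default p = 16, and the `passes` parameter are all assumptions made for illustration, not part of the proposed algorithm.

```python
from typing import List


def stable_partition_by_digit(values: List[int], p: int, shift: int) -> List[List[int]]:
    """Stably split values into p buckets by the base-p digit at position `shift`."""
    buckets: List[List[int]] = [[] for _ in range(p)]
    for v in values:  # input order is preserved inside each bucket
        buckets[(v // p**shift) % p].append(v)
    return buckets


def sort_with_preprocessing(values: List[int], p: int = 16, passes: int = 1) -> List[int]:
    """Refine buckets on the leading base-p digits, then sort each bucket.

    Stand-in for "call Algorithm 5 a few times, then sort each list
    independently"; assumes non-negative integers and p >= 2.
    """
    if not values:
        return []
    # Number of base-p digits needed to represent the largest value.
    digits = 1
    while p**digits <= max(values):
        digits += 1
    lists = [list(values)]
    for k in range(min(passes, digits)):
        shift = digits - 1 - k  # most significant digit first
        lists = [b for lst in lists for b in stable_partition_by_digit(lst, p, shift)]
    # Buckets are already ordered relative to each other, so each one can be
    # sorted on its own (or in parallel) and the results concatenated.
    out: List[int] = []
    for lst in lists:
        out.extend(sorted(lst))  # sorted() is a stable comparison-based sort
    return out
```

Each extra pass shrinks the buckets by roughly a factor of p on uniformly distributed input, so a small number of passes leaves little work for the comparison-based sort.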

Proposition 18. The time complexity of our proposed sorting algorithm to sort $n$ non-negative sparse integers is $O\left(n + p + \frac{n p}{2(p-1)}\right)$, where $\log_p(n) = O(1)$.

Proof ⊲ It follows from Propositions 11 and 16.

5.6 Conclusion

Based on its theoretical complexity, our algorithm is more suitable for sorting large sparse integers. It can be modified to sort other types of objects as well. For example, in Chapter 6 we apply this algorithm to sort binary reflected Gray codes. Moreover, it can be used as a preprocessing step when sorting dense integers. In our proposed algorithm, we suggest applying counting sort as the stable sorting algorithm for intermediate sorting. The cache-oblivious counting sort algorithm of Chapter 4 can be used for this purpose.

Chapter 6 Cache Friendly Sparse Matrix-vector Multiplication

This work is motivated by the challenges posed, in terms of data locality, by large and unstructured matrices occurring in sparse linear algebra. Our goal is to minimize the cache complexity of sparse matrix-vector multiplication. In a previous work, we experimentally observed that, for an input matrix S, column reordering based on binary reflected Gray code was a practically efficient preprocessing phase, which could be amortized against repeated multiplications of S by a dense vector [29].

In this chapter, we provide a theoretical foundation for the above observation. If S has n columns, m rows and a total number τ of non-zero entries, and if S is sufficiently sparse, we show that the columns and rows of S can be reordered in O(τ) bit operations, using the RAM model with memory holding a finite number of w-bit words, for a fixed w. This reordering of columns and rows is inspired by binary reflected Gray code. We establish a cache complexity result for sparse matrix-vector multiplication when the sparse matrix is reordered by our proposed method.
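For readers unfamiliar with binary reflected Gray code, here is a minimal sketch of its standard construction and of the inverse map. It is not the reordering algorithm of this chapter, and how bit patterns are derived from S is not specified at this point; the function names are chosen here for illustration.

```python
def gray_code(i: int) -> int:
    """Return the i-th binary reflected Gray code word (i >= 0)."""
    return i ^ (i >> 1)


def gray_rank(g: int) -> int:
    """Return the position of the code word g in the Gray code sequence."""
    i = 0
    while g:
        i ^= g  # accumulates g ^ (g >> 1) ^ (g >> 2) ^ ...
        g >>= 1
    return i


# The eight 3-bit code words, in Gray code order.
codes = [gray_code(i) for i in range(8)]
```

Consecutive code words differ in exactly one bit, which is the property that makes the ordering attractive for locality: objects that are adjacent in Gray code order have nearly identical bit patterns.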

We report numerical experiments which confirm our theoretical results. In particular, we include data for a simulation of the ideal cache model for verifying our cache complexity estimates.

This chapter is joint work with S. Hossain and M. Moreno Maza.

6.1 Introduction

Sparse matrix-vector multiplication, or SpMxV, is an important kernel in scientific computing. For example, the conjugate gradient method is an iterative linear system solving process where multiplication of the coefficient matrix S with a dense vector x is the main computational step, accounting for as much as 90% of the overall running time. While the total number of arithmetic operations (involving non-zero entries only) to compute Sx is fixed, reducing the probability of cache misses per operation by preprocessing S remains a challenging area of research. This preprocessing is done once and its cost is amortized by repeated multiplications. Computers that employ cache memory to improve the speed of data access rely on the reuse of data that is brought into the cache memory. The challenge is to exploit data locality, especially for unstructured problems, where modeling data locality is hard [68]. Pinar and Heath [59] propose column reordering to make the non-zero entries in each row contiguous. However, column reordering for arranging the non-zero entries in contiguous locations is NP-hard [59]. In a considerable volume of work [38, 29, 59, 69, 71] on the performance of SpMxV on modern processors, researchers propose optimization techniques such as the reordering of the columns or rows of S to reduce indirect access and improve data locality, and blocking to reduce memory load and loop overhead. In [37], the authors describe a number of applications of sparse matrix-vector multiplication.
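To make the indirect access mentioned above concrete, here is a minimal sketch of the SpMxV kernel. It assumes the compressed sparse row (CSR) layout, which is a common choice but is not stated in this section; the names `row_ptr`, `col_idx` and `vals` are illustrative.

```python
from typing import List, Sequence


def spmxv_csr(row_ptr: Sequence[int], col_idx: Sequence[int],
              vals: Sequence[float], x: Sequence[float]) -> List[float]:
    """Compute y = S x for a matrix S stored in CSR form (assumed layout)."""
    m = len(row_ptr) - 1  # number of rows
    y = [0.0] * m
    for i in range(m):
        acc = 0.0
        # Non-zero entries of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read through col_idx: this is the indirect access whose
            # locality depends on the column order of S.
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


# S = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
y = spmxv_csr([0, 2, 3, 5], [0, 2, 2, 0, 1], [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0])
```

The reads of `vals` and `col_idx` are sequential, while the reads of `x` jump according to the column indices. Reordering columns changes only that access pattern, not the number of arithmetic operations.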

Here, we present a new row-and-column permutation algorithm, based on binary reflected Gray codes, that runs in linear time with respect to the number of non-zero entries.

To evaluate these results, we have implemented our algorithm and analyzed its performance on a set of well-known test matrices. Our experimental results are consistent with our theoretical estimates and demonstrate the performance gains rendered by our permutation algorithm.

The organization of this chapter is as follows. In Section 6.2, we discuss some preliminary materials followed by our proposed re-ordering algorithm in Section 6.3. We analyze our preprocessing algorithm in Section 6.4 and present the experimental results in Section 6.5.
