
Chapter 7. Summary and Future Work

7.2. Future Work

In the following, we will point to some of the future directions that could expand this multi-dimensional algorithm/architecture codesign space. We briefly cover each category of potential future research.

Micro-Architecture Level. PE and LAC designs may need further modifications in their logic and architecture to support more applications. Examples include floating-point units that can operate at variable precision, or extending the PEs to support additional special functions such as CORDIC. Furthermore, we have to design the logic for the core interface to on-chip memory and study its design tradeoffs.

System-Level Explorations. System-level integration is an important direction that opens up multiple research topics. The host interface for integrating one or more LAPs (or LACs) with one or more on-chip or off-chip host processors is part of system-level development. We will try to clarify the design space of the LAP in more detail when it is placed in heterogeneous systems. To achieve this, we plan to extend our cycle-accurate simulator and integrate it into other multi-core simulators like MARSSx86 [102] or GEM5 [18] to study detailed design tradeoffs at both the core and chip level. These details include invocation, completion, memory addressing and task granularity (see Section 2.2.4). In a heterogeneous system, tasks have a computational cost, and there is a communication cost as data is moved between resources. One research direction is to investigate how best to perform coarse-grain task scheduling and load balancing to exploit heterogeneous multicore architectures.

Software Techniques and Programming Interface. Future research directions on the software side include integration with existing libraries and using software techniques to optimize performance. We plan to collaborate with members of the FLAME research group in order to integrate our proposed LAP with libflame [138], a modern alternative to the widely used LAPACK [12] library. Advanced software techniques like loop fusion could be used in our codesign process to further optimize kernels and take advantage of data locality on target architectures.

Generalization. The goal of generalization is to map more algorithms on the LAC and analyze the associated cost in power and efficiency. In the end, a design space spectrum of flexibility and performance versus efficiency can be derived from this study. We plan to implement the collective communication routines for the hardware interconnect between PEs and add necessary hardware if needed. Furthermore, it becomes worthwhile to investigate widely used operations like Singular Value Decomposition (SVD) in the domain of linear algebra. We could try to go beyond FFT and codesign the LAC to map a wider class of signal processing applications as well. Finally, algorithms like Multi-Layer Perceptron (MLP), and Local Linear Model Tree (LOLIMOT) are based on computations on huge data sets that are processed as matrices [109]. We aim to study trade-offs and costs of adding such functionalities to the LAC.

Appendix A

Core Level Extensions for Matrix Factorizations

Within the dense linear algebra domain, a typical computation can be blocked into sub-problems that expose highly parallelizable parts like GEneral Matrix-matrix Multiplication (GEMM). These can be mapped very efficiently to accelerators. However, many current solutions use heterogeneous computing for more complicated algorithms like Cholesky, QR, and LU factorization [7, 143]. Often, only the simplest parts of these algorithms, which exhibit ample parallelism, are performed on the accelerator. Other more complex parts, which are added to the algorithm to overcome floating-point limitations or which would require complex hardware to exploit fine-grain parallelism, are offloaded to a general-purpose processor. The problem with heterogeneous solutions is the overhead for communication back and forth with a general-purpose processor. In the case of current GPUs, data has to be copied to the device memory and then back to the host memory through slow off-chip buses. Even when GPUs are integrated on the chip, data has to be moved all the way to off-chip memory in order to perform transfers between (typically) incoherent CPU and GPU address spaces.

While the CPU could be used to perform other tasks efficiently, it wastes cycles synchronizing with the accelerator and copying data. Often, the accelerator in turn remains idle waiting for data to be processed by the CPU, also wasting cycles. This is particularly noticeable for computations with small matrices.

In this appendix, we propose new solutions that try to avoid the inefficiencies caused by limitations in current architectures and thereby overcome the complexities in matrix factorization algorithms. The problem is that architecture designers typically only have a high-level understanding of algorithms, while algorithm designers try to optimize for already existing architectures. Our solution is to revisit the whole system design by relaxing the architecture design space. By this we mean allowing architectural changes to the design in order to reduce complexity directly in the algorithm whenever possible. Thus, the solution is to exploit algorithm/architecture co-design. We add minimal, necessary but sufficient logic to the LAC design to avoid the need for running complex computations on a general-purpose core.

A.1 Related Work

Implementation of matrix factorizations on both conventional high-performance platforms and accelerators has been widely studied. Many existing solutions perform more complex kernels on a more general-purpose (host) processor while the high-performance engine only computes parallelizable blocks of the problem [7, 143].

The typical solution for LU factorization on GPUs is presented in [143]. The details of multi-core, multi-GPU QR factorization scheduling are discussed in [7]. A solution for QR factorization that can be entirely run on the GPU is presented in [71]. For LU factorization on GPUs, a technique to reduce matrix decomposition and row operations to a series of rasterization problems is used [44]. There, pointer swapping is used instead of data swapping for pivoting operations.

On FPGAs, [151] discusses LU factorization without pivoting. However, when pivoting is needed, the algorithm mapping becomes more challenging and less efficient due to complexities of the pivoting process and wasted cycles. LAPACKrc [49] is an FPGA library with functionality that includes Cholesky, LU and QR factorizations. The architecture has similarities to the LAP. However, due to limitations of FPGAs, it does not have enough local memory. Concepts similar to those in this document, covering FPGA implementation and the design of a unified, area-efficient unit that can perform the computations needed for Householder QR factorization (division, square root and inverse square root operations, which will be discussed later), are presented in [13]. Finally, a tiled matrix decomposition based on blocking principles is presented in [130].

A.2 Hardware Extensions

In this section, we discuss how to overcome the challenges discussed in Section 6.1 with regard to the mapping of factorization algorithms on the LAC. These extensions allow an architecture to perform more complex operations more efficiently. We will introduce architecture extensions that provide such improvements specifically for factorizations. However, such extensions also introduce a base overhead in all operations, since they add extra logic and increase power and area consumption. Corresponding trade-offs will be analyzed in the results section.

Here, we focus on small problems that fit in the LAC memory. Bigger problem sizes can be blocked into smaller problems that are mainly composed of Level-3 BLAS operations (discussed in [135]) and algorithms for smaller problems discussed here. We briefly review the relevant algorithms and their micro-architecture mapping in Section 6.1. The purpose is to expose specialized operations, utilized by these algorithms, that can be supported in hardware. We start by analyzing opportunities for extensions targeting Cholesky and LU factorization, followed by solutions to complexities in vector norm operations.

A.2.1 Cholesky Factorization

We observe that the key complexity when performing Cholesky factorization is the inverse square-root operation. If we add this ability to the core's diagonal PEs, the LAC can perform the inner kernel of the Cholesky factorization natively. The last state of the $n_r \times n_r$ Cholesky factorization will save even more cycles if a square-root function is available. The $n_r \times n_r$ Cholesky [...] However, it is a very small part of a bigger, blocked Cholesky factorization. Again, the goal here is to avoid sending data back and forth to a general-purpose processor or performing this operation in emulation on the existing MAC units, which would keep the rest of the core largely idle.
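As an illustration of where the inverse square root enters, the following is a minimal software sketch of an unblocked Cholesky factorization, not the LAC mapping itself. The function name `cholesky_inv_sqrt` and the list-of-lists matrix layout are assumptions made for this example. Each step computes one inverse square root of the diagonal element and then uses only multiplications to scale the column.

```python
import math


def cholesky_inv_sqrt(a):
    """Return lower-triangular L with A = L * L^T (A symmetric positive definite).

    Hypothetical helper for illustration: one inverse square root per
    diagonal element, followed by multiplications only.
    """
    n = len(a)
    a = [row[:] for row in a]  # work on a copy of the input
    for k in range(n):
        if a[k][k] <= 0.0:
            raise ValueError("matrix is not positive definite")
        inv = 1.0 / math.sqrt(a[k][k])   # the inverse square-root operation
        a[k][k] *= inv                   # sqrt(a_kk) = a_kk * (1 / sqrt(a_kk))
        for i in range(k + 1, n):
            a[i][k] *= inv               # scale the column below the diagonal
        for j in range(k + 1, n):        # rank-1 update of the trailing part
            for i in range(j, n):
                a[i][j] -= a[i][k] * a[j][k]
    # Keep only the lower triangle.
    return [[a[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
```

Calling `cholesky_inv_sqrt` on a small symmetric positive definite matrix and multiplying the result by its transpose should reproduce the input up to rounding.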

A.2.2 LU Factorization with Partial Pivoting

For LU factorization with partial pivoting, PEs in the LAC must be able to compare floating-point numbers to find the pivot (S1 in Section 6.1.2). In the blocked LU factorization, we have used the left-looking algorithm, which is the most efficient variant with regard to data locality [17]. In the left-looking LU factorization, the PEs themselves compute the temporary values that they will compare in the next iteration of the algorithm. Given this, the compare operation can be performed implicitly as these values are produced, so that it adds no extra latency or delay penalty.

The next operation that is needed for LU factorization is the reciprocal (1/x). The reciprocal of the pivot needs to be computed for scaling the elements by the pivot (S2 in Section 6.1.2). This way, we avoid multiple division operations and simply scale all the values by multiplying them with the reciprocal of the pivot.
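The two operations named above, the pivot search by comparison and the scaling by a single reciprocal, can be seen in a small software sketch. This is a plain unblocked, right-looking LU factorization chosen for brevity, not the left-looking blocked variant used on the LAC; the function name `lu_partial_pivot` and the list-of-lists layout are assumptions for illustration.

```python
def lu_partial_pivot(a):
    """Return (lu, perm): L (unit lower) and U packed in one matrix, plus row order.

    Hypothetical helper for illustration; row perm[i] of the input ends up
    as row i of the factored matrix.
    """
    n = len(a)
    a = [row[:] for row in a]  # work on a copy of the input
    perm = list(range(n))
    for k in range(n):
        # Pivot search: compare magnitudes in column k (floating-point compare).
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        perm[k], perm[p] = perm[p], perm[k]
        # One reciprocal of the pivot, then multiplications instead of divisions.
        r = 1.0 / a[k][k]
        for i in range(k + 1, n):
            a[i][k] *= r
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a, perm
```

Note that multiplying by a rounded reciprocal can differ in the last bit from a true division; the sketch accepts this, as the text does, in exchange for avoiding repeated divisions.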

A.2.3 QR Factorization and Vector Norm

In Section 6.1.3, we showed how the vector norm operation is performed in conventional computers to avoid overflow and underflow. The extra operations that are needed to perform vector norm in a conventional fashion are the following: a floating-point comparator to find the maximum value in the vector just as in LU factorization, a reciprocal function to scale the vector by the maximum value, again just as in LU factorization, and a square-root unit to compute the length of the scaled vector just as what is needed to optimize the last iteration of an $n_r \times n_r$ Cholesky factorization. However, we can observe that all these extra operations are only necessary due to limitations in hardware representations of real numbers.
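The conventional approach described above can be sketched in a few lines. The function names `norm2_naive` and `norm2_scaled` are assumptions for this example; the scaled version uses exactly the three extra operations listed in the text (a maximum search, a reciprocal, and a square root), while the naive version shows the overflow they guard against.

```python
import math


def norm2_naive(x):
    """Square root of the sum of squares; squares may overflow to infinity."""
    s = 0.0
    for v in x:
        s += v * v
    return math.sqrt(s)


def norm2_scaled(x):
    """Two-pass norm: scale by the largest magnitude before squaring.

    Sketch only: a subnormal maximum could still overflow the reciprocal.
    """
    if not x:
        return 0.0
    t = max(abs(v) for v in x)   # comparator pass: find the maximum magnitude
    if t == 0.0:
        return 0.0
    r = 1.0 / t                  # reciprocal of the maximum
    s = 0.0
    for v in x:
        y = v * r                # scaled element, magnitude at most about 1
        s += y * y
    return t * math.sqrt(s)      # square root of the scaled sum, rescaled
```

For a vector such as `[1e200, 1e200]` in double precision, the naive version returns infinity because each square exceeds the representable range, whereas the scaled version returns a finite value.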

Consider a floating-point number $f$ that, according to the IEEE floating-point standard, is represented as $1.m_1 \times 2^{e_1}$, where $1 \le 1.m_1 < 2$. Let us investigate the case of an overflow for $p = f^2$, and as a result $p = (1.m_2) \times 2^{e_2} = (1.m_1)^2 \times 2^{2e_1}$, where $1 \le (1.m_1)^2 < 4$. If $(1.m_1)^2 < 2$, then $e_2 = 2e_1$. But, if $2 \le (1.m_1)^2$, then $2 \le (1.m_1)^2 = 2 \times 1.m_2 < 4$ and therefore $e_2 = 2e_1 + 1$. In both cases, a single extra exponent bit suffices for avoiding overflow and underflow in computations of the square of a floating-point number.
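The exponent argument can be checked numerically. The sketch below assumes double precision and uses the standard `math.frexp` function to split a number into mantissa and exponent; the helper name `square_exponent` is hypothetical. It returns the exponent of the square, computed from the mantissa and exponent of the input alone, so the result is available even when the square itself would overflow.

```python
import math


def square_exponent(f):
    """Exponent e2 of f*f written as 1.m2 * 2**e2, computed without forming f*f."""
    if f == 0.0 or math.isinf(f) or math.isnan(f):
        raise ValueError("expected a finite, nonzero number")
    m, e = math.frexp(abs(f))      # abs(f) = m * 2**e with 0.5 <= m < 1
    mant, e1 = 2.0 * m, e - 1      # IEEE-style form: 1 <= mant < 2
    sq = mant * mant               # squared mantissa lies in [1, 4)
    # Either the exponent simply doubles, or one extra carry is needed.
    return 2 * e1 + (1 if sq >= 2.0 else 0)
```

For example, `square_exponent(1e200)` yields a value above 1023, the largest exponent of a normal double, but at most 2047, which is the range one additional exponent bit would cover.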

Still, there might be the possibility of overflow/underflow due to accumulation of big/small numbers, which could be avoided by adding a second exponent bit. However, the square root of such an inner product would still be out of the bounds of a standard floating-point number, so the result could not be represented in any case. Therefore, a single additional bit suffices. Hence, what is needed is a floating-point unit that has the ability to add one exponent bit for computing the vector norm to avoid overflows and corresponding algorithm complexities.
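A software sketch can mimic the effect of a wider exponent during accumulation. In the version below, the accumulator is kept as a mantissa together with a separate Python integer exponent, which is unbounded and therefore only approximates the single extra exponent bit proposed in the text. The function name `norm2_extended_exponent` is an assumption for illustration. The norm is computed in one pass, without a maximum search or a reciprocal.

```python
import math


def norm2_extended_exponent(x):
    """One-pass 2-norm; the sum of squares is held as acc_m * 2**acc_e."""
    acc_m, acc_e = 0.0, 0          # acc_e is a Python int: the "extended" exponent
    for v in x:
        if v == 0.0:
            continue
        m, e = math.frexp(v)       # v = m * 2**e with 0.5 <= |m| < 1
        sq_m, sq_e = m * m, 2 * e  # square: exponent doubles, no overflow possible
        if acc_m == 0.0:
            acc_m, acc_e = sq_m, sq_e
        elif sq_e > acc_e:         # align the accumulator to the larger exponent
            acc_m = math.ldexp(acc_m, acc_e - sq_e) + sq_m
            acc_e = sq_e
        else:                      # align the new square to the accumulator
            acc_m += math.ldexp(sq_m, sq_e - acc_e)
        acc_m, shift = math.frexp(acc_m)   # renormalize the mantissa
        acc_e += shift
    if acc_m == 0.0:
        return 0.0
    if acc_e % 2:                  # make the exponent even before halving it
        acc_m *= 2.0
        acc_e -= 1
    return math.ldexp(math.sqrt(acc_m), acc_e // 2)
```

The final result is back in the standard range whenever the norm itself is representable, which mirrors the observation that the extended exponent is only needed for the intermediate sum of squares.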

[Figure: block diagram with panels (a), (b), (c). Recoverable block labels: look-up tables for 1/Sqrt(X) and 1/X, squaring, CS2D, fused accumulation tree, CPA, MAC, MAC input select logic, multiplier, alignment shift, exponent comparison, exponent adder, accumulator, normalization, correction, comparator logic (max exponent, max mantissa), exponent control, shift, sign inversion.]

Figure A.1: Extended reconfigurable single-cycle accumulation MAC unit [63] with addition of a comparator and extended exponent bit-width, where shaded blocks show which logic should change for exponent bit extension.

A.3 Architecture

In this section, we describe the proposed architecture for our floating-point MAC unit and the extensions made to it for matrix factorization applications. We start from a single-cycle accumulating MAC unit and explain the modifications for LU and vector norm operations. Then, we describe the extensions for reciprocal, inverse square-root, and square-root operations.

In document Algorithm/architecture codesign of low power and high performance linear algebra compute fabrics (Page 156-167)