Chapter 7. Summary and Future Work
A.4. Experimental Results and Implementations
A.4.2. Performance and Efficiency Analysis
In this part, we analyze the unblocked inner kernels of the three factorization algorithms. We study the performance and efficiency of our extensions for these algorithms across different inner kernel problem sizes. An important point is that even larger problem sizes are usually blocked into smaller subproblems, which casts most of the operations into a combination of highly efficient level-3 BLAS operations and the complex inner kernels that we discuss here. Many accelerators only support level-3 BLAS and perform more complex kernels on the host processor. The overhead of sending the data associated with these computations back and forth is significant and wastes cycles. However, such issues are outside the scope of this document. What we want to show here is how effective our proposed extensions are in achieving high performance for the inner kernels compared to the baseline architecture with a micro-coded software solution.
Cholesky factorization can be blocked in a 2D fashion by breaking the problem down into a few level-3 BLAS operations and a Cholesky inner kernel. For our experiment, we evaluate a 4 × 4 unblocked Cholesky. We study the effects of different divide/square-root schemes on the performance of this inner kernel. The kernel performance and utilization are low because of the dependencies and the latency of the inverse square-root operation. We observe (Table A.2) that the number of cycles drops by a third when switching from a software solution to hardware extensions on the LAC.
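The dependency on the inverse square root can be seen in a minimal sketch of an unblocked, right-looking Cholesky kernel. This is an illustrative Python version, not the LAC micro-code; the function name `cholesky_unblocked` and the operation-counting convention (one special-function operation per diagonal element, one multiply per scaled element, one multiply and one subtract per updated element) are assumptions made here. Under that convention a 4 × 4 kernel performs 30 operations, which matches the count quoted later in this section.

```python
import math


def cholesky_unblocked(a):
    """In-place right-looking Cholesky of a symmetric positive-definite matrix.

    `a` is a list of row lists; only the lower triangle is read and written.
    Returns an operation count under the convention described in the text
    (an assumed convention, chosen for illustration).
    """
    n = len(a)
    ops = 0
    for j in range(n):
        # Square root and reciprocal of the diagonal, counted together as one
        # special-function operation. Everything else in this iteration has to
        # wait for it, which is what limits utilization.
        a[j][j] = math.sqrt(a[j][j])
        inv = 1.0 / a[j][j]
        ops += 1
        # Scale the column below the diagonal.
        for i in range(j + 1, n):
            a[i][j] *= inv
            ops += 1
        # Rank-1 update of the trailing lower triangle.
        for c in range(j + 1, n):
            for r in range(c, n):
                a[r][c] -= a[r][j] * a[c][j]
                ops += 2
    return ops
```

Iteration j touches (n - j)^2 operations under this convention, so the total is the sum of the first n squares; for n = 4 that is 16 + 9 + 4 + 1 = 30.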
[Plot for Figure A.3. Vertical axis: GOPS/W (0 to 35). Horizontal axis: three types of sqrt/division units (SW, Isolate, Diag) for kernel heights 64, 128, 256. Series: LU No Ext, LU + Comparator.]
Figure A.3: The effect of hardware extensions and problem sizes on the power efficiency of LU factorization with partial pivoting inner kernel.
[Plot for Figure A.4. Vertical axis: GOPS/mm^2 (0 to 3). Horizontal axis: three types of sqrt/division units (SW, Isolate, Diag) for kernel heights 64, 128, 256. Series: LU No Ext, LU + Comparator.]
Figure A.4: The effect of hardware extensions and problem sizes on the area efficiency of LU factorization with partial pivoting inner kernel.
[Plot for Figure A.5. Vertical axis: GFLOPS^2/W (0 to 200). Horizontal axis: three types of sqrt/division units (SW, Isolate, Diag) for kernel heights 64, 128, 256. Series: LU No Ext, LU + Comparator.]
Figure A.5: The effect of hardware extensions and problem sizes on the inverse E-D metric of LU factorization with partial pivoting inner kernel.
The pivoting operation and scaling need to be done for all rows of a given problem size. Hence, for a problem size of k × k, the inner kernel that should be implemented on the LAC is an LU factorization of a k × nr block of the original problem. For our studies, we use problems with k = 64, 128, 256, which are typical problem sizes that fit on the LAC. We compare the performance of a LAC with different divide/square-root unit extensions in different columns and with or without the built-in comparator used to find the pivot. As we have shown in Section 6.1, the reciprocal operation and pivoting (switching the rows) can be performed concurrently in the LAC owing to the column broadcast buses. The pivoting delay is the dominating term. Hence, bigger problem sizes are not sensitive to the latency of the reciprocal unit architecture. However, there is a 20% speed and 15% energy improvement with the comparator added to the MAC units.
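To make the structure of this inner kernel concrete, the following is a minimal Python sketch of LU factorization with partial pivoting on a k × nr panel. It is a sequential reference version only and says nothing about how the LAC schedules the work across PEs. The function name `lu_panel` is hypothetical, and the sketch assumes every pivot column contains a nonzero entry. The comments mark the three steps discussed above: the pivot search (the job of the comparator), the reciprocal, and the row swap.

```python
def lu_panel(a):
    """In-place LU with partial pivoting of a k x nr panel (k >= nr).

    `a` is a list of k rows, each a list of nr floats. On return, the strict
    lower part holds the multipliers of L (unit diagonal implied) and the
    upper nr x nr part holds U. Returns the list of pivot row indices.
    """
    k, nr = len(a), len(a[0])
    piv = []
    for j in range(nr):
        # Pivot search: largest magnitude in column j at or below the diagonal.
        # This scan over up to k rows is the step a comparator accelerates.
        p = max(range(j, k), key=lambda i: abs(a[i][j]))
        piv.append(p)
        # Row swap (pivoting).
        a[j], a[p] = a[p], a[j]
        # Reciprocal of the pivot; assumes the pivot is nonzero.
        inv = 1.0 / a[j][j]
        for i in range(j + 1, k):
            a[i][j] *= inv                      # scaling
            for c in range(j + 1, nr):
                a[i][c] -= a[i][j] * a[j][c]    # rank-1 update
    return piv
```

The pivot search and scaling run over all remaining rows of the panel for each of the nr columns, which is consistent with the observation that the pivoting term dominates as k grows.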
Vector norm as part of a Householder transformation only utilizes a single column of PEs for the inner product and reduction. To measure the maximum achievable efficiency, we assume that four different vector norms are computed concurrently, one in each column. Note that the baseline is the original normalizing vector norm. We have three options for divide/square-root operations and three options for MAC unit extensions. For the MAC unit, the first option is a micro-coded software solution, the second option utilizes the comparator in the MAC unit without an exponent extension, and the last is a MAC unit with an extra exponent bit. The problem sizes are again vector lengths of k = 64, 128, 256. As shown in Table A.2, the exponent extension halves the total cycles, and the divide/square-root unit saves up to 30% of the cycles compared to the baseline. Energy savings reach up to 60% with the exponent bit extension. By contrast, the different divide/square-root units do not differ in terms of dynamic energy consumption.
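The difference between the normalizing baseline and a direct sum of squares can be sketched in Python as follows. This is an illustration under stated assumptions, not the LAC implementation: the function names are hypothetical, and ordinary double-precision floats stand in for the MAC datapath, so the sketch can only show why the baseline needs its extra passes. It cannot reproduce the extra exponent bit itself, whose purpose is to make the direct form safe against overflow in the accumulator.

```python
import math


def norm_scaled(x):
    """Normalizing vector norm: scale by the largest magnitude first.

    Needs a comparison pass and an extra multiply per element, but the
    squared terms never exceed 1, so they cannot overflow.
    """
    t = max(abs(v) for v in x)       # comparison pass (comparator work)
    if t == 0.0:
        return 0.0
    inv_t = 1.0 / t                  # reciprocal
    s = 0.0
    for v in x:
        y = v * inv_t                # scaling multiply
        s += y * y                   # multiply-accumulate
    return t * math.sqrt(s)


def norm_direct(x):
    """Direct sum of squares: one multiply-accumulate per element.

    Overflows to infinity when the squares exceed the exponent range.
    """
    s = 0.0
    for v in x:
        s += v * v
    return math.sqrt(s)
```

For well-scaled data both functions agree, for example on `[3.0, 4.0]`. For `[1e200, 1e200]` the direct version overflows to infinity in double precision while the scaled version returns a finite result. A wider exponent range in the accumulator removes that failure mode and with it the comparison and scaling passes.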
We assume a clock frequency of 1 GHz for the LAC. Utilization and efficiency can be calculated from the total number of cycles the hardware needs to perform an operation and the number of operations in each factorization. Power efficiency for vector norm and LU is presented in Figures A.6 and A.3, respectively, and area efficiency in Figures A.7 and A.4. Another metric that we use is the inverse energy-delay (Figures A.8 and A.5). It shows how the extensions reduce both latency and energy consumption. Note that for LU factorization, the pivoting operation is also taken into account. Therefore, we used GOPS instead of GFLOPS as the performance metric. For LU factorization problems with k = 64, 128, 256, we estimated the corresponding total number of operations to be 1560, 3096 and 6168, respectively. For the vector norm, we use the original algorithm as the baseline, which requires 257, 769 or 1025 operations per corresponding vector norm of size k = 64, 128, 256. Since our implementation effectively reduces the number of computations actually required, the extensions achieve higher GOPS/W than the peak GFLOPS/W reported for the LAC in [105].
Results for LU factorization confirm that there is no improvement in efficiency with different reciprocal architectures when solving big problem sizes. Given this fact, the isolated unit seems to be the better option for LU factorization.
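The way these metrics follow from cycle and operation counts can be written out as a short Python sketch. The function name `kernel_metrics` and its parameters are hypothetical, and the cycle count, power and area in the usage note are placeholders rather than measured values; only the 1 GHz clock and the operation counts come from the text.

```python
def kernel_metrics(ops, cycles, power_w, area_mm2,
                   peak_ops_per_cycle, freq_hz=1e9):
    """Derive efficiency metrics for one kernel run.

    ops                 operations in the kernel (e.g. 1560 for LU, k = 64)
    cycles              total cycles taken (placeholder input)
    power_w             average power in watts (placeholder input)
    area_mm2            core area in mm^2 (placeholder input)
    peak_ops_per_cycle  peak throughput of the core (placeholder input)
    """
    time_s = cycles / freq_hz
    gops = ops / time_s / 1e9
    return {
        "utilization": ops / (cycles * peak_ops_per_cycle),
        "gops": gops,
        "gops_per_w": gops / power_w,          # power efficiency
        "gops_per_mm2": gops / area_mm2,       # area efficiency
        # Inverse energy-delay: (ops/t)^2 / P = ops^2 / (E * t), so for a
        # fixed operation count it grows as energy E and delay t shrink.
        "inv_energy_delay": gops * gops / power_w,
    }
```

For example, `kernel_metrics(ops=1560, cycles=1000, power_w=0.05, area_mm2=2.0, peak_ops_per_cycle=32)` evaluates the metrics for placeholder inputs. As a side observation, the three LU operation counts quoted above equal 24(k + 1) for k = 64, 128, 256; this is arithmetic on the quoted numbers, not a formula stated in the text.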
[Plot for Figure A.6. Vertical axis: GFLOPS/W (0 to 120). Horizontal axis: three types of sqrt/division units (SW, Isolate, Diag) for kernel heights 64, 128, 256. Series: Vnorm No Ext, Vnorm + Comparator, Vnorm + Exp Ext.]
Figure A.6: The effect of hardware extensions and problem sizes on the power efficiency of vector norm inner kernel.
[Plot for Figure A.7. Vertical axis: GFLOPS/mm^2 (0 to 14). Horizontal axis: three types of sqrt/division units (SW, Isolate, Diag) for kernel heights 64, 128, 256. Series: Vnorm No Ext, Vnorm + Comparator, Vnorm + Exp Ext.]
Figure A.7: The effect of hardware extensions and problem sizes on the area efficiency of vector norm inner kernel.
[Plot for Figure A.8. Vertical axis: GFLOPS^2/W (0 to 4000). Horizontal axis: three types of sqrt/division units (SW, Isolate, Diag) for kernel heights 64, 128, 256. Series: Vnorm No Ext, Vnorm + Comparator, Vnorm + Exp Ext.]
Figure A.8: The effect of hardware extensions and problem sizes on the inverse E-D metric of vector norm inner kernel.
By contrast, vector norm benefits from all types of extension. However, the exponent bit is what brings significant improvements in efficiency.
Since there are not many options for Cholesky, we only summarize the numbers here in the text. The number of operations in a 4 × 4 Cholesky kernel is 30. For the different divide/square-root unit architectures (software, isolated, and on diagonal PEs), the achieved efficiencies are as follows: 1.95, 4.67 and 5.75 GFLOPS/W; 0.52, 4.95, and 5.15 GFLOPS^2/W; and 0.03, 0.06, 0.07 GFLOPS/mm^2. The reason for the very poor efficiency (under 6 GFLOPS/W in every configuration) is the small size of the kernel and the limited available parallelism. Still, adding the special function unit improves the inverse energy-delay metric around ten times, while reducing dynamic energy consumption by 75%.
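The size of the improvement depends on which metric is compared. The following short Python check, using only the Cholesky figures quoted above, computes the gain of each hardware option over the software solution; the dictionary layout and the helper name `gains_over_software` are illustrative choices.

```python
# Cholesky 4 x 4 kernel efficiencies quoted in the text, in the order
# (software, isolated unit, unit on diagonal PEs).
cholesky = {
    "GFLOPS/W": (1.95, 4.67, 5.75),
    "GFLOPS^2/W": (0.52, 4.95, 5.15),
    "GFLOPS/mm^2": (0.03, 0.06, 0.07),
}


def gains_over_software(values):
    """Return (isolated / software, diagonal / software) for one metric."""
    sw, isolated, diagonal = values
    return isolated / sw, diagonal / sw


for metric, values in cholesky.items():
    iso_gain, diag_gain = gains_over_software(values)
    print(f"{metric}: isolated {iso_gain:.2f}x, diagonal {diag_gain:.2f}x")
```

On these numbers the roughly tenfold gain appears in the inverse energy-delay metric (GFLOPS^2/W), while power efficiency improves by about a factor of three.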
A.5. Summary
In this appendix, we proposed two modifications to the MAC unit design to decrease the complexity of factorization algorithms. We also showed how existing processing elements can be enhanced to perform special functions such as divide and square-root operations. To demonstrate the effectiveness of our proposed extensions, we applied them to the mapping of Cholesky, LU and QR factorizations onto such an improved architecture. Results show that our extensions significantly increase efficiency and performance.
Future work includes mapping and comparing big, tiled matrix factorization problems on the LAC, including its integration into a heterogeneous system architecture next to general-purpose CPUs and into heterogeneous shared-memory systems, which will allow the trade-offs between complexity and flexibility to be compared.
Appendix B. Core Level Extensions for Fast Fourier Transform
FFTs are fundamentally linked to the underlying mathematics of many areas of computational science. They are perhaps the most important single tool in “signal processing” and analysis, and play a fundamental role in indirect imaging technologies, such as synthetic aperture radar [24] and computerized tomographic imaging [67]. FFTs are a widely used tool for the fast solution of partial differential equations, and support fast algorithms for the multiplication of very large integers. Unlike GEMM, the FFT has a more modest number of computations per data element (this is one of the main reasons that it is “fast”), so the performance of FFT algorithms is typically limited by data motion requirements rather than by arithmetic computations. For both the GEMM and FFT algorithms, application-specific designs have been proposed that promise orders of magnitude improvements in power/area efficiency relative to general-purpose processors [92, 105]. However, each of these has been an isolated and dedicated design instance limited to one algorithm. With full-custom design increasingly becoming cost-prohibitive, there is a need for solutions that have enough flexibility to run a range of operations at the efficiency of full-custom designs. In this appendix, we analyze the similarities between the algorithms and show how one might transform an optimized GEMM core into an FFT core. We consider whether a combined core that can perform either operation efficiently is practical, and analyze the loss in efficiency required to achieve this flexibility.
We begin by exploring FFT algorithms that may be suitable for the baseline LAC architecture. After evaluating LAC limitations and trade-offs for possible solutions, we introduce an “FFT core” that we have optimized for FFTs over a wide range of vector lengths. While optimized for performing FFTs, this core is based on a minimal set of modifications to the existing LAC architecture. We then exploit the similarities between the original LAC and the FFT-optimized design to introduce a flexible, hybrid design that can perform both of these applications efficiently. Comparing both full-custom designs with our proposed hybrid core, we demonstrate the costs of flexibility versus efficiency.
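The contrast in computations per data element can be made concrete with a small Python calculation. It uses two conventional operation counts that are assumptions here rather than figures from this appendix: 2n^3 floating-point operations for an n × n GEMM over three n × n matrices, and the customary 5N log2(N) estimate for a radix-2 complex FFT of N points. The function names are hypothetical.

```python
import math


def gemm_flops_per_element(n):
    """Flops per matrix element for C += A * B with n x n matrices.

    Assumes 2*n^3 flops over the 3*n^2 elements of A, B and C.
    """
    return 2 * n**3 / (3 * n**2)


def fft_flops_per_element(n_points):
    """Flops per complex point for an N-point FFT.

    Assumes the conventional 5*N*log2(N) radix-2 operation count.
    """
    return 5 * n_points * math.log2(n_points) / n_points


# A 1024 x 1024 GEMM and a 2^20-point FFT both touch on the order of a
# million data elements, but the work per element differs widely.
print(gemm_flops_per_element(1024))    # grows linearly with n
print(fft_flops_per_element(2**20))    # grows only with log2(N)
```

Under these assumptions the work per element grows linearly with n for GEMM but only logarithmically with N for the FFT, which is why data motion rather than arithmetic tends to bound FFT performance.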
B.1. Related Work
The literature related to fixed-point FFT hardware in the digital signal processing domain is immense. Literature reviews of hardware implementations date back to 1969 [16], only four years after the publication of the foundational Cooley-Tukey algorithm [29].
The literature related to floating-point FFT hardware is considerably sparser, especially for double-precision implementations. Important recent work includes the automatic generation of hardware FFT designs from high-level specifications [92]. These hardware designs can be used in either ASIC or FPGA implementations [27], but the published double-precision results for these designs are currently limited to FPGAs [9]. Hemmert and Underwood [56] provide performance comparisons between CPU and FPGA implementations of double-precision FFTs, and include projections of anticipated performance. Finally, a broad survey of the power, performance, and area characteristics of single-precision FFT performance on general-purpose processors, GPUs, FPGAs and ASICs is provided by Chung [27].
Performance of FFT algorithms varies dramatically across hardware platforms and software implementations, depending largely on the effort expended on optimizing data motion. General-purpose, microprocessor-based systems typically deliver poor performance, even with highly optimized implementations, because the power-of-2 strides of the FFT algorithms interact badly with set-associative caches, with set-associative address translation mechanisms, and with power-of-2-banked memory subsystems.
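The interaction between power-of-2 strides and set-associative caches can be illustrated with a small Python model of the set index computation. The cache geometry used here (64-byte lines, 64 sets, as in a 32 KiB 8-way cache) is a hypothetical example, not a parameter of any system evaluated in this work, and the function name `sets_touched` is illustrative.

```python
def sets_touched(stride_bytes, accesses, line_bytes=64, num_sets=64):
    """Count the distinct cache sets hit by a strided access sequence.

    Models the usual indexing scheme: set = (address // line size) mod
    number of sets. The default geometry is a hypothetical example.
    """
    sets = set()
    for i in range(accesses):
        address = i * stride_bytes
        sets.add((address // line_bytes) % num_sets)
    return len(sets)


# A stride of 4096 bytes (512 doubles) maps every access to the same set,
# so only as many lines as the associativity can stay resident.
print(sets_touched(4096, 64))        # 1
# Padding the stride by one cache line spreads the accesses over all sets.
print(sets_touched(4096 + 64, 64))   # 64
```

With the power-of-2 stride, 64 accesses compete for the few ways of a single set, while the padded stride uses all 64 sets. The same aliasing argument applies to set-associative address translation structures and to power-of-2-banked memories.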
We compare the performance, area, and power of our proposed designs with a sampling of floating-point FFT performance results on general-purpose processors, specialized computational accelerators, and GPUs.