[Figure 4.1: block diagram showing the host system, CMOS host interface, on-chip bus, main controller, FIFO control, PE control, Level 1 FIFOs, the PE array, and the Level 2 data memory. Insets show the DWM device (domain wall, notch, shift operation, write and read ports) and bit-cell signal lines (RWL, WWL, BL, /BL, SWL, /SWL).]
Fig. 4.1.: Many-core RM processor design
In this section, we describe the many-core RM processor design using STT-MRAM and DWM. Figure 4.1 shows the architecture of the many-core RM processor. The processor is intended as a programmable accelerator for RM applications; it is therefore specialized to efficiently execute the computational kernels that dominate these workloads. The RM processor consists of a 2-dimensional array of processing elements (PEs), two arrays of FIFOs, and a data memory. The processor implements a set of vector operations that are executed by streaming data from the horizontal and vertical FIFOs through the PE array. An on-chip bus interconnects the two levels of memory and also interfaces to the host system (e.g., SoC or server) in which the RM processor resides. The host processor transfers data into the data memory, downloads the program to be executed on the RM processor, initiates execution by writing to a special memory-mapped register, and transfers the results back into the host memory.
We next discuss the two-level memory hierarchy used in the RM processor. The FIFOs, which represent the first level in the memory hierarchy, provide fast streaming access to data. The second level in the memory hierarchy is the data memory, which is much larger so that it can store sizable parts of the data set being processed. In the baseline CMOS design, the FIFOs and data memory (64KB and 2MB, respectively) represent a significant portion (roughly 75%) of the total chip area. Moreover, the leakage power of these memories contributes substantially to the total energy consumption of the RM processor.
The access characteristics of the first and second level memories in the RM processor need to be considered in order to determine the choice of memory technology. The first level memory is filled by transferring data from the second level memory over the on-chip bus in large bursts. Then, data is read out to the PE array in a streaming manner (i.e., the elements are read in the order in which they were stored, one per clock cycle) for processing. In RM algorithms, it is common to require vector operations (e.g., dot product or Euclidean distance computation) between two large sets of vectors. To maximize data reuse, the vectors in one set of FIFOs are kept unchanged (e.g., support vectors in the case of SVM), while the vectors in the other set of FIFOs are replaced (e.g., training or classification data in the case of SVM). Thus, the streaming read operation is more common than the write operation. As described earlier, DWMs are tailor-made for streaming reads of data. Therefore, we chose to use them to implement the first level memory in the RM processor.
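The reuse pattern described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the processor's actual dataflow: it assumes that PE (i, j) is fed by horizontal FIFO i and vertical FIFO j, that every FIFO emits one element per cycle, and that the kernel is a dot product. The function names `stream_dot_products` and `run_batches` are hypothetical.

```python
def stream_dot_products(horizontal, vertical):
    """Dot product of every horizontal-FIFO vector with every vertical-FIFO vector.

    Assumption: PE (i, j) is fed by horizontal FIFO i and vertical FIFO j,
    and each FIFO emits one element per cycle in stored order.
    """
    m, n = len(horizontal), len(vertical)
    length = len(horizontal[0])
    acc = [[0] * n for _ in range(m)]        # one accumulator per PE
    for cycle in range(length):              # one element per FIFO per cycle
        for i in range(m):
            for j in range(n):
                acc[i][j] += horizontal[i][cycle] * vertical[j][cycle]
    return acc


def run_batches(resident, batches):
    """Keep `resident` in one FIFO set and replace the other set for each batch.

    Returns the per-batch results plus element-level FIFO read and write counts.
    """
    length = len(resident[0])
    writes = len(resident) * length          # the resident set is written once
    reads = 0
    results = []
    for batch in batches:
        writes += len(batch) * length        # only the other FIFO set is refilled
        reads += (len(resident) + len(batch)) * length  # every FIFO is streamed
        results.append(stream_dot_products(resident, batch))
    return results, reads, writes


results, reads, writes = run_batches(
    resident=[[1, 2], [3, 4]],
    batches=[[[1, 0], [0, 1]], [[2, 2], [1, 1]]],
)
```

In this model the resident vectors are written once but streamed for every batch, so the read count grows faster than the write count as more batches are processed.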
Table 4.1.: Second level memory access characteristics for SVM and k-means
Algorithm                          SVM             k-means
No. of memory reads (in bytes)     1.02 × 10^10    5.6 × 10^7
No. of memory writes (in bytes)    6.09 × 10^7     4.63 × 10^5
The second level memory in the RM processor needs to be randomly accessed with low latency (to minimize the performance impact of data transfers to the first level memory), making DWMs less suitable. The organization of the data memory should also support a wide interface (in the baseline CMOS design, a 256-bit on-chip bus is used to connect the two levels of memory). If we consider the nature of accesses to the second level memory (shown in Table 4.1 for the SVM and k-means algorithms), we can see that the number of bytes read exceeds the number of bytes written by two orders of magnitude. Finally, the leakage power of the second level memory is a major contributor to the energy consumption of the RM processor. Based on these considerations, we conclude that STT-MRAM, which has very high density and low leakage power compared to traditional CMOS-based memories while preserving fast random access capability, is a good choice for the second level memory in the RM processor. The highly read-intensive nature of the memory accesses implies that the penalty of inefficient writes into the STT-MRAM is incurred quite infrequently.
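The read-to-write ratio quoted above can be checked directly from the Table 4.1 values. The short sketch below only restates that arithmetic; the dictionary layout and the name `read_write_ratio` are illustrative.

```python
import math

# Values copied from Table 4.1 (bytes read and written at the second level memory).
accesses = {
    "SVM": {"reads": 1.02e10, "writes": 6.09e7},
    "k-means": {"reads": 5.6e7, "writes": 4.63e5},
}


def read_write_ratio(counts):
    """Bytes read per byte written."""
    return counts["reads"] / counts["writes"]


ratios = {name: read_write_ratio(counts) for name, counts in accesses.items()}
for name, ratio in ratios.items():
    print(f"{name}: reads/writes = {ratio:.0f} (log10 = {math.log10(ratio):.2f})")
```

The ratios come out at roughly 167 for SVM and 121 for k-means, i.e., a little over two orders of magnitude in both cases.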
Simply performing a drop-in replacement of the CMOS memories with STT-MRAM and DWM memories may lead to improvements over the baseline CMOS design, but falls far short of the goal of fully utilizing the potential offered by spin-based memories. To obtain the greatest benefit, we need to reinvest the area savings to increase the number of PEs and/or the on-chip memory, considering the various circuit/architecture tradeoffs involved in RM processor design. Figure 4.2 presents a qualitative summary of the impact of tuning different design parameters on area, performance and various components of energy consumption. Note that tuning a parameter may improve certain design metrics while degrading others. We next discuss the architectural parameters and their associated tradeoffs in greater detail.
[Figure 4.2: qualitative matrix. Parameters tuned: No. of PEs and FIFOs, Voltage Scaling of PEs, FIFO Depth. Metrics: Performance, PE Dynamic Energy, PE Leakage Energy, Level-1 Memory Dynamic Energy, Level-2 Memory Dynamic Energy, Level-1 Memory Leakage Energy, Level-2 Memory Leakage Energy, Area.]
Fig. 4.2.: Impact of tuning architectural parameters on the RM processor characteristics
• Number of PEs/FIFOs: A parameter that has a first-order impact on the performance and energy consumption of the RM processor is the number of PEs. The area saved by using high-density memory can be used to increase the number of PEs. In this way, we can take advantage of the inherent parallelism in the application and improve the performance of the system. Note that an increase in the number of PEs must be accompanied by a corresponding increase in the number of FIFOs in our architecture (an m × n array of PEs requires m + n FIFOs). When we consider the impact of increasing the number of PEs/FIFOs on the energy consumption, we see a reduction in the energy consumed by the level-1 and level-2 memories, while the energy consumed by the PEs increases. This can be explained as follows: (i) The improvement in performance reduces the leakage energy consumed by the memories. The leakage energy contributed by a single PE also decreases due to the improvement in performance. However, the total number of PEs increases, and the performance improvement is not proportional to the increase in the number of PEs. Therefore, the overall leakage energy consumed by all the PEs increases. (ii) Increasing the number of PEs increases the number of computations performed per memory access, thereby reducing the number of memory accesses required for executing an application. This reduces the dynamic energy consumed by the memories. (iii) Increasing the number of PEs also increases the number of idle PEs waiting for data. This results in increased dynamic energy consumption of the PEs.
• FIFO depth: A deeper FIFO increases data reuse, thereby improving system performance. This improvement in performance reduces the leakage energy consumption of the PEs and memory. In addition, due to the increased data reuse, the number of write operations to the FIFOs and the number of read operations from the data memory decrease, thereby reducing the dynamic energy consumed by the memories. However, the leakage energy of the FIFOs would increase due to the larger FIFO size. Note that this overhead is significant in the case of a CMOS-based design, but is negligible with DWM due to its inherent near-zero leakage. On the other hand, a very large FIFO would increase the energy required for shifting data in the DWM. As a result, the energy required for every read/write operation would increase. Therefore, we need to choose an appropriate FIFO depth considering all of the above factors.
• Supply voltage of PEs: Scaling the supply voltage of the PEs can be used to trade off energy against performance. While scaling the supply voltage of the PEs leads to energy savings in the PEs, it also degrades performance. This degradation in performance increases the leakage energy consumed by the memory. As a result, the total system energy consumption could increase if the supply voltage is scaled beyond a certain point.
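Two of the tradeoffs listed above can be made concrete with a toy numerical model. This is only a sketch: every constant is hypothetical and in arbitrary units, speedup is assumed to grow sublinearly as the number of PEs raised to a power `alpha` < 1, PE dynamic energy is assumed to scale with the square of the supply voltage, and delay is assumed to follow an alpha-power-law form v / (v - vt)^a. None of these models or values come from the design described here.

```python
def fifo_count(m, n):
    """FIFOs needed by an m x n PE array (m + n, as stated in the text)."""
    return m + n


def pe_count_effects(num_pes, alpha=0.7, pe_leak_power=1.0, mem_leak_power=4.0):
    """Leakage energies for a workload that takes 1 time unit on a single PE.

    Assumption: speedup grows as num_pes ** alpha with alpha < 1 (sublinear).
    """
    runtime = 1.0 / num_pes ** alpha
    pe_leakage = num_pes * pe_leak_power * runtime   # more PEs, each leaking for less time
    mem_leakage = mem_leak_power * runtime           # fixed-size memory leaks for less time
    return pe_leakage, mem_leakage


def total_energy_vs_voltage(v, vt=0.3, a=1.3, mem_leak_energy_nominal=0.5):
    """PE dynamic energy plus memory leakage energy at PE supply voltage v.

    Normalized so that v = 1.0 gives PE dynamic energy 1.0 and runtime 1.0.
    Assumptions: dynamic energy ~ v^2 and delay ~ v / (v - vt)^a.
    """
    delay = (v / (v - vt) ** a) / (1.0 / (1.0 - vt) ** a)
    return v ** 2 + mem_leak_energy_nominal * delay


# Sweep the PE supply voltage from 0.40 to 1.00 in steps of 0.05.
voltages = [round(0.40 + 0.05 * k, 2) for k in range(13)]
best_v = min(voltages, key=total_energy_vs_voltage)

for pes in (4, 16, 64):
    pe_leak, mem_leak = pe_count_effects(pes)
    print(f"{pes:3d} PEs: PE leakage {pe_leak:.2f}, memory leakage {mem_leak:.2f}")
print(f"lowest total energy in the sweep at v = {best_v}")
```

Under these assumptions, total PE leakage energy rises with the PE count while memory leakage energy falls, and the voltage sweep has its minimum strictly inside the range rather than at the lowest voltage, which matches the qualitative behavior described in the bullets.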
Therefore, in order to design an optimal RM processor, we need to consider the complex interactions between data memory, FIFOs and PEs and perform a systematic design space exploration considering the architectural tradeoffs described above.
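One simple way to organize such an exploration is an exhaustive sweep over a parameter grid. The sketch below is a minimal illustration, assuming the objective is lowest energy under an area budget; the function `explore`, the stand-in model `toy_evaluate`, and all parameter values are hypothetical and would be replaced by real area and energy models.

```python
from itertools import product


def explore(grid, evaluate, area_budget):
    """Return (point, energy) for the lowest-energy feasible point, or None.

    `grid` maps parameter names to candidate values; `evaluate` is a
    caller-supplied model returning (area, energy) for one design point.
    """
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        point = dict(zip(names, values))
        area, energy = evaluate(**point)
        if area > area_budget:
            continue                         # design point does not fit the budget
        if best is None or energy < best[1]:
            best = (point, energy)
    return best


# Hypothetical stand-in for a real area/energy model; the numbers carry no meaning.
def toy_evaluate(num_pes, fifo_depth, vdd):
    area = num_pes + 0.01 * fifo_depth
    energy = 100.0 / num_pes + 0.001 * fifo_depth + vdd ** 2
    return area, energy


grid = {"num_pes": [16, 64, 256], "fifo_depth": [256, 1024], "vdd": [0.7, 1.0]}
best = explore(grid, toy_evaluate, area_budget=100.0)
```

An exhaustive sweep is only practical for small grids; it is used here because it makes the interaction between the area constraint and the energy objective easy to see.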