• No results found

CHAPTER 4 P RELIMINARY R ESULTS

4.3 IMPACT OF LOOP AND ARRAY ACCESS ORDERS

In order to evaluate the performance of the memory access orders, we implemented a 3D 6-point stencil loop on the Convey HC-1. Figure 4.1 shows the stencil loop kernel and Figure 4.2 shows the block diagram of the design for one PE.

The order of address generation affects the memory access pattern, which in turn affects memory performance. We define two memory access orders, the loop order and the

for i=1 to x-2 do

for j=1 to y-2 do

for k=1 to z-2 do

S[i, j, k] = A[i-1, j, k] + A[i+1, j, k] + A[i, j-1, k] + A[i, j+1, k] + A[i, j, k-1] + A[i, j,k+1]; endfor

endfor endfor

array order. In the loop order, all inputs for each loop iteration are requested before proceeding to request the inputs required by next iteration. The ordering leads to nonconsecutive accesses within each loop iteration. It is not necessary to request all inputs of one iteration before requesting all inputs of the next iteration. In the array order, all

inputs belonging to each array are requested as needed by the iteration range covered by the buffer before moving to the next array. The ordering leads to non-consecutive access

Table 4.2 Memory interface stall rate using double buffering

Buffer size Interface stall rate

256 20.7% 512 9.3% 1K 6.2% 2K 10.8% 3K 16.1% 4K 20.5%

Figure 4.2 Block diagram of the 6-point 3D stencil for each PE

Convey Reorder Buffer Memory Controller Interface FIFO A FIFO B FIFO C FIFO D FIFO E FIFO F Kernel Controller and address generator

for each array at the boundary of iteration range. The size of the iteration range is also referred to as block size.

The address generator produces the memory addresses in either the loop or array order. Convey’s read reorder buffer receives data requests and returns the data to the kernel through the input FIFOs. Each of six input FIFOs corresponds to one of the six elements in the 6-point stencil. When any FIFO is almost full, the read reorder buffer will stop sending data to the FIFO. When all the FIFOs are not empty, the kernel accumulates six FIFO output data in parallel. The accumulation result is then written to the memory through the memory interface.

The design is implemented using only two PEs accessing just one memory controller. Table 4.3 shows DRAM controller efficiency is 67% for the loop order and 90% for the array order. It is expected that the efficiency will be lower when 64 PEs of four AEs run in parallel. In the loop order, we request six elements in the innermost loop of the stencil code before proceeding to request the elements of the next loop iteration. In the array order, we request an array of elements of 85 iterations before proceeding to request next array of elements of 85 iterations. It is assumed that when block size is 1 the array order reduces to the loop order. Note this specific block size of 85 is only introduced as part of the

Table 4.3 DRAM controller efficiency of 6-point 3D stencil

Request Order Loop Array

Block size 1 85

preliminary 3D 6-point stencil result. The experimental result in Chapter 6 will take block size as a parameter for studying the impact of block size on the memory performance. 4.4 DATA REUSE ANALYSIS OF 3D STENCIL

In the stencil code shown in Figure 4.1, six references to the array A are located in the innermost loop of the nested loops. The references are R0 (A[i-1, j, k]), R1(A[i+1, j, k]), R2(A[i, j-1, k]), R3(A[i, j+1, k]), R4(A[i, j, k-1]) and R5(A[i, j, k+1]). An array element accessed by one reference in an iteration may be accessed again by other references in other iterations. For example, the array element accessed by reference R3 in iteration (i, j+1, k) can be reused by reference R2 in iteration (i, j+2, k).

As introduced in Chapter 2, data reuse is realized using on-chip buffers. If the size of on-chip buffers is large enough to store all elements of array A after requesting them from the memory, then all loop iterations can access the elements from the on-chip buffers, achieving maximum data reuse. If data reuse is not enabled, then each iteration has to access six elements from the memory to produce an output element. Since the size of on- chip buffers is limited, some redundant accesses are required to request the same elements that have been requested before.

We have built a FIFO-based model to analyze data reuse rate using different sizes of on-chip buffers and stencil space. The model counts the total number of reused memory references in the 3D stencil kernel loops and then divide it by the total number of memory references without data reuse to compute data reuse rate. When the FIFO buffer cannot hold a new memory reference due to the limited size, the oldest reference in the buffer is removed to provide space for storing the new reference.

Table 4.4 and Table 4.5 show data reuse rate depends on the sizes of buffer and stencil space. The first column of the table lists the sizes of the N-cube stencil space and the first row the sizes of the buffer. The sizes of the buffer range from 16 to 4096 while the sizes of the stencil space from 53 (N=5) to 5123 (N=512). The last column of the table lists maximum data reuse rate for each N-cube stencil space when the buffer size is equal to N3.

Table 4.4 lists the results of 6-point 3D stencil. When N ≥ 256 data reuse rate stays at about 16% when the buffer size increases to 4096. This means the buffer is still too small to accommodate more references that can be reused for very large stencil space. When N ≤ 32, data reuse rate is at maximum when BS ≤ 4096. Notice a buffer as small as 16-entry deep can achieve the same data reuse rate as a much larger buffer for large 3D stencils. As shown in the last column, maximum data reuse rate increases monotonically from N = 16

Table 4.4 Data reuse rate of 6-point 3D stencil

16 32 64 128 256 512 1024 4096 N3 5 0.148 0.302 0.500 0.500 0.500 0.500 0.500 0.500 0.500 10 0.125 0.125 0.411 0.417 0.69 0.708 0.708 0.708 0.708 16 0.143 0.143 0.154 0.452 0.452 0.452 0.452 0.762 0.762 32 0.156 0.156 0.156 0.156 0.478 0.478 0.478 0.800 0.800 64 0.161 0.161 0.161 0.161 0.161 0.489 0.489 0.489 0.817 128 0.164 0.164 0.164 0.164 0.164 0.164 0.495 0.495 0.825 256 0.165 0.165 0.165 0.165 0.165 0.165 0.165 0.165 0.829 512 0.166 0.166 0.166 0.166 0.166 0.166 0.166 0.166 0.831

Table 4.5 Data reuse rate of 27-point 3D stencil

16 32 64 128 256 512 1024 4096 N3 5 0.198 0.442 0.689 0.829 0.829 0.829 0.829 0.829 0.829 10 0.259 0.579 0.583 0.826 0.826 0.928 0.928 0.928 0.928 16 0.275 0.616 0.619 0.619 0.855 0.855 0.945 0.945 0.945 32 0.286 0.643 0.644 0.644 0.644 0.874 0.874 0.955 0.955 64 0.292 0.655 0.655 0.655 0.655 0.655 0.882 0.882 0.959 128 0.294 0.661 0.661 0.661 0.661 0.661 0.661 0.885 0.961 256 0.295 0.663 0.664 0.664 0.664 0.664 0.664 0.664 0.962 512 0.296 0.665 0.665 0.665 0.665 0.665 0.665 0.665 0.963

to N = 512. The maximum rates are very close from N = 256 to N = 512.The highest data reuse rate is about 83% when N = 512.

Table 4.5 lists the results of 27-point 3D stencil. When N ≥ 256 data reuse rate stays at about 66% when the buffer size increases to 4096. Compared to 16% data reuse rate in 6- point 3D stencil, the rate is much higher, but the buffer is still too small to accommodate more references that can be reused. When N ≤ 32, data reuse rate is at maximum when BS ≤ 4096. Notice a 32-entry buffer increases more than twice data reuse rate compared to a 16-entry buffer. Maximum data reuse rate also increases monotonically. The maximum

rates are very close from N = 128 to N = 512.The highest data reuse rate is about 96%

when N = 512.

The results of using different sizes of buffer and stencil space provide an upper bound to the data reuse rate in a 3D stencil. In reality, there are many PEs on a multi-FPGA platform and each PE is assigned a block of stencil space to achieve high parallelism. The buffer size is also highly constrained by FPGA memory resources and timing requirement. As shown in Table 4.4 and 4.5, when the buffer size is 512, N = 10 has the highest data reuse rates.

Related documents