
5.1 Previous Work and Motivation

5.2.2 Problem Formulation and Solution

The problem of finding the best bit ordering on the serial interconnect for SA can be formulated as a graph problem. A completely connected undirected graph G(V, E, EW) is constructed, where V = {v_i} is the set of vertices, E = {e_ij} is the set of edges, and EW = {ew_ij} is the set of edge weights. The vertices correspond to the bit positions in each word. The weight of an edge denotes the probability that the values on the two vertices it connects, i.e., two bits of the parallel word, are complementary to each other:

$$ew_{ij} = \Pr\{[(v_i = 0) \land (v_j = 1)] \lor [(v_i = 1) \land (v_j = 0)]\} \tag{5.1}$$
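The edge weights of Eqn. (5.1) can be estimated from a trace by counting, for each pair of bit positions, the fraction of words in which the two bits differ. The following is a minimal sketch under stated assumptions: the function name `edge_weights`, the representation of words as Python integers with bit 0 as the least significant bit, and the 4-bit sample trace are all illustrative and do not come from the thesis.

```python
from itertools import combinations


def edge_weights(words, n_bits):
    """Estimate ew[i][j] = Pr{bit i != bit j} over a list of integer words."""
    if not words:
        raise ValueError("need at least one word")
    ew = [[0.0] * n_bits for _ in range(n_bits)]
    for i, j in combinations(range(n_bits), 2):
        # Count the words in which bits i and j take complementary values.
        differ = sum(((w >> i) ^ (w >> j)) & 1 for w in words)
        # The graph is undirected, so the weight matrix is symmetric.
        ew[i][j] = ew[j][i] = differ / len(words)
    return ew


# Hypothetical 4-bit trace (not the data behind Figure 5.2).
trace = [0b0011, 0b0011, 0b1100, 0b1111]
weights = edge_weights(trace, 4)
```

In this sample trace, bits 0 and 1 always carry the same value, so the edge between them has weight zero and placing them next to each other on the serial wire costs nothing under this model.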

A particular bit ordering on the serial wire can be denoted by a path in the graph that starts and ends at the same vertex and visits every vertex exactly once. The cost of a path, which is equivalent to the SA on the serial wire, is simply the sum of the weights of the edges along that path. To illustrate this, a graph is constructed for the example of Figure 5.2 (a), with the edge weights determined by statistical analysis of the data words on the parallel bus. The resulting graph is shown in Figure 5.4.

The path with the minimum cost for the graph of the example of Figure 5.2 is shown by dotted lines in Figure 5.4. The bit ordering obtained from our solution technique is [b1, b2, b4, b3], which is the same as that obtained earlier in Figure 5.2 (b) by simple inspection.

Figure 5.4: Graph for the Example of Figure 5.2 (a) (legend: Hamilton cycle)

To minimize the average SA, we have to find an optimal path in the graph, which is equivalent to finding the minimum-cost Hamilton cycle in the graph. This is the well-known Traveling Salesman Problem (TSP). Since the size of the TSP to be solved is relatively small (the number of vertices is the same as the number of bits to be ordered on the serial interconnect), an exact TSP algorithm can be used. The TSP solver chosen for our purpose is based on Branch and Bound [14].
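To make the search concrete, the sketch below finds a minimum-cost Hamilton cycle by depth-first branch and bound. It is an illustration only, not the solver of [14]: the bound used here is the simplest possible one (the cost of the partial path so far), and the function name `min_hamilton_cycle` and the 4-vertex weight matrix are hypothetical, not the weights of Figure 5.4.

```python
import math


def min_hamilton_cycle(ew):
    """Return (cost, order) of a minimum-cost Hamilton cycle in a complete graph."""
    n = len(ew)
    best_cost, best_order = math.inf, None

    def extend(order, visited, cost):
        nonlocal best_cost, best_order
        # Bound: edge weights are non-negative, so a partial path that already
        # costs at least as much as the best cycle found cannot improve on it.
        if cost >= best_cost:
            return
        if len(order) == n:
            total = cost + ew[order[-1]][order[0]]  # close the cycle
            if total < best_cost:
                best_cost, best_order = total, order[:]
            return
        last = order[-1]
        # Branch: try the cheapest unvisited neighbours first.
        for nxt in sorted(range(n), key=lambda v: ew[last][v]):
            if nxt not in visited:
                visited.add(nxt)
                order.append(nxt)
                extend(order, visited, cost + ew[last][nxt])
                order.pop()
                visited.remove(nxt)

    # A cycle can be entered at any vertex, so the start is fixed at vertex 0.
    extend([0], {0}, 0.0)
    return best_cost, best_order


# Hypothetical symmetric weight matrix for a 4-bit word.
example_ew = [
    [0.0, 0.125, 0.5, 0.625],
    [0.125, 0.0, 0.375, 0.25],
    [0.5, 0.375, 0.0, 0.25],
    [0.625, 0.25, 0.25, 0.0],
]
cost, order = min_hamilton_cycle(example_ew)
```

The returned `order` lists bit positions in the sequence in which they would be placed on the serial wire. A production solver would use a tighter lower bound, but for word sizes in the range considered here the structure of the search is the same.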

5.3 Experimental Procedure

To study the effectiveness of the N:1 bit ordering scheme, we considered traces obtained from a SimpleScalar PISA processor simulator [39] executing SPEC2000 integer programs. We extracted traces for the instruction cache address bus (N=32) and the instruction bus (N=32) of the processor. The traces obtained from the instruction address bus represent data with uniform statistics because consecutive addresses are sequential, whereas the traces obtained from the instruction bus have non-uniform statistics.

For a chosen benchmark program, a sample of 10 million traces from the instruction address bus was considered for analysis. The complete statistical characteristics of the sample traces were obtained by customized pre-profiling of the traces. These characteristics were used to determine the edge weights of the graph. The bit ordering that minimized the SA on the serial interconnect was obtained using the TSP algorithm, and the average SA was calculated for the sample traces. The percentage reduction in SA was obtained by comparing it with the case when random bit ordering was used (i.e., the bits of the parallel word were serialized based on their location). This process was repeated for the instruction bus for the chosen benchmark.
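The measurement step can be sketched as follows: serialize every word in a given bit order, count the toggles on the serial wire, and compare the optimized ordering against the baseline in which bits are sent by location. This is a sketch under assumptions not stated in the text: the function names are hypothetical, words are Python integers with bit 0 as the least significant bit, and toggles across word boundaries are counted as well.

```python
def serial_transitions(words, order):
    """Count bit toggles on the serial wire when each word is sent in `order`."""
    toggles, prev = 0, None
    for w in words:
        for pos in order:
            bit = (w >> pos) & 1
            # A toggle occurs whenever the wire value changes, including
            # between the last bit of one word and the first bit of the next.
            if prev is not None and bit != prev:
                toggles += 1
            prev = bit
    return toggles


def percent_reduction(words, baseline_order, optimized_order):
    """Percentage reduction in toggles of the optimized order over the baseline."""
    base = serial_transitions(words, baseline_order)
    opt = serial_transitions(words, optimized_order)
    return 100.0 * (base - opt) / base if base else 0.0


# Hypothetical 4-bit trace; the baseline sends bits by location (0, 1, 2, 3).
sample = [0b0101, 0b0101, 0b1010, 0b0101]
reduction = percent_reduction(sample, [0, 1, 2, 3], [0, 2, 1, 3])
```

In this sample, bits 0 and 2 always agree, as do bits 1 and 3, so grouping each pair together removes most of the toggles inside a word.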

In the same way, we obtained the average percentage SA reduction for all the benchmarks for the instruction address bus and the instruction bus. The results are plotted in Figure 5.5. Averaged across the integer benchmarks, bit ordering on the serial bus using our approach gives a 55% reduction in average SA for the instruction cache address bus and a 30% reduction for the instruction bus. Based on these results, our technique is quite effective in reducing average SA on the serial interconnect for traces with uniform or non-uniform statistics, provided that complete statistical knowledge is available in advance.

Figure 5.5: Percentage Reduction in Switching Activity using Bit Ordering (x-axis: integer benchmarks; legend: Instruction Address, Instruction)

Further, we compare our technique with the SILENT scheme of [2]. From the results in Figure 5.6, it can be seen that SILENT outperforms our technique for the instruction address bus. Our approach obtains a 50% reduction on average, whereas the SILENT approach is in the range of 85%. This is because SILENT exploits the sequentiality of the address traces, which provides a greater reduction in the average SA than our bit ordering technique.

For the instruction bus, our technique is much better than SILENT, as seen from the results in Figure 5.7. Our technique consistently reduced the average SA across the various benchmarks because a customized bit ordering for each individual benchmark was obtained by preliminary profiling of the bus traces. On average, a 40% reduction is obtained. Since the traces are non-uniform, the SILENT scheme based on differential coding is not effective and, in fact, SA increased by 15%, and by 50% in the worst case.

Figure 5.6: Comparison of the Bit Ordering Technique with SILENT (Instruction Address Bus) (x-axis: integer benchmarks)

Figure 5.7: Comparison of the Bit Ordering Technique with SILENT (Instruction Bus) (x-axis: integer benchmarks; legend: Bit Ordering, SILENT)

5.4 Summary

On the basis of the above observations, we find that the SILENT scheme is preferable to our technique for address buses because there is sequentiality in the traces associated with such buses. For the instruction bus, however, our bit ordering technique is a better choice for SA reduction. In fact, the ordering can be accomplished without the bus traces since the instruction set is finite. While complete statistical information would provide a better result, we believe that knowing the bit patterns in the instruction set is sufficient to obtain good results.

As described earlier in the thesis, the practical implementation of the serialized link requires n^p bits to be serialized and transmitted on n^s lines. This implies that the n^p bits have to be divided into n^s groups of N bits each, where N is given by Eqn. (4.1). The average SA reduction technique would then be applied to each group of N bits before it is serially transmitted on the interconnect. The choice of the SA reduction technique depends upon the statistical characteristics of the traces or the parallel words to be serialized. SILENT is better suited to buses with more uniform traces. Our technique based on bit ordering would be beneficial for data patterns with both uniform and non-uniform statistics.

Chapter 6

In document Design of a Serialized Link for On-Chip Global Communication (Page 89-95)