Beyond Tigon-2: Improving Small-Frame Throughput

3.5 Evaluating Modified Architectures

3.5.2 Beyond Tigon-2: Improving Small-Frame Throughput

A noteworthy result not shown in Figure 3.6 is that increasing memory bandwidth also increases frame throughput for traces of minimum-sized (18-byte datagram) frames, where presumably the only bottleneck is processing capacity. Includ- ing all frame headers, the memory throughput delivered for frame contents is only 0.164 Gb/s, while the realizable peak throughput is 5.12 Gb/s. However, the frame throughput for Tigon-2 with 250 MHz SRAM is 24% more than that of the original Tigon-2 Spinach model, which serves as a baseline configuration. Spinach enables the investigation of such non-intuitive results by giving the designer the ability to modify the system in ways that may isolate the source of the performance difference.

As frame rates increase with smaller frame sizes, contention for global memory among the processors and hardware assists increases. Processors must consume and inspect more transaction (DMA and MAC) and buffer descriptors per second, and assists must also process these descriptors at a higher rate. All of these descriptors are communicated through global shared SRAM. Increasing the bandwidth to this

SRAM increases the throughput of these descriptors and decreases the average latency to access these descriptors. With the original Tigon-2 architecture, the processor cores experience higher latencies when accessing such descriptors. This problem is exacerbated by the fact that the processor cores and assists do not feature any memory latency tolerance mechanisms, such as data caching. Furthermore, processors have the lowest memory arbitration priority among all requesting units.

Further modification of the Tigon-2 architecture using Spinach emphasizes this performance issue and gives information about the tradeoffs involved. One simple modification takes the baseline configuration and uses round-robin arbitration for memory accesses. Changing the arbitration parameter to the main memory arbiter fa- cilitates this modification. Using round-robin arbitration reduces the single-reference latency seen by the processors and assists. In an effort to reduce single-reference latencies, round-robin arbitration aggressively gives allocation to the next-arriving request and may terminate bursts. Though processor latency decreases, round-robin arbitration can reduce memory throughput significantly.

A more drastic modification is to segment the global memory space and isolate buffer and transaction descriptors from main memory. That is, all buffer and transaction descriptors are placed in a separate memory structure and accessed by a separate memory bus. The total amount of memory used by buffer and transaction descriptors is 92 KB, which could reasonably fit in one SRAM.

M a in M e m o r y M e m o r y C o n tr o lle r M e m o ry M a p p e d R e g is te rs M e m o r y A r b ite r D M A A s s is t D M A A s s is t P r o c e s s o r C o r e # 1 P r o c e s s o r C o r e # N S c ra tch p a d F ilte r S cra tch p a d F ilte r C a ch e C o n tro lle r C a ch e C on tro lle r C a ch e M e m o ry S cra tc h pa d M e m o ry C a ch e M em o ry S cra tch p a d M em o ry M e m o ry A rb ite r P C I M A C A s s is t M A C A s s is t P H Y S cra tch p a d F ilte r S c ra tch p a d F ilte r M e m o ry A rb ite r D e s c rip to r M e m o ry

Figure 3.7 The Tigon-2 architecture with separate descriptor memory.

In order to be able to provide descriptors to all the devices that need them, both the processors and assists must have access to this descriptor memory bus. Figure 3.7 depicts an implementation of this architecture using Spinach modules. The new bus and memory are shown in the upper right-hand side of the figure. The figure shows separate connection lines for each requesting device to represent one bus with several

devices requesting arbitration. Each processor core uses another scratch pad filter module instance to check request addresses to see if requests should be satisfied on the descriptor memory bus or by the higher levels of the memory hierarchy. Each assist has a new group of ports added to it to be able to make concurrent references to the descriptor memory bus. The address range of descriptors in the global address space is user configurable.

88 200 0 0.5 1 1.5 2 2.5 3 3.5

Processor Core Speed (MHz)

Speedup of UDP Throughput (18−byte Datagrams)

Base Tigon Base + RR Desc. Mem Desc. + RR

(a) Throughput Speedup

88 200 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

Processor Core Speed (MHz)

Processor Memory Stalls per Instruction (18−byte Datagram Trace)

Base Tigon Base + RR Desc. Mem Desc. + RR

(b) Processor Memory Stalls

Figure 3.8 Evaluating architecture modifications to increase small-frame performance.

Figure 3.8 evaluates the effectiveness of various NIC architecture modifications with respect to minimum-sized frame performance. All speedups in Figure 3.8 are with respect to the base Tigon-2 architecture using 88 MHz processors. The modifications evaluated include using round-robin memory arbitration (labeled “Base +

RR”), using dedicated memory for descriptors (labeled “Desc. Mem”), and a combi- nation of the two (labeled “Desc. + RR”). The baseline Tigon-2 architecture (“Base Tigon”) is presented as a reference. In order to show the effect of increased computa- tion resources with these new memory organizations, all configurations are evaluated with both 88 MHz and 200 MHz processor cores. For all configurations, the memory bus speed remains constant at 100 MHz.

Figure 3.8(a) shows that using the descriptor memory bus significantly improves small-frame throughput for both the 88 MHz and 200 MHz cases. Specifically, separate descriptor memories provide 47.5% and 34.0% throughput improvements, re- spectively. Figure 3.8(b) shows that adding the descriptor memory bus dramatically reduces the number of memory stalls per instruction, a change that provides increased processor core efficiency and increased frame-processing throughput.

Though the descriptor memory bus affects processor-memory latencies, it also decreases latencies experienced by the assist units. Thus, even though the overall instruction throughput increases by only 11.5% for the 88 MHz-“Desc. Mem” case, reduced assist latencies help improve overall frame throughput by the aforementioned 47.5%. This confirms that the base Tigon-2 configuration is not compute bound with minimum-sized frames. It confirms that assist performance with minimum- sized frames affects overall performance, since the instruction throughput increase alone does not account for the total frame throughput increase. Assists rely heavily

on descriptor throughput to complete per-transaction overheads such as fetching or storing transaction descriptors.

Though the 200 MHz-“Desc. Mem” case experiences a 25.7% increase in instruction throughput, the overall frame throughput increase of 34.0% over the base 200 MHz case is less than the increase of the 88 MHz-“Desc. Mem” configuration. As frame rates and overall accesses to descriptor memory increase, contention for descriptor memory increases and the speedup per descriptor access decreases. Thus, the per-transaction speedup contribution of the descriptor memory bus decreases with increased frame rates.

Adding round-robin arbitration does not usually increase throughput performance. The increased memory demands in the presence of decreased SRAM burst efficiency explains this relative decrease. As frame rates increase, main memory utilization increases. However, using round-robin arbitration reduces SRAM efficiency by more than 20% for all cases. This effect is particularly noticeable for the 200 MHz “Base + RR” case. Frame rate demands increase enough using 200 MHz cores that the 30% reduction in SRAM efficiency overwhelms any potential latency improvements, and round-robin arbitration decreases performance relative to the baseline 200 MHz case. In fact, the efficiency degradation combined with increased demands actually increases processor-memory latency for both 200 MHz round-robin cases, as shown in Figure 3.8(b).

In document RICE UNIVERSITY. Simulation-Driven Design of High-Performance Programmable Network Interface Cards by Paul Willmann (Page 73-79)