
4.7 TimeTrial Performance: Overhead and Impact

4.7.2 Performance of the Software Agent

To assess the overhead and impact that the software agent has on the software portion of an application, a micro-benchmark was developed in the form of a chained Auto-Pipe application. In this benchmark application, the Src block generates data, forwards it through a chain of 9 blocks, and the Sink block discards the data. The purpose of this application is to benchmark TimeTrial’s ability to measure multiple edges that communicate very rapidly. Figure 4.5 shows the topology of the application. Each block gets mapped to a single processor core. The application uses 11 cores of a 12-core AMD Opteron machine.

The 12th core is reserved for the TimeTrial SW agent, shown as TTA in the diagram. Each edge is tapped (shown as dashed lines in Figure 4.5) and two different measurements were calculated on all 10 taps.

Figure 4.5: Software agent micro-benchmark application. The source generates data and each block consumes and forwards the data as fast as it is able. Each block has its affinity set to a unique processor core.
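The topology described above can be written down as a short sketch. The block names (`B1` to `B9`), the core numbering, and the variable names are hypothetical and chosen only for illustration; the counts (11 blocks, 10 tapped edges, a 12th core for the agent, two metrics per tap) come from the text.

```python
# Illustrative model of the micro-benchmark topology (names are hypothetical).
blocks = ["Src"] + [f"B{i}" for i in range(1, 10)] + ["Sink"]  # 11 blocks

# One tapped edge between each pair of adjacent blocks in the chain.
edges = list(zip(blocks, blocks[1:]))  # 10 edges

# Each block is pinned to its own core; the agent takes the remaining core.
core_of = {name: core for core, name in enumerate(blocks)}  # cores 0..10
agent_core = len(blocks)  # core index 11, i.e. the 12th core

# Two metrics per tap when both rate and occupancy are requested.
metrics = ["rate", "occupancy"]
measurements = [(edge, metric) for edge in edges for metric in metrics]
```

With both metrics enabled on every tap, `len(measurements)` is 20, matching the total quoted for the combined experiment.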

To measure both the impact and the overhead, the micro-benchmark was run for 30 iterations with two different transfer sizes. Each transfer is a bulk transfer of either a 2048-element or an 8192-element array of 8-byte values. These sizes were chosen since they are efficient array sizes for transferring data in the Auto-Pipe system. The frame size was set to 1 second. Four experiments were performed for each transfer size. The first was a baseline with no instrumentation added. The second measured the mean rate on each edge. The third calculated a histogram of the queue occupancy on each edge. The final experiment calculated both the rate and the occupancy (20 total measurements). Figure 4.6 shows the overhead measurement results with respect to utilization of the software TimeTrial agent.

For the array size of 2048, just measuring the rate is a relatively low burden on the agent. Gathering a histogram of queue occupancy is a much more compute-intensive task. Both measurements push the utilization up to around 70%. For an array size of 8192, the utilization is much lower, topping out around 20%. We consider these acceptable overheads for detailed measurements.
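To make the difference between the two measurements concrete, the following is a minimal sketch of per-frame aggregation for a single edge. It is not TimeTrial's implementation: the event representation, the function name `aggregate_frame`, and the choice of a time-weighted histogram are all assumptions made for illustration. It shows that the rate needs only a count, while the histogram needs bookkeeping on every enqueue and dequeue event.

```python
from collections import Counter

def aggregate_frame(events, frame_start, frame_len=1.0, initial_occupancy=0):
    """Aggregate one frame of events for one edge (illustrative sketch).

    events: time-sorted (timestamp, delta) pairs inside the frame, where
            delta is +1 for an enqueue and -1 for a dequeue (assumed format).
    Returns (mean enqueue rate, time-weighted occupancy histogram).
    """
    frame_end = frame_start + frame_len
    occupancy = initial_occupancy
    last_t = frame_start
    time_at = Counter()  # occupancy level -> time spent at that level
    enqueues = 0
    for t, delta in events:
        time_at[occupancy] += t - last_t  # close the interval before this event
        occupancy += delta
        last_t = t
        if delta > 0:
            enqueues += 1
    time_at[occupancy] += frame_end - last_t  # tail of the frame
    rate = enqueues / frame_len
    histogram = {occ: dt / frame_len for occ, dt in sorted(time_at.items())}
    return rate, histogram
```

For example, the events `[(0.25, +1), (0.5, +1), (0.75, -1)]` in a 1-second frame starting at 0 give a rate of 2 enqueues per second, with the queue empty for a quarter of the frame, holding one item for half of it, and holding two items for the remaining quarter.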

[Figure 4.6 plot. Series: None, Rate, Occ, Occ+Rate; array sizes: 2048, 8192; axis: % Utilization, 0 to 100.]

Figure 4.6: Overhead of the measurements for the software agent measured by utilization of one processor core for two array transfer sizes. Error bars show one standard deviation.

To measure the impact, the same set of experiments was run and the throughput was logged for both transfer sizes. The results are shown in Figure 4.7. For a transfer size of 8192, there is almost no impact on the application performance. The smaller transfer size does have some impact: at most a 3.7% reduction in throughput, observed when measuring both the rate and the occupancy of all 10 edges in the application.
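The impact figure quoted above is a relative throughput reduction against the uninstrumented baseline. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the throughput values in the usage note are placeholders picked to illustrate a 3.7% reduction, not measured data from Figure 4.7.

```python
def transfer_bytes(elements, element_size=8):
    """Size in bytes of one bulk transfer of 8-byte elements."""
    return elements * element_size

def throughput_reduction_pct(baseline_mb_s, instrumented_mb_s):
    """Percent reduction in throughput relative to the baseline run."""
    if baseline_mb_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return (baseline_mb_s - instrumented_mb_s) / baseline_mb_s * 100.0
```

The two transfer sizes correspond to `transfer_bytes(2048)` = 16384 bytes and `transfer_bytes(8192)` = 65536 bytes. As a purely hypothetical illustration, a baseline of 500.0 MB/sec and an instrumented run at 481.5 MB/sec would give a reduction of about 3.7%.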

4.8 Chapter Summary

In this chapter we described ways that TimeTrial is able to measure the performance of a distributed application with low impact. This is accomplished by deploying agents on each resource to measure the application and by aggregating event streams into performance metrics online.

[Figure 4.7 plot. Series: None, Rate, Occ, Occ+Rate; array sizes: 2048, 8192; axis: MB/sec, 0 to 700.]

Figure 4.7: Impact on the throughput of the micro-benchmark by the software agent. Error bars show one standard deviation.

Taps were described as a way to gather information about the performance. The TimeTrial language was integrated with the Auto-Pipe compiler to enable automated insertion of taps and connection of these taps to the measurement agents. The software agent is responsible for collecting results from all the FPGA agents and combining them into a profile per frame.

Both the software and FPGA agents are able to measure substantial applications with very little impact on the execution. Dedicating resources to the measurement task is essential.

Chapter 5 Monitoring Virtual Queues

In this chapter, we describe TimeTrial’s approach to measuring virtual queues, while attempting to preserve the low-impact nature of the performance monitoring. This approach involves using a simple discrete event simulation model for measurements on a single resource. Then, a more sophisticated model is presented that handles crossing a system I/O bus. Next, we show example measurements of virtual queues using our approach and compare them to ground truth (precise knowledge of the quantity being measured) collected from a micro-benchmark application. Finally, we discuss the circumstances where the approach described here is inappropriate, and how users of TimeTrial might understand when the approach is or is not applicable.

The previous chapter described the approach to automated monitoring of regular queues, that is, communication channels that begin and end within a single compute resource. More challenges arise when the communication channel crosses resource boundaries, such as the link from an FPGA to a processor. In this case, a portion of the queue is on the FPGA, a portion of the queue is comprised of buffers associated with whatever low-level communication mechanism(s) are used to move data between the FPGA and the processor (e.g., a PCIe bus or something similar), and a portion of the queue is in the processor’s memory.

We refer to the queues on such edges as virtual queues. Measuring the occupancy of virtual queues is not simply a matter of instrumenting the enqueue and dequeue operations. To perform the aggregation function(s), both enqueue and dequeue events must be known on a common platform, which necessitates moving either the enqueue events from the FPGA to the processor or the dequeue events from the processor to the FPGA. Neither of these options is attractive, since a large number of events implies that a large volume of performance meta-data must share the processor-FPGA interconnect (such as the PCIe bus example above).
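The cost of direct measurement can be seen in a small sketch of what the common platform would have to do: merge the two event streams and replay them to recover occupancy. This is an illustration of the unattractive direct approach, not of TimeTrial's method. The function names, the assumption of a shared clock, and the event size used in the usage note are all hypothetical.

```python
import heapq

def merged_occupancy(enqueue_ts, dequeue_ts):
    """Replay enqueue and dequeue timestamps to get an occupancy trace.

    Both inputs are sorted timestamp lists on a shared clock (an assumption).
    The middle tuple field orders enqueues before dequeues at equal times.
    """
    stream = heapq.merge(
        ((t, 0, +1) for t in enqueue_ts),
        ((t, 1, -1) for t in dequeue_ts),
    )
    occupancy = 0
    trace = []
    for t, _, delta in stream:
        occupancy += delta
        trace.append((t, occupancy))
    return trace

def metadata_bytes_per_second(events_per_second, bytes_per_event):
    """Interconnect bandwidth consumed by shipping every event across."""
    return events_per_second * bytes_per_event
```

Every event from one side must cross the interconnect before the merge can run. With hypothetical figures of one million events per second and 16 bytes per event, `metadata_bytes_per_second(1_000_000, 16)` is 16,000,000 bytes per second of measurement traffic competing with application data.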

As stated above, the direct measurement of communication channels (and their associated queues) that cross platform boundaries is incompatible with the notion of low-impact monitoring. While some metrics of interest, such as rate, can be effectively measured at one end of the channel or the other, other metrics, such as queue occupancy, require information from both the head and the tail of the queue.

TimeTrial’s approach to virtual queue monitoring is to instrument what it can and use a performance model of the underlying system to infer what it cannot directly measure, estimating the information that is missing. As such, it is important not only to provide the performance quantities that were estimated, but also to provide guidance as to the quality of the estimates.

Here we focus on querying the occupancy of virtual queues. It is anticipated that the same techniques will be similarly effective for other metrics that require detailed event information across platform boundaries (e.g., latency through virtual queues); however, this is left for confirmation in future work. It is clearly true for some aggregated metrics such as mean latency, which is directly related to mean occupancy via Little’s Law [74].
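Little’s Law states that, for a queue in steady state, the mean occupancy L equals the mean arrival rate λ multiplied by the mean time W an item spends in the queue, so W = L / λ. A minimal sketch of the conversion follows; the function name and units are illustrative assumptions.

```python
def mean_latency(mean_occupancy, mean_arrival_rate):
    """Mean time in queue from Little's Law, W = L / lambda.

    mean_occupancy: average number of items in the queue (L).
    mean_arrival_rate: average items arriving per second (lambda).
    Assumes the queue is in steady state over the measured interval.
    """
    if mean_arrival_rate <= 0:
        raise ValueError("arrival rate must be positive")
    return mean_occupancy / mean_arrival_rate
```

For instance, a mean occupancy of 8 items at a mean rate of 4 items per second corresponds to a mean latency of 2 seconds.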

As defined here, virtual queues consist of several constituent components, all chained together to form the communications channel between two compute blocks. For a channel that moves data between a multicore processor and an FPGA, the components will include buffers in user space on the processor, kernel buffers on the processor, a physical bus that transfers the data to the FPGA (e.g., PCIe), buffers in the DMA engine on the FPGA card, and application buffers deployed (by the Auto-Pipe runtime system) on the FPGA.

While many of these constituent components of the communications channel are opaque (i.e., they cannot be directly monitored by TimeTrial), the two components at either end of the chain (comprising the head of the queue and the tail of the queue) are visible to TimeTrial and can therefore be monitored. Whenever a user requests the occupancy of a virtual queue, TimeTrial deploys monitors for the head and tail sub-queues for which it has visibility.
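The component chain for a processor-to-FPGA channel can be modeled as a simple ordered list with a visibility flag. This is only a sketch of the structure described above: the `Component` type, its field names, and the helper functions are hypothetical, and marking the three interior components as opaque is an assumption based on the statement that only the two ends are visible.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Component:
    name: str
    visible: bool  # True if TimeTrial can monitor it directly (assumed flag)

# Components in channel order, as listed in the text.
CHANNEL = [
    Component("user-space buffers (processor)", True),
    Component("kernel buffers (processor)", False),
    Component("physical bus (e.g., PCIe)", False),
    Component("DMA engine buffers (FPGA card)", False),
    Component("application buffers (FPGA)", True),
]

def monitored(channel):
    """Sub-queues that would receive monitors: the visible ends."""
    return [c for c in channel if c.visible]

def opaque(channel):
    """Components whose contribution has to be estimated, not measured."""
    return [c for c in channel if not c.visible]
```

Here `monitored(CHANNEL)` returns the first and last entries, and `opaque(CHANNEL)` returns the three interior components whose occupancy cannot be observed directly.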
