
Automatic Accelerator Selection

5.2 PMU-Based Model

The first model is based on software metrics obtained by profiling the application on the ARM processor using the integrated Performance Monitor Unit (PMU). We shall refer to this model as the PMU-Based Model. This model also takes into account a few metrics that can be obtained through a static analysis of the application. We will use an example to motivate the intuition behind this approach.

Software runtime can act as a rough proxy for the computational complexity of an application. We use this idea to estimate accelerator performance. As an experiment, we compare the runtime of a given function in software with the runtime of the same function implemented as a hardware accelerator. We performed this experiment for all the functions in the CHStone benchmark suite [25] by measuring the software runtime of each function and synthesizing each function into a hardware accelerator with LegUp. Figure 5.1 shows a log-log plot of the results, with the vertical axis showing the number of instructions executed in the main execution pipeline of the ARM processor and the horizontal axis showing the measured accelerator runtime. Each point in the figure represents a function from the CHStone benchmark suite.

Figure 5.1 shows a clear relationship between the runtime of a function in software and its runtime as an accelerator. However, for a given value on one axis, the points span nearly two orders of magnitude on the other. For example, given a PMU count from the ARM main execution unit (event number 0x70) of around 10,000, the accelerator runtime could be anywhere from 10,000 to 1,000,000 cycles. An estimate this loose is unacceptable when trying to automatically choose ‘good’ accelerators.

Figure 5.1: Log-log plot of ARM main execution unit instructions versus measured accelerator latency. (Axes: ARM Main Execution Unit Instructions and Measured Accelerator Runtime, each spanning 10^2 to 10^7.)

We can try to achieve a more accurate estimate by using a larger number of inputs to the model. The ARM performance counters in both the PMU and L2 cache can be used to collect runtime statistics on the application. We used the bare-metal profiling tool described in Chapter 4 to obtain, for each function:

• counts for 18 different PMU events

• the CPU cycle count

• the number of data requests and hits in the L2 cache

Although the PMU provides approximately 60 different events, many of them are irrelevant or redundant for our single-core, bare-metal architecture. For example, counters related to TLB misses are not particularly useful since our target applications are quite small and will often fit within a few adjacent memory pages. Since our compiler does not leverage the Cortex-A9’s preload instruction, all counters related to preloading are of little use. Similarly, performance counters related to barrier instructions and exclusive memory operations are not used because we are not using the ARM processors in symmetric multiprocessing mode. Finally, counters for Jazelle and Java bytecode instructions can safely be ignored, since they will never be used in our system. Table 5.1 shows the 18 PMU events that were collected for each of the functions in the CHStone suite.

Table 5.1: List of PMU events collected for CHStone benchmarks.

Description                                      PMU Event Number
Branch mispredicted or not predicted             0x10
Predictable branches                             0x12
Linefill miss                                    0x50
Instruction cache dependent stall cycles         0x60
Data cache dependent stall cycles                0x61
Data eviction                                    0x65
No instruction dispatched                        0x66
Instructions coming out of core renaming stage   0x68
Main execution unit instructions                 0x70
Secondary execution unit instructions            0x71
Load/Store instructions                          0x72
NEON instructions                                0x74

In addition to performance counters, a static analysis of the target program can provide additional software metrics which may be useful for estimating accelerator performance. For our prediction model, the only additional software metrics we use are the number of function arguments and the width of the return value. This information is important for computing the overhead associated with the accelerator, since several cycles are necessary for writing arguments and retrieving the return value or done signal. For small accelerators this static overhead is particularly important since it may contribute a large portion of the accelerator’s overall runtime.
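As a rough illustration of how such a static overhead term could be derived from the two static metrics, the following Python sketch counts one bus write per argument and one bus read per word of the return value, with at least one read to poll the done signal. The function name, the per-access cycle costs, and the 32-bit word size are all assumptions made for illustration; the text does not give the exact formula used.

```python
import math

def static_overhead_cycles(num_args, return_bits,
                           write_cycles=1, read_cycles=1, word_bits=32):
    """Estimate processor-side cycles spent invoking an accelerator.

    write_cycles, read_cycles and word_bits are hypothetical placeholders;
    real values depend on the bus connecting the processor and accelerator.
    """
    # One write per argument (assumes each argument fits in one bus word).
    arg_writes = num_args * write_cycles
    # One read per word of the return value; a void function still needs
    # at least one read to observe the done signal.
    return_reads = max(1, math.ceil(return_bits / word_bits)) * read_cycles
    return arg_writes + return_reads

# Three arguments and a 64-bit return value: 3 writes + 2 reads.
print(static_overhead_cycles(3, 64))
```

For a function that does very little work, a fixed cost of this kind can dominate the accelerator's total latency, which is why it is worth including as a model input.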

Once all of the data had been collected for each function of the designs in the CHStone benchmark suite, weighted linear regression was used to fit this data to the measured latency of the equivalent accelerator. However, since the Cortex-A9 PMU is capable of tracking only six events at a time, we used stepwise model selection to remove the 12 PMU events which made the least contribution to model accuracy. This provided a secondary benefit of removing events that were highly correlated.
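One common form of stepwise selection is backward elimination, sketched below in Python with NumPy. This is only an illustration under stated assumptions: the text does not say which stepwise variant or selection criterion was used, so the sketch simply drops, at each step, the feature whose removal increases the weighted sum of squared residuals the least. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def weighted_sse(X, y, w):
    """Weighted sum of squared residuals of a linear fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    s = np.sqrt(w)
    # Scaling rows by sqrt(w) turns weighted least squares into ordinary LS.
    coef, *_ = np.linalg.lstsq(A * s[:, None], y * s, rcond=None)
    r = y - A @ coef
    return float(np.sum(w * r ** 2))

def backward_eliminate(X, y, w, names, keep):
    """Remove one feature at a time until only `keep` features remain."""
    active = list(range(X.shape[1]))
    while len(active) > keep:
        scores = []
        for j in active:
            trial = [k for k in active if k != j]
            scores.append((weighted_sse(X[:, trial], y, w), j))
        # Drop the feature whose removal hurts the fit the least.
        _, least_useful = min(scores)
        active.remove(least_useful)
    return [names[k] for k in active]

# Synthetic data: only features f0 and f2 actually influence y.
rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(40, 5))
y = 10 + 3 * X[:, 0] + 5 * X[:, 2]
names = ["f0", "f1", "f2", "f3", "f4"]
selected = backward_eliminate(X, y, 1.0 / y ** 2, names, keep=2)
print(selected)
```

In the setting described above, `X` would hold the 18 PMU event counts per function and `keep` would be 6, the number of events the PMU can track at once.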

After stepwise model selection, we are left with the software metrics and PMU events shown in Table 5.2. The weights for the regression model objective function were chosen as $1/y^2$, where $y$ is the measured value of accelerator latency. As suggested in [55], this weighting helps to reduce the percentage error since the residual is divided by $y$ (the squared residual is divided by $y^2$).
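The effect of this weighting can be sketched in Python with NumPy. With weights of one over the squared measured latency, the objective becomes the sum of squared relative errors, and the fit reduces to ordinary least squares after dividing each row by the measured latency. The function names and the synthetic data below are assumptions for illustration and do not reproduce the actual model or its inputs.

```python
import numpy as np

def fit_relative_wls(X, y):
    """Fit y ~ b0 + X @ b using weights 1/y^2."""
    A = np.column_stack([np.ones(len(y)), X])
    # sqrt(weight) = 1/y, so scale every row (and the target) by 1/y.
    scale = 1.0 / y
    coef, *_ = np.linalg.lstsq(A * scale[:, None], y * scale, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def sum_sq_relative_error(coef, X, y):
    """The weighted objective: sum of ((y - y_hat) / y)^2."""
    return float(np.sum(((y - predict(coef, X)) / y) ** 2))

# Synthetic latencies spanning several orders of magnitude, with 10% noise.
rng = np.random.default_rng(0)
X = rng.uniform(1e2, 1e6, size=(30, 2))
y = (50.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1]) * rng.uniform(0.9, 1.1, size=30)

wls = fit_relative_wls(X, y)
ols, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(y)), X]), y, rcond=None)

# The weighted fit can never have a larger relative-error objective than the
# unweighted fit, because it minimizes exactly that quantity.
print(sum_sq_relative_error(wls, X, y), sum_sq_relative_error(ols, X, y))
```

An unweighted fit is dominated by the functions with the largest latencies, whereas this weighting treats a 10% error on a short-running accelerator the same as a 10% error on a long-running one.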

Most entries in Table 5.2 are self-explanatory; however, some warrant a brief description.

Table 5.2: Metrics and events used in the model.

Description                              PMU Event Number
Cycle count for SW execution             -
Overhead                                 -
L2 cache data request                    -
L2 cache data hit                        -
Predictable branches                     0x12
Linefill miss                            0x50
No instruction dispatched                0x66
Main execution unit instructions         0x70
Secondary execution unit instructions    0x71
Load/Store instructions executed         0x72

‘Overhead’ corresponds to the number of cycles for the reads and writes necessary to initialize the accelerator’s arguments and read back the return value. ‘Linefill miss’ (event 0x50) counts the number of times a cache linefill request misses in all processor L1 caches and is therefore sent to the L2 cache. ‘No instruction dispatched’ (event 0x66) counts the number of cycles where the ARM processor issue stage does not dispatch any instructions (the processor is idle).

The main and secondary execution units are the two parallel ALU pipelines (events 0x70 and 0x71). The remaining entries in Table 5.2 are fairly self-explanatory.

To summarize, by considering the metrics in Table 5.2, we can better estimate the number of cycles that will be used if a function is implemented as an accelerator. Recall that with a single metric (main execution unit instructions in this case), we were only able to predict accelerator performance to within two orders of magnitude (Figure 5.1). Looking ahead to the results, Figure 6.1 shows that this new model, composed of the metrics in Table 5.2, produces predictions that are generally well within a single order of magnitude.

5.2.1 PMU-Based Model Flow

Once the linear regression model is trained, it is fairly straightforward to integrate it into an automatic accelerator selection flow for LegUp. After acceleration candidate filtering, each remaining acceleration candidate is profiled using the profiling framework described in Chapter 4.

Running the profiled application on the ARM core makes available all the necessary inputs to the linear regression model. Once all candidate functions have been run on the ARM core and the resulting profiling data has been passed through the model, performance estimates are available for all the candidate functions. Any functions which are estimated to produce a speedup when accelerated are integrated into a hybrid processor-accelerator system and synthesized to hardware for validation.
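The selection step described above can be sketched in Python as follows. This is a minimal illustration, not the LegUp implementation: the `Candidate` type, the function names, and the cycle counts are hypothetical, and the sketch compares processor and accelerator cycle counts directly, which assumes both run at the same clock frequency.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    sw_cycles: float            # measured software cycle count on the ARM core
    predicted_hw_cycles: float  # model estimate of accelerator latency

def estimated_speedup(c):
    return c.sw_cycles / c.predicted_hw_cycles

def select_accelerators(candidates):
    """Keep candidates with an estimated speedup above 1, best first."""
    chosen = [c for c in candidates if estimated_speedup(c) > 1.0]
    return sorted(chosen, key=estimated_speedup, reverse=True)

# Made-up numbers for three hypothetical candidate functions.
candidates = [
    Candidate("filter", sw_cycles=12000, predicted_hw_cycles=3000),
    Candidate("parse", sw_cycles=800, predicted_hw_cycles=2500),
    Candidate("hash", sw_cycles=50000, predicted_hw_cycles=40000),
]
selected_names = [c.name for c in select_accelerators(candidates)]
print(selected_names)
```

Sorting by estimated speedup is not required by the flow described here, but it gives a natural priority order if not every selected accelerator can be kept.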

It is important to note that selecting all functions which show potential for acceleration may not be realistic. For example, if several accelerators are generated, the target device may not be able to accommodate all of them at once. However, since all our benchmark designs are relatively small and have few functions that benefit from acceleration, we did not encounter this issue. One potential solution for larger designs would be to apply partial reconfiguration to swap between accelerators at runtime.