

3.2 Simulation and Modelling of NP Architectures

3.2.1 Mathematical Models

With complex SoC designs requiring a large outlay of resources and time, a method of evaluating performance for various configurations before the design process begins is sometimes required. Constructed to model steady state operation, stochastic mathematical models represent the most common methodology. In turn, these stochastic models can be separated into two sub-methodologies which are outlined below.

3.2.1.1 Queue Model

Queue-based models allow the delays, loads and probabilities within a system to be approximated without a detailed simulation framework. Once a queue model has been developed, analysis of the queue and service nodes within the system can be used to identify potential bottlenecks and contention. For example, consider the system outlined in Figure 3.1.

Figure 3.1: Queue Model

Each of the n PEs accesses a shared device such as a memory module, interface unit or hardware accelerator. Each PE hosts m threads, with each thread performing the same task. The PE array operates at f_pe, while the shared device operates at f_hw and takes τ_hw device cycles to complete one operation. To access the hardware block, each thread issues one command to it. In a steady state, the total PE request process is the non-deterministic sum of n Poisson processes, with an arrival rate, λ, of m * n. The deterministic service time for each command is given by Equation 3.1.

$$\mu = \frac{f_{pe}}{f_{hw}} \cdot \tau_{dev} \tag{3.1}$$

Following an M/D/1 queueing model [130], the total delay associated with the hardware accelerator can therefore be calculated using Equation 3.2, in which ρ_hw is the utilisation rate associated with the hardware block (ρ = λµ).

$$t_{queue} = \frac{\rho_{hw}^{2}}{2(1 - \rho_{hw})} \cdot \frac{clk_{pe}}{clk_{hw}} \cdot \tau_{dev} \tag{3.2}$$
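Equations 3.1 and 3.2 can be evaluated directly, as in the minimal sketch below. It transcribes the two equations as printed rather than re-deriving them, and it assumes that clk_pe/clk_hw in Equation 3.2 is the same frequency ratio as f_pe/f_hw in Equation 3.1, and that τ_dev is the device latency written τ_hw in the text. The function names and the example clock frequencies are hypothetical and chosen only for illustration.

```python
def service_time_pe_cycles(f_pe: float, f_hw: float, tau_dev: float) -> float:
    """Equation 3.1: service time of one command, expressed in PE clock cycles.

    f_pe, f_hw : PE and shared-device clock frequencies (same unit, e.g. Hz)
    tau_dev    : device cycles needed to complete one operation
    """
    if f_pe <= 0 or f_hw <= 0:
        raise ValueError("clock frequencies must be positive")
    if tau_dev < 0:
        raise ValueError("tau_dev must be non-negative")
    return (f_pe / f_hw) * tau_dev


def queue_delay_pe_cycles(rho_hw: float, f_pe: float, f_hw: float,
                          tau_dev: float) -> float:
    """Equation 3.2 as printed: queueing delay in PE clock cycles.

    rho_hw : utilisation of the shared device; must lie in [0, 1),
             otherwise the queue has no steady state.
    """
    if not 0.0 <= rho_hw < 1.0:
        raise ValueError("rho_hw must lie in [0, 1) for a steady state")
    mu = service_time_pe_cycles(f_pe, f_hw, tau_dev)
    return rho_hw ** 2 / (2.0 * (1.0 - rho_hw)) * mu


# Hypothetical configuration: 600 MHz PEs, 200 MHz device, 4-cycle operation.
example_mu = service_time_pe_cycles(600e6, 200e6, 4)
example_delay = queue_delay_pe_cycles(0.5, 600e6, 200e6, 4)
```

Sweeping `rho_hw` or `tau_dev` with these functions gives a quick way to explore the sensitivity of the delay to each variable before committing to a more detailed simulation.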

The queue model can therefore be examined at an abstract level for system performance, most notably for sensitivity to other system variables. For example, Figure 3.2 presents the system delay as the load, access latency and hardware time are varied.

Figure 3.2: (A) Queue Delay Vs Device Load (queue delay in cycles against ρhw). (B) Queue Delay Vs Device Latency (queue delay in cycles against τhw). (C) Queue Delay Vs Device Load Vs Device Latency (queue delay in cycles against ρhw and τhw).

In Figure 3.2(A) it can be seen that the delay associated with accessing the hardware accelerator grows non-linearly, rising sharply as the hardware load approaches saturation, while in Figure 3.2(B) it can be seen that changes in the service time, due to additional latency, result in a linearly increasing delay. The non-linear response associated with device load presents a trade-off during system design. Such an open queue model allows rapid development and evaluation. Within general purpose processing, these queue models have been used as a mechanism for performing highly abstract analysis of multiprocessor systems [131], [132], [133], [134].

Queue models do, however, suffer from a number of problems. Firstly, a number of assumptions regarding task complexity and inter-process communication are typically required [131], [132]. While Bucher and Calahan found that an open-queue model will overestimate delays by up to 10% [134], Tsuei and Vernon found that a queue model developed specifically for an underlying architecture had an error of 9% when compared to a software model of the same architecture [133].
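The two sensitivities discussed for Figure 3.2 can be read off Equation 3.2 by differentiating it with respect to each variable. The sketch below treats ρ_hw and τ_dev as independent, matching the way each panel varies one quantity while holding the other fixed; this independence is an assumption of the sketch, since ρ = λµ also depends on the service time.

```latex
% Equation 3.2 as printed in the text
t_{queue} = \frac{\rho_{hw}^{2}}{2(1 - \rho_{hw})}
            \cdot \frac{clk_{pe}}{clk_{hw}} \cdot \tau_{dev}

% Sensitivity to device latency: constant in \tau_{dev},
% so the delay is linear in the latency
\frac{\partial t_{queue}}{\partial \tau_{dev}}
  = \frac{\rho_{hw}^{2}}{2(1 - \rho_{hw})} \cdot \frac{clk_{pe}}{clk_{hw}}

% Sensitivity to device load: grows without bound as \rho_{hw} \to 1
\frac{\partial t_{queue}}{\partial \rho_{hw}}
  = \frac{\rho_{hw}(2 - \rho_{hw})}{2(1 - \rho_{hw})^{2}}
    \cdot \frac{clk_{pe}}{clk_{hw}} \cdot \tau_{dev}
```

The first derivative does not depend on τ_dev, which corresponds to the straight line in panel (B). The second contains a (1 − ρ_hw)² term in the denominator, which corresponds to the steep rise at high load in panel (A).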

Wolf et al. presented an analytical performance model for an NP, with a queue model employed to represent data transfers between the PEs and system memory [135]. The difficulty remains that, in order to extract meaningful data from the model, applications and algorithms had to be implemented on an architecture before any analysis could be performed, presupposing an underlying architecture. Furthermore, it was assumed that the system bus connected PEs and external DRAM via a cache, with no additional data generating bus or memory requests. In Chapter 2 it was seen that it is more common for an NP architecture to employ both fast, expensive SRAM-based control memory and slow, cheap DRAM-based packet memory.

3.2.1.2 Petri Net

A second mathematical approach is the Petri Net modelling framework, which provides a methodology by which a discrete system can be represented and analysed [136]. Examining transitions between concurrent systems, the Petri Net framework is well suited to parallel systems such as NP architectures. Within the NP domain, research in [137] examined the accuracy of Petri Net modelling when applied to the Intel IXP architecture of NPs. The model produced results similar to those of a cycle-accurate simulation, but relied on a significant number of assumptions. For example, a fixed line rate, fixed packet size and fixed memory access time are all assumed. Furthermore, it is unclear how the accuracy of a Petri Net could be improved without obtaining more accurate timing information, a process which would require applications to be implemented on the target platform.
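As an illustration of how a Petri Net represents concurrent access to a shared resource, the sketch below implements the basic token-firing rule for an untimed net. It is not the model used in [137]. The places (`thread_ready`, `memory_free`, `memory_busy`, `thread_done`) and transitions (`issue`, `complete`) are hypothetical names describing two threads contending for a single memory port.

```python
from dataclasses import dataclass
from typing import Dict

Marking = Dict[str, int]  # place name -> number of tokens


@dataclass
class Transition:
    name: str
    inputs: Dict[str, int]   # place -> tokens consumed when firing
    outputs: Dict[str, int]  # place -> tokens produced when firing


def is_enabled(marking: Marking, t: Transition) -> bool:
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(place, 0) >= weight
               for place, weight in t.inputs.items())


def fire(marking: Marking, t: Transition) -> Marking:
    """Return the marking reached by firing t; the input marking is unchanged."""
    if not is_enabled(marking, t):
        raise ValueError(f"transition {t.name!r} is not enabled")
    after = dict(marking)
    for place, weight in t.inputs.items():
        after[place] = after.get(place, 0) - weight
    for place, weight in t.outputs.items():
        after[place] = after.get(place, 0) + weight
    return after


# Hypothetical net: a thread needs the single memory port to issue a request.
issue = Transition("issue",
                   inputs={"thread_ready": 1, "memory_free": 1},
                   outputs={"memory_busy": 1})
complete = Transition("complete",
                      inputs={"memory_busy": 1},
                      outputs={"memory_free": 1, "thread_done": 1})

m0 = {"thread_ready": 2, "memory_free": 1, "memory_busy": 0, "thread_done": 0}
m1 = fire(m0, issue)             # first thread takes the memory port
second_can_issue = is_enabled(m1, issue)  # port is busy, so contention arises
m2 = fire(m1, complete)          # port is released, one thread has finished
```

A performance model would additionally attach firing delays to the transitions, which is where assumptions such as a fixed memory access time enter.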