FORMAL MODEL - COL ADDRESS - Real-Time Prefetching on Shared-Memory Multi-Core Systems


IV. FORMAL MODEL

In this section, we formally show how to provide composable service based on LR servers by deriving and enforcing temporal bounds. We start by explaining how service curves are used to model the interaction between the requestors and the resource in Section IV-A. This allows us to define composable service. We then proceed in Section IV-B by defining an LR server and showing that it can provide composable service according to our definition.

Throughout this paper, we use capital letters ($A$) to denote sets, hats to denote upper bounds ($\hat{a}$), and checks to denote lower bounds ($\check{a}$). Subscripts are used to disambiguate between variables belonging to different requestors, although for clarity these subscripts are omitted when they are not required. To deal with different resources in a uniform way, we adopt an abstract resource view, where a service unit corresponds to the access granularity of the resource. Time is discrete and a time unit, referred to as a service cycle, is defined as the time required to serve such a service unit. Service cycles are translated to clock cycles by multiplying the number of service cycles by the maximum service cycle length, which is known and bounded for a predictable resource.

A. Service curves

We use cumulative service curves to model the interaction between the resource and the requestors. We let $\xi(t)$ denote the value of a service curve $\xi$ at service cycle $t$. We furthermore use $\xi(\tau, t) = \xi(t + 1) - \xi(\tau)$ to denote the difference in values between the endpoints of the closed interval $[\tau, t]$.
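The interval notation can be illustrated with a minimal sketch. It assumes the curve is stored as a list of cumulative values indexed by service cycle, with entry `t` holding the service accumulated before cycle `t`; the list `xi` and the helper `curve_diff` are illustrative names that do not come from the source.

```python
def curve_diff(xi, tau, t):
    """Return xi(tau, t) = xi(t + 1) - xi(tau) for the closed interval [tau, t]."""
    if not 0 <= tau <= t or t + 1 >= len(xi):
        raise ValueError("interval [tau, t] must lie inside the sampled curve")
    return xi[t + 1] - xi[tau]


# Hypothetical cumulative curve (assumed convention: xi[t] is the service
# accumulated before service cycle t, so the curve never decreases).
xi = [0, 1, 1, 2, 3, 3]

# Service accumulated during cycles 1, 2 and 3.
served = curve_diff(xi, 1, 3)
```

The `t + 1` in the definition is what makes the interval closed: service delivered during cycle `t` itself is included in the difference.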


Figure 2.7: Graphical representation of the service provided by an LR arbiter [8].

Many of these arbiters fall into the class of Latency-Rate (LR) arbiters [55]. An arbiter in this class can be characterised by two things: its latency, the maximal time for which a set of requests may be blocked, and its rate, the average rate at which service to the shared resource will be provided. A graphical model of this can be seen in Figure 2.7. Here, the red line denotes the service requested through the arbiter, and the blue line the provided service. The arbiter will provide an average rate of service $\rho$ after a delay $\Theta$. The system is then schedulable provided that the sum of all rates does not exceed the available bandwidth, i.e. $\sum_{r \in R} \rho_r \le 1$, where $R$ is the set of all requestors. In this case, the provided service will never fall below the minimum service level (the dotted line) while there is still some work outstanding.
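The minimum service level can be sketched as a function of time. The sketch below assumes the guarantee takes the common form $\max(0, \rho (t - \Theta))$, measured from the start of a period in which work is outstanding; this exact form, and the name `lr_min_service`, are assumptions made for illustration rather than definitions taken from the source.

```python
from fractions import Fraction


def lr_min_service(t, theta, rho):
    """Lower bound on service provided t cycles into a busy period.

    Assumed form: nothing is guaranteed during the latency theta, after
    which service accumulates at the allocated rate rho.
    """
    return max(Fraction(0), rho * (t - theta))


# Hypothetical requestor with latency 4 cycles and a quarter of the bandwidth.
theta, rho = 4, Fraction(1, 4)
bound_early = lr_min_service(2, theta, rho)   # still inside the latency
bound_later = lr_min_service(12, theta, rho)  # 8 cycles past the latency
```

Exact fractions are used so that rates such as 1/3 do not introduce rounding error into the bound.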

Being able to derive the latency and rate of a memory transaction in this way simplifies system analysis. Without such an arbiter, the system must be analysed as a whole, and the behaviour of every task with regard to shared memory must be accounted for together, which is difficult and time consuming. By splitting the available system bandwidth in this way, each task only needs to be analysed against its assigned partition of the bandwidth. Furthermore, this simplifies task switching; given that each task has assigned bandwidth bounds, a new task can be scheduled on a processor as long as the bandwidth it requires does not exceed the remaining “spare” bandwidth in the system. Combining tasks that each have their own partition in this way is typically called “composable” system construction [56].
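The admission rule described here amounts to a single comparison. The following sketch assumes rates are expressed as fractions of the total bandwidth; `spare_bandwidth` and `can_admit` are hypothetical helper names.

```python
from fractions import Fraction


def spare_bandwidth(rates):
    """Bandwidth share not yet allocated to any task (total bandwidth is 1)."""
    return 1 - sum(rates)


def can_admit(rates, new_rate):
    """A new task fits if its rate does not exceed the spare bandwidth."""
    return new_rate <= spare_bandwidth(rates)


# Hypothetical system with two tasks already holding 1/2 and 1/4.
allocated = [Fraction(1, 2), Fraction(1, 4)]
fits = can_admit(allocated, Fraction(1, 4))      # exactly fills the system
too_big = can_admit(allocated, Fraction(1, 3))   # would exceed the bandwidth
```

Only the aggregate of the existing allocations matters to the check, which is what allows a new task to be admitted without re-analysing the tasks already running.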

Of course, the behaviour of an arbiter depends upon the arbitration scheme in use. Some of the most commonly used schemes are detailed throughout the remainder of this section.

Static Priority

The most basic arbitration scheme is static priority. Within this scheme, each input port on the arbiter is assigned a static priority, and on each scheduling interval (i.e. the period at which the arbiter schedules a new request), the request which originated from the highest-priority input port is given service. If the set of running tasks is known ahead of time, along with the exact memory locations they will access, when they will access them and exactly how long each access will take, this approach is one of the simplest to analyse and requires little hardware overhead to keep track of state.
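A single scheduling decision under static priority can be sketched as follows. The sketch assumes ports are numbered so that a lower index means a higher priority; that convention and the function name are illustrative only.

```python
def static_priority_pick(pending):
    """Return the highest-priority port with a pending request, or None.

    pending[i] is True when port i has a request waiting; port 0 is assumed
    to have the highest priority.
    """
    for port, has_request in enumerate(pending):
        if has_request:
            return port
    return None


# Ports 1 and 2 both have requests waiting, so port 1 is served.
granted = static_priority_pick([False, True, True])
```

The arbiter keeps no state between scheduling intervals, which is why the hardware cost is low; it is also why a port that always has a request pending will always win over every lower-priority port.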

Static priority does not, however, allow for any dynamism in the system. It is difficult to change the set of running tasks safely (unless all possible sets of running tasks have been analysed ahead of time), nor can static priority deal with much deviation from what was expected: if a memory access cannot take place when the analysed schedule assumed that it could, the schedule may shift by one or more time periods and hence cause tasks to miss their deadlines. Moreover, static priority is not safe; if a task initiates more memory accesses than were expected, either through a bug in the task or an error in the system analysis, it can negatively affect other tasks and again cause them to miss their deadlines. In the worst case, a task in a static-priority system can flood the memory controller with requests and effectively starve all other tasks.

It is possible for statically scheduled systems to undergo “mode changes” [57] in order to change the priorities of tasks or the set of running tasks, adding some form of dynamism to the system. Ultimately though, this simply assigns a set of tasks to each “mode” and analyses the interactions within each, allowing the system to jump between a few pre-determined sets of tasks.

This inability to deal with unexpected conditions makes static priority a poor fit for a system which also uses prefetching; if the prefetcher initiates a prefetch at a poor time, it may cause excess interference in the system and cause tasks to miss their deadlines, as the system analysis did not take this into account. Even if the prefetch is safe, it may displace useful data from the processor’s cache, forcing an unexpected fetch that again causes excess interference and missed deadlines.

While not good on its own, static priority can be very good when combined with other arbitration schemes; both CCSP and FBSP use static priority to determine an ordering when multiple requests can be scheduled in the same scheduling interval. This combination allows some flexibility within the system: high-priority requestors can issue a request without experiencing much latency, whereas requestors which can tolerate additional latency can be assigned a lower static priority while still being guaranteed an upper bound on the blocking they will experience.

Round-Robin/Time Division Multiplexing

Arguably the most basic fair arbitration scheme is round-robin arbitration [58]. This scheme simply cycles through the input ports of the arbiter in sequence, serving the request from the next port on each scheduling interval. Because of the simplicity of this arbiter, it is trivial to derive latency-rate constraints for it. Assuming a scheduling interval of $t_{sched}$ and $N$ input ports, the latency is the time taken if a request has “just missed” its interval, that is, the time for the window it just missed and the windows of all other requestors to elapse, and hence can be derived as $(t_{sched} \times N) - 1$. The rate is equally simple to derive: since bandwidth is shared equally between requestors, it is $\frac{1}{N}$.
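The two round-robin parameters can be computed directly from the expressions above. This is a small sketch with a hypothetical function name; the values of the scheduling interval and port count are chosen purely for illustration.

```python
from fractions import Fraction


def round_robin_lr(t_sched, n_ports):
    """Latency and rate of a round-robin arbiter, per the expressions above."""
    latency = t_sched * n_ports - 1   # request "just missed" its own window
    rate = Fraction(1, n_ports)       # bandwidth is shared equally
    return latency, rate


# Hypothetical arbiter: 4 input ports, scheduling interval of 4 cycles.
latency, rate = round_robin_lr(t_sched=4, n_ports=4)
```

Both values are the same for every port, reflecting the fact that round-robin cannot distinguish between requestors with different needs.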

[TDM table: slots 1 2 2 3 4 4 3 4 2, with the cell pointer and the interval t_max marked]

Figure 2.8: Example 9-slot TDM schedule for four requestors.

Of course, round-robin assumes that all requestors require the same rate of service. This is relaxed when using time-division multiplexing (TDM) instead. This method uses a table in which each slot denotes which requestor is given service on each cycle; an example can be seen in Figure 2.8. On each scheduling interval, the arbiter reads which requestor should be given service from the current table cell, then increments the cell pointer. This can be used to accommodate the case where uniform bandwidth is not required and some requestors need a slightly higher bandwidth share than others.
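The table-and-pointer mechanism can be sketched in a few lines. The class name `TdmArbiter`, the wrap-around of the cell pointer at the end of the table, and the four-slot table are assumptions made for illustration; the table is not the schedule of Figure 2.8.

```python
class TdmArbiter:
    """Table-driven arbiter: one table cell is consumed per scheduling interval."""

    def __init__(self, table):
        if not table:
            raise ValueError("TDM table must contain at least one slot")
        self.table = list(table)
        self.cell = 0  # cell pointer into the table

    def next_grant(self):
        """Read the requestor in the current cell, then advance the pointer."""
        requestor = self.table[self.cell]
        self.cell = (self.cell + 1) % len(self.table)  # assumed to wrap around
        return requestor


# Hypothetical schedule giving requestor 1 twice the share of requestors 2 and 3.
arbiter = TdmArbiter([1, 2, 1, 3])
grants = [arbiter.next_grant() for _ in range(5)]
```

Giving a requestor a larger share is purely a matter of writing its identifier into more cells, so the hardware stays as simple as round-robin plus a small table.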

While slightly more complicated, this arbiter still fits into the framework of an LR arbiter. Here, it must again be assumed that the request has just missed its slot, and that it will have to wait the maximal amount of time possible until it can be scheduled again. This behaviour depends upon the current TDM schedule; for the schedule in Figure 2.8, the maximal waiting time for requestor number 2 is $t_{max}$, or in this case, $(7 \times t_{sched}) - 1$. The average rate is then the proportion of slots which a requestor has from the whole schedule, that is, $\frac{nSlots}{N}$, with $nSlots$ the number of slots assigned to the requestor and $N$ the length of the schedule.
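Both parameters can be derived mechanically from a TDM table. The sketch below counts the time from just after the start of one of the requestor's own slots until its next own slot begins, which is the same convention as the round-robin expression; whether the service slot itself is also counted is a modelling choice. The function name and the four-slot table are hypothetical, and the table is not the schedule of Figure 2.8.

```python
from fractions import Fraction


def tdm_lr(table, requestor, t_sched):
    """Worst-case wait and rate for one requestor in a cyclic TDM table."""
    n = len(table)
    own = [i for i, r in enumerate(table) if r == requestor]
    if not own:
        raise ValueError("requestor has no slot in the table")
    # Distance in slots from each own slot to the next own slot, wrapping
    # around the end of the table (a single own slot gives a distance of n).
    gaps = [(own[(k + 1) % len(own)] - s - 1) % n + 1 for k, s in enumerate(own)]
    latency = max(gaps) * t_sched - 1  # the request "just missed" its slot
    rate = Fraction(len(own), n)       # share of slots in the whole schedule
    return latency, rate


# Hypothetical schedule with a scheduling interval of 4 cycles.
table = [1, 2, 1, 3]
lr_1 = tdm_lr(table, 1, t_sched=4)
lr_3 = tdm_lr(table, 3, t_sched=4)
```

With one slot per requestor the result reduces to the round-robin values, and spreading a requestor's slots evenly through the table reduces its worst-case wait without changing its rate.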

Frame-Based Static Priority

So far, all of the arbitration schemes suffer from the fact that they do not allow for any jitter in the release of requests from a requestor; if a request misses the start of its window by a cycle, it must wait the maximal delay for its time slot to appear again. Frame-based static priority attempts to fix this by moving to a model in which a requestor is permitted to make a number of requests within a given time window [59]. There still needs to be a mechanism to arbitrate between the requestors that are eligible to make a request, which is done through static priority.

At least for memory arbitration, this is advantageous over systems such as TDM because it does not require the task to always have the same access pattern; it may be “burstier” over some windows and constant over others. Instead, only an average rate is enforced. On the other hand, it may cause a great amount of slack to accumulate for low-priority requestors; if the high-priority requestors are not utilising their entire bound, lower-priority requestors will be able to gain service much faster than their estimated latency. Moreover, the hardware overhead of such an arbiter is larger than that of the table-based systems.
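One way to picture the scheme is as a per-frame budget combined with a static-priority choice among the requestors that still have budget. The sketch below is a simplified illustration under several assumptions: budgets are replenished at the start of each frame, a frame is a fixed number of scheduling intervals, and a lower index means a higher priority. The class `FbspArbiter` is hypothetical and is not the design described in [59].

```python
class FbspArbiter:
    """Simplified frame-based static-priority arbiter (illustrative only)."""

    def __init__(self, budgets, frame_slots):
        self.budgets = list(budgets)    # requests allowed per frame, by priority
        self.frame_slots = frame_slots  # scheduling intervals in one frame
        self.remaining = list(budgets)
        self.slot = 0

    def schedule(self, pending):
        """Serve the highest-priority pending requestor that still has budget."""
        if self.slot == 0:
            self.remaining = list(self.budgets)  # new frame: replenish budgets
        self.slot = (self.slot + 1) % self.frame_slots
        for i, has_request in enumerate(pending):
            if has_request and self.remaining[i] > 0:
                self.remaining[i] -= 1
                return i
        return None


# Two requestors that always have work, with budgets 1 and 2 per 4-slot frame.
arbiter = FbspArbiter(budgets=[1, 2], frame_slots=4)
grants = [arbiter.schedule([True, True]) for _ in range(5)]
```

In this run the high-priority requestor is served first but cannot take more than its budget, so the low-priority requestor is still served within the frame, and the last slot of the frame is left idle because both budgets are exhausted.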

The latency and rate of the arbiter can be derived as follows. The latency corresponds to the worst-case situation in which all higher-priority requestors consume their entire bound. If $n_i$ is the number of requests that a requestor $i \in R$ may have serviced over the frame interval $t_{frame}$, and $hp(i)$ is the set of requestors with a higher priority than $i$, the worst-case latency $\Theta_i$ is defined as follows:

$$\Theta_i = \sum_{k \in hp(i)} n_k \times t_{sched}$$

The rate is then the proportion of the total number of requests which can be serviced in the frame time, hence:

$$\rho_i = \frac{n_i \times t_{sched}}{t_{frame}}$$
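The latency and rate expressions for frame-based static priority can be evaluated together. This sketch assumes the per-frame request counts are listed in priority order, so that the higher-priority set of requestor `i` is simply every entry before index `i`; the function name and the numbers are illustrative.

```python
from fractions import Fraction


def fbsp_lr(n, i, t_sched, t_frame):
    """Latency and rate of requestor i under frame-based static priority.

    n[k] is the number of requests requestor k may have serviced per frame;
    index 0 is assumed to be the highest priority.
    """
    latency = sum(n[:i]) * t_sched            # higher priorities use their full bound
    rate = Fraction(n[i] * t_sched, t_frame)  # share of the frame time
    return latency, rate


# Hypothetical system: three requestors, 4-cycle interval, 32-cycle frame.
n = [2, 3, 1]
lowest = fbsp_lr(n, 2, t_sched=4, t_frame=32)
highest = fbsp_lr(n, 0, t_sched=4, t_frame=32)
```

The highest-priority requestor has zero latency under this expression, while the latency of each lower-priority requestor grows with the budgets of everything above it, independently of its own rate.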

Credit-Controlled Static Priority

While Frame-Based Static Priority goes some way to allow for non-uniform access patterns, it still imposes scheduling within a given frame time. Credit-Controlled Static Priority (CCSP) instead attempts to emulate the characteristics of an LR server, rather than fitting an LR server model to the arbiter [9].

Figure 5. The upper bound on provided ser- (axes: accumulated service vs. time [cycles])
