Proposed SDF model - State Save Overhead Reduction Techniques for Shared Accelerators in an MPS

In this section we present an SDF model that is an abstraction of the CSDF model proposed in Section 4.2 and that also includes the delay caused by sharing the accelerators. First we derive an SDF model from the CSDF model by means of abstraction. Then we add the delay that results from sharing the accelerators.

The abstraction made in this section is based on the refinement theory presented in [15].

In order to simplify the CSDF model, we combine actors vG, vA and vEG, as well as the edges between these actors, into one actor. We also limit the number of phases to one, which results in an SDF model. In order to perform the abstraction, a schedule is created to visualize the firings. Figure 4.6 shows the schedule of the state-save sequence described in Section 3.3.3. As shown in Figure 4.6, the duration of a restore-save cycle depends on the number of transfers needed to restore the accelerator state, the number of data samples that are processed, the number of transfers needed to save the state and the number of commands. The total number of restore transfers is L, where li is the number of transfers for a partial restore. The number of data samples that are processed is denoted by ηs; this is also called the packet size. The total number of save transfers is J, where ji is the number of transfers for a partial save. The total number of commands is K. The firing durations in Figure 4.6 are all drawn as equal, which is generally not the case in practice. However, later we over-approximate all the different firing durations by one uniform time step, which corresponds to the uniform firing durations in Figure 4.6. This over-approximation is a valid abstraction, based on the-earlier-the-better refinement: this refinement states that a shorter firing duration can never result in a worse schedule.

[Schedule diagram: firings of actors vG, vA and vEG during one restore-save cycle]

Figure 4.6: Typical schedule of the CSDF model

We can now derive an upper bound on the time it takes to perform a restore-save cycle; this bound is given by Equation (4.4). The maximum possible firing duration over all actors is used as a uniform time step, that is, the time step is the worst-case firing duration over every possible phase of actors vG, vA and vEG. With this uniform time step we can create a schedule just like the one in Figure 4.6. The number of uniform time steps needed to perform the restore-save sequence is equal to the number of transfers, which is the sum of L, J, K and ηs. Because a production by vG still needs to pass through vA and vEG, two additional time steps are necessary, which results in the +2 in Equation (4.4).

$$\tau_s \le \hat{\tau}_s = (L + J + K + \eta_s + 2) \cdot \max(\varrho, \rho_A, \rho_B, \rho_W, \rho_R, \rho_{CO}, \rho_{BC}, \varrho') \tag{4.4}$$
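As a minimal sketch of how Equation (4.4) can be evaluated, the following Python function computes the upper bound from the transfer counts and the firing durations. The function name, the parameter names and the numeric values in the usage note are illustrative assumptions; they are not taken from the implementation.

```python
def restore_save_upper_bound(L, J, K, eta_s, firing_durations):
    """Upper bound on the restore-save cycle duration, following Equation (4.4).

    L, J, K          -- total number of restore transfers, save transfers and commands
    eta_s            -- packet size (number of data samples processed)
    firing_durations -- firing durations of all phases of vG, vA and vEG,
                        in any consistent time unit (e.g. clock cycles)
    """
    if min(L, J, K, eta_s) < 0:
        raise ValueError("transfer counts and packet size must be non-negative")
    if not firing_durations:
        raise ValueError("at least one firing duration is required")
    # Uniform time step: the worst-case firing duration over all phases.
    time_step = max(firing_durations)
    # One step per transfer, plus two steps for a token to pass through vA and vEG.
    return (L + J + K + eta_s + 2) * time_step
```

With the hypothetical values L = 4, J = 4, K = 6, ηs = 64 and firing durations of 5, 3, 7 and 2 cycles, the bound is (4 + 4 + 6 + 64 + 2) · 7 = 560 cycles.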

Now we can also include the delay introduced by sharing the accelerator, which is equal to the sum of the processing times of one packet of each of the other streams. This results in Equation (4.2), which is the same formula as used in [2]. The total firing duration can be calculated with Equation (4.3), as is done in [2].
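The sharing delay described above can be sketched as follows. Equations (4.2) and (4.3) are defined earlier in the text and are not reproduced here, so this sketch only covers the sum over the other streams; the function name, the dictionary layout and the stream names are assumptions made for illustration.

```python
def sharing_delay(tau_hat, stream):
    """Waiting time of `stream` caused by the other streams on a shared accelerator.

    tau_hat -- dict mapping a stream name to the upper bound on the time
               needed to process one packet of that stream
    stream  -- the stream whose waiting time is computed
    """
    if stream not in tau_hat:
        raise KeyError(f"unknown stream: {stream!r}")
    # Sum the per-packet processing times of every stream except `stream` itself.
    return sum(t for name, t in tau_hat.items() if name != stream)
```

For example, with hypothetical per-packet bounds `{"s0": 560, "s1": 420, "s2": 300}`, the delay seen by `"s0"` is 420 + 300 = 720 time units.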

The resulting SDF model is also equal to the model presented in [2]; it differs only in the way τ̂s is calculated, and will thus result in a different γs. The SDF model is shown in Figure 4.7.

[Dataflow graph: actors vP, vS and vC with firing durations ρP, γs and ρC; edge labels include ηs, α0, α3, b and d]

Figure 4.7: Proposed SDF model

The methods to determine the optimal packet size that are described in [2] can still be used, because the topology of the proposed SDF model is equal to the topology of the SDF model in [2] and γs is still a function of ηs.

Chapter 5 Evaluation

This chapter presents an evaluation of the work that has been presented in the previous chapter. The first section will describe the test case used for this evaluation. The following section presents the performance of the implementation in terms of speed and its cost in area. The results are compared with the previous implementation. The last section evaluates the proposed dataflow models. We will compare the throughput and buffer size with the previous models.

5.1 The benchmark application

In this section we will describe the application that is used to evaluate the implementation. The same application is used as in [1]. This will allow us to compare the results of the proposed implementation with the results published in [1]. By using the same application a fair comparison can be made.

The application used in [1] is a stereo FM demodulator. This application is chosen because it can be split into 4 parts that perform similar operations on a data stream. This means that the 4 parts can use the same accelerators, and therefore these accelerators can be shared. This makes it a suitable application to demonstrate the sharing of accelerators. Figure 5.1 shows a block diagram of the stereo FM demodulation. The radio signal is split into two streams, one for the left and one for the right audio channel. Two steps are performed for both channels. First, a channel is mixed with the appropriate carrier frequency; after mixing, the signal is low-pass filtered and down-sampled. The second step is FM demodulation, followed by low-pass filtering and down-sampling. So in total there are 4 parts: two steps for each of the two channels. We will not go into detail on how the different blocks work exactly. It is sufficient to know that the operations done in one part can be performed by an FIR filter and a COordinate Rotation DIgital Computer (CORDIC) accelerator. The CORDIC is used to do the mixing or the FM demodulation. The FIR filter is used to perform the low-pass filtering and the down-sampling.

[Block diagram: two FM Demod blocks, an adder and a subtractor, with signals labelled xin, f_left, f_right, y_left and y_right]

Figure 5.1: Stereo FM demodulation block diagram

Figure 5.2 shows the architecture that is used to evaluate the proposed accelerator sharing architecture. There are 4 CPUs. The first is the producer (CPU_P), which sends a generated radio signal to the gateway. The second CPU is used as entry gateway (CPU_GW). The third (CPU_M) performs the subtraction and interleaves the two audio channels. The consumer (CPU_C) saves the produced audio signal. The gateway manages two accelerators, a CORDIC accelerator (ACC_COR) and an FIR accelerator (ACC_FIR). There are multiple RWIs that are used to configure the accelerators.

[System diagram: CPU_P, CPU_GW (GW), ACC_DMA, ACC_COR, ACC_FIR, ACC_EGW (EGW), CPU_M, CPU_C and RWIs, connected by the Nebula Ring Network]

Figure 5.2: High level system overview used for evaluation

The speed of the application is measured by processing 100000 packets and measuring the time this takes. The producer starts a timer and sends 100000 packets to the gateway. The consumer counts the number of packets received and stops the timer after 100000 packets. This results in a measurement in seconds per 100000 packets. These measurements are used to derive a least-squares approximation. The measurements and the approximation are converted to units of 100000 packets per second by taking the multiplicative inverse. Finally, we multiply by the number of samples per packet to obtain the throughput in units of 100000 samples per second. Because the number of samples per packet, also called the packet size, is variable, the results show the throughput for multiple packet sizes.
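The conversion described above can be sketched in Python as follows. The text does not state which model is fitted by the least-squares approximation, so a straight line of run time against packet size is assumed here. All function names are hypothetical, and the result is expressed in plain samples per second rather than in units of 100000 samples per second.

```python
PACKETS_PER_RUN = 100_000  # number of packets processed in one measurement


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def samples_per_second(packet_size, seconds_per_run):
    """Convert a run time (seconds per 100000 packets) into samples per second."""
    # Inverting the run time gives the packet rate ...
    packets_per_second = PACKETS_PER_RUN / seconds_per_run
    # ... and multiplying by the packet size gives the sample rate.
    return packets_per_second * packet_size
```

A possible use is to fit `fit_line(packet_sizes, run_times)` once and then call `samples_per_second(size, a * size + b)` for each packet size to obtain the approximated throughput curve next to the measured points.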

The size of the implementation is taken from the hardware synthesis reports. These reports list the number of used slice registers, Look-Up Tables (LUTs) and LUTs used as RAM (LUTRAMs). These are the basic components that are used to create the architecture. The report shows the statistics of the individual subsystems within the system. We include in our comparison only the subsystems that have been adapted by us.
