Measuring Performance - Architecture Model Behavior

3.4 Architecture Model Behavior

3.4.6 Measuring Performance

Our architecture model is abstract and non-functional. The reason is simple: during the exploration process we are interested in performance numbers, not in the data produced. Therefore, the application functionality is modeled by a set of abstract architecture instructions, each of which corresponds to a particular computation delay. Typically, application processes exchange tokens (data types) which must be translated into architecture data types, e.g., bits, bytes, words, double words, etc. As a consequence, a single application symbolic instruction (SI) may have to be translated into a sequence of architecture SIs. Hence, the first step is to define an abstract instruction set.

Abstract Instruction Set. An abstract instruction set (AIS) is a pair $\langle F, S \rangle$ used to specify symbolic instructions, where $F$ is a set of functions whose execution the architecture model needs to simulate and $S$ is a set of specifications of processor and network word sizes. ⋄
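The translation of one application-level SI into a sequence of architecture SIs can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the class `AbstractInstructionSet`, the helper names, the function names in $F$, the component names and the word sizes are all hypothetical and do not come from the model described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractInstructionSet:
    """Hypothetical container for the pair <F, S>."""
    functions: frozenset  # F: names of functions whose execution is simulated
    word_sizes: dict      # S: word size in bits per processor or network


def words_needed(token_bits: int, word_bits: int) -> int:
    """Number of architecture words needed to carry one application token."""
    return (token_bits + word_bits - 1) // word_bits  # integer ceiling


def translate_write(ais: AbstractInstructionSet, token_bits: int, target: str):
    """Expand one application-level write into one architecture SI per word."""
    count = words_needed(token_bits, ais.word_sizes[target])
    return [("write", target, index) for index in range(count)]


# Illustrative values only.
ais = AbstractInstructionSet(
    functions=frozenset({"dct", "quant"}),
    word_sizes={"cpu0": 32, "bus0": 16},
)
sis = translate_write(ais, token_bits=64, target="bus0")
print(len(sis))  # a 64-bit token over a 16-bit bus expands to 4 word-level SIs
```

The point of the sketch is only that the word sizes in $S$ determine how many architecture SIs a single token transfer expands into.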

Apart from architecture SIs, the architecture model instance executes various communication-related delays. These are either fixed assigned delays or implicitly generated delays. The former are given as parameters of the architecture structure components. For example, an execute SI must be specified in terms of a fixed delay. The latter result from the execution (simulation) of the model instance and the various timing behaviors of the components. For example, a delay that appears as a consequence of blocking on either memory or data availability falls into this category.

We use fixed delays for scheduling, process migration, context switching, interrupt, communication connection switching, a single bus word transfer, a communication setup and both data/room checking and signaling. Some of these may be ignored or are not applicable for some platform instances (e.g., hardware accelerators and co-processors do not need scheduling, context switching or other multi-programming functionalities).

CFSM Performance Measurement

The measuring mechanism of performance numbers is closely related to a dynamic TLM model of the component CFSMs and the inter-thread communication channels. We collect and accumulate both the explicit (fixed) and the implicit delays by summing and subtracting the end-to-end time differences for each architecture functionality deemed to be a delay. This produces a ’running’ time of an individual CFSM in a certain state. The CFSM running time is a sum of running times in all its states.

For example, looking at Figure A.9 from Appendix A, the execution of a single read symbolic instruction by an RU thread inside a processor component results in the unrolled sequence of FSM states IDLE ↦ SETUP ↦ STALL ↦ RUN ↦ IDLE, where "↦" defines the total order between the states from left to right. The delay of the SETUP state can be assigned through user configuration, and the delay of the STALL state is implicit because it depends on the conditional synchronization with a separate component (a read router-interface component connected to the processor component). Finally, depending on the user configuration, the RUN state can contribute to both the explicit delay (the so-called "budget" of a read SI) and the implicit delay (storing data coming from the outside in the specific internal read operand FIFO). Each time a read SI arrives, all states are affected according to (1) the assigned delay parameters and (2) implicitly generated delays due to conditional synchronization. As a result of these delays the RU CFSM running time is altered, the running times of the other CFSMs interfacing the RU module are also altered and, finally, the total system simulated time is altered.

Equations 3.1 and 3.2 express the measures more formally; $T_S$ stands for the running time of state $S$, $\mathit{delay}_i$ expresses a fixed-parameter delay ($i$ indexes through all delays of state $S$), $\mathit{update}$ accounts for a collection of implicit delays caused by condition synchronization ($j$ indexes through all updates of state $S$) and $T_M$ stands for the running time of the module $M$ with states $S_k$ ($k$ indexes through all states of the module $M$).

$$T_S = \sum_i \mathit{delay}_i + \sum_j \mathit{update}(j) \qquad (3.1)$$

$$T_M = \sum_k T_{S_k} \qquad (3.2)$$
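As a minimal sketch of Equations 3.1 and 3.2, the following Python fragment accumulates the fixed and implicit delays of the RU states from the read SI example. The function names and all delay values are assumptions chosen purely for illustration; they are not measurements from the model.

```python
def state_running_time(fixed_delays, implicit_updates):
    """Equation 3.1: T_S = sum of fixed delays + sum of implicit updates."""
    return sum(fixed_delays) + sum(implicit_updates)


def module_running_time(states):
    """Equation 3.2: T_M = sum of T_S over all states of the module."""
    return sum(state_running_time(fixed, implicit)
               for fixed, implicit in states.values())


# Hypothetical delays (in abstract time units) for one read SI in an RU module.
# Each state maps to (fixed delays, implicit updates).
ru_states = {
    "IDLE": ([], []),
    "SETUP": ([2], []),   # user-assigned delay
    "STALL": ([], [5]),   # implicit: waiting on the router interface
    "RUN": ([3], [1]),    # explicit budget plus implicit FIFO store
}

t_run = state_running_time(*ru_states["RUN"])
t_ru = module_running_time(ru_states)
print(t_run, t_ru)  # 4 11
```

With these assumed values the RUN state contributes 4 time units and the RU module as a whole runs for 2 + 5 + 4 = 11 time units.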

Component Performance Measurement

However, the running time must be calculated differently for a component than for a module, since modules may run concurrently. The running time of the component cannot be derived by simply summing all running times. Rather, we look for the end-to-end delay, because it gives us the running time of the component. The start time is found as the minimum of the start-up time-stamps of all modules within the component. Each module³⁹ acquires this start-up time at the start of its execution. The end time is found as the maximum of the stop time-stamps of all modules within the component. Each module acquires this stop time when it is blocked and there are no inputs available⁴⁰.

For example, in the processor with compile-time pipelining of symbolic instructions (see Section 3.5.1), each of the CFSM modules (PU, FECTRL, BECTRL and the sets of RUs, WUs and EUs) has its own start-up and stop time-stamps. The processor component running time is the difference between the highest stop and the lowest start-up time-stamp values. If an RU module has the lowest start-up time-stamp and the PU module has the highest stop time-stamp, then these two time-stamps determine the processor component running time.

Equation 3.3 expresses this end-to-end measure more formally; $T_B$ stands for the running time of the component $B$ (which may be a processor or an interface), $\max(\bigcup_i T_{E_i})$ represents the end time of the component $B$ ($i$ indexes through all modules of component $B$, $T_{E_i}$ refers to the stop time stamp of module $i$ and $\max$ extracts the maximal value) and $\min(\bigcup_i T_{O_i})$ represents the start time of the component $B$ ($i$ indexes through all modules of component $B$, $T_{O_i}$ refers to the start time stamp of module $i$ and $\min$ extracts the minimal value).

$$T_B = \max\Bigl(\bigcup_i T_{E_i}\Bigr) - \min\Bigl(\bigcup_i T_{O_i}\Bigr) \qquad (3.3)$$
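The end-to-end measure of Equation 3.3 can be sketched in a few lines of Python. The function name and the time-stamp values are hypothetical; they are picked so that, as in the processor example above, an RU module has the lowest start-up time-stamp and the PU module has the highest stop time-stamp.

```python
def component_running_time(module_stamps):
    """Equation 3.3: T_B = max of stop stamps - min of start-up stamps."""
    starts = [start for start, _ in module_stamps.values()]
    stops = [stop for _, stop in module_stamps.values()]
    return max(stops) - min(starts)


# Hypothetical (start-up, stop) time-stamps of the modules of one processor.
processor = {
    "PU": (4, 120),
    "RU0": (0, 95),
    "EU0": (10, 110),
    "WU0": (20, 115),
}

t_b = component_running_time(processor)
print(t_b)  # 120 - 0 = 120
```

Only the two extreme time-stamps matter here; the stamps of the other modules do not change the result as long as they stay inside that interval.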

The running time of the whole architecture is calculated in the following way: we look for the maximum end time of all components⁴¹ and the minimum start time of all components⁴², and we define the difference between this maximum and minimum as the running time of the architecture $W$ (i.e., $T_W$ in Equation 3.4).

$$T_W = \max\Bigl(\bigcup_i T_{BE_i}\Bigr) - \min\Bigl(\bigcup_i T_{BO_i}\Bigr) \qquad (3.4)$$
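To illustrate why the architecture running time of Equation 3.4 is an end-to-end difference rather than a sum, here is a small Python sketch with two concurrently running components. All names and time-stamps are assumptions made for this example.

```python
def span(stamps):
    """Return (earliest start, latest end) over a collection of (start, end) pairs."""
    stamps = list(stamps)
    return (min(start for start, _ in stamps), max(end for _, end in stamps))


# Hypothetical (start-up, stop) time-stamps of the modules in each component.
architecture = {
    "processor": [(0, 95), (4, 120)],
    "interface": [(30, 150), (35, 140)],
}

# Per component: start time T_BO_i and end time T_BE_i.
component_spans = {name: span(modules) for name, modules in architecture.items()}

# Equation 3.4: T_W = max over T_BE_i - min over T_BO_i.
w_start, w_end = span(component_spans.values())
t_w = w_end - w_start

# Naive sum of the component running times, shown only for comparison.
naive_sum = sum(end - start for start, end in component_spans.values())
print(t_w, naive_sum)  # 150 240
```

Because the two components overlap in time, the naive sum (240) overestimates the architecture running time (150).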

³⁹ A module is described by a single CFSM.

⁴⁰ We named this phenomenon the artificial deadlock.

⁴¹ $T_{BE_i}$ in Equation 3.4, where $i$ indexes through all components in the architecture and E stands for an end time stamp.

⁴² $T_{BO_i}$ in Equation 3.4, where $i$ indexes through all components in the architecture and O stands for a start time stamp.

Architecture Timing Model

Based on the CFSM timing model and the component timing model, we have the following definition of the timing model.

Timing Model. The timing model of the architecture model instance is defined as the 3-tuple $\langle A, D, T \rangle$, where $A$ is the abstract instruction set for that instance, $D$ is a set of assigned delays for each state of a CFSM, for each CFSM in a component, for each component in the architecture, and $T$ is a set of calculated performance measurements ($T_S$, $T_M$, $T_B$, $T_W$) obtained during the system simulation. ⋄
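One possible way to picture the 3-tuple is as a small record in which $A$ and $D$ are fixed before simulation and $T$ is filled in while the simulation runs. The Python sketch below is only an assumed representation: the class `TimingModel`, its field names, the keys and every value are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TimingModel:
    """Hypothetical record for the 3-tuple <A, D, T>."""
    ais: tuple      # A: the pair (F, S) of functions and word sizes
    delays: dict    # D: (component, cfsm, state) -> assigned delay
    measurements: dict = field(default_factory=dict)  # T: filled by simulation

    def assigned_delay(self, component, cfsm, state):
        """Look up the fixed delay assigned to one CFSM state."""
        return self.delays[(component, cfsm, state)]


# Illustrative values only; A and D would come out of calibration.
model = TimingModel(
    ais=(frozenset({"dct", "quant"}), {"cpu0": 32}),
    delays={("cpu0", "RU", "SETUP"): 2, ("cpu0", "RU", "RUN"): 3},
)

# A simulation run would record measurements such as a module running time T_M.
model.measurements[("T_M", "cpu0", "RU")] = 11
print(model.assigned_delay("cpu0", "RU", "SETUP"))  # 2
```

Keying $D$ by component, CFSM and state mirrors the nesting in the definition (each state of a CFSM, each CFSM in a component, each component in the architecture).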

It is worth noting that $A$ and $D$ (the abstract instruction set and the assigned delays) are established through a mapping preprocessing phase called calibration. This is not a trivial task at all, and because of that it is probably impossible to automate. Calibration has a major impact on the accuracy of an architecture model instance, since it drives the configuration of that instance. There is some research work on calibration in DSE [74], but none of its results can be applied "as is" to our architecture model. An illustration of calibration issues is given in the next chapter, Section 4.5.3.

In document Execution platform modeling for system-level architecture performance analysis (Page 77-80)