A significant contribution is an assembler and a software simulator for the SDLP. The purpose of these tools is to:
1. Verify that the concepts described are practical with regard to implementation;
2. Provide a reference model for further architectural exploration and experimentation;
3. Provide a reference model to aid the development of an FPGA implementation;
4. Allow various statistics to be gathered when executing benchmark code.
The assembler is written in simple object-based C++. The simulator is also written in C++ rather than with a simulation framework such as SystemC [58]. This means that most software and hardware engineers will be able to understand and modify the simulator quickly, without specialist knowledge or experience of such frameworks. The execution statistics that can be gathered include the following:
• Instruction Memory Reads;
• Data Memory Reads;
• Data Memory Writes;
• Number of byte literals (constant values used in an expression);
• Number of null terminators (in instructions and condition sequences);
• Clock cycles.
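As an illustration only, the statistics listed above could be held in a plain counter structure. The following is a minimal sketch; the type name `ExecutionStats`, its field names and the two helper functions are hypothetical and are not taken from the simulator source.

```cpp
#include <cstdint>

// Hypothetical counter set mirroring the statistics listed above.
// All names are illustrative assumptions, not the simulator's own identifiers.
struct ExecutionStats {
    std::uint64_t instructionMemoryReads = 0;
    std::uint64_t dataMemoryReads = 0;
    std::uint64_t dataMemoryWrites = 0;
    std::uint64_t byteLiterals = 0;     // constant values used in expressions
    std::uint64_t nullTerminators = 0;  // in instructions and condition sequences
    std::uint64_t clockCycles = 0;      // estimated, not RTL-accurate
};

// Returns a copy of the statistics with n further Data Memory reads recorded.
constexpr ExecutionStats recordDataReads(ExecutionStats s, std::uint64_t n) {
    s.dataMemoryReads += n;
    return s;
}

// Total traffic to the Data Memory System (reads plus writes).
constexpr std::uint64_t totalDataAccesses(const ExecutionStats& s) {
    return s.dataMemoryReads + s.dataMemoryWrites;
}
```

Keeping the counters in one value type means a benchmark run can be summarised or compared by copying a single object.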
It is not possible to be completely accurate regarding clock cycles, as this will ultimately depend on the RTL (Register Transfer Language) implementation details. This work must be done in successive refinements and draw on the skills of a digital electronics engineer working alongside the processor designer. However, at this stage of development, sensible estimates for clock cycles are adequate for the purpose of architecture exploration and ascertaining the viability of the SDLP.
As discussed previously, the simulator does not currently model the pipeline behaviour. Instead, the fetch, decode and execute cycles are treated as strictly sequential, with no overlap. Modifying the simulator for dynamic pipeline modelling should be considered for future work.
For the current sequential execution model, the pseudocode listed above is used to determine the number of clock cycles required for each fetch, decode and instruction execution. In particular, the execute phases of each instruction can be defined.
A while instruction requires the following number of clock cycles to execute:
• If the condition is false, 2;
• If the condition is true, 2;
• If the condition is true, a further cycle is required for the null processing (see Listing 6.9).
An if instruction requires 2 clock cycles to execute regardless of the condition outcome.
An if-else instruction requires the following number of clock cycles to execute:
• If the condition is false, 2;
• If the condition is true, 2;
• If the condition is true, a further cycle is required for the null processing (for jumping over the else block when the end of the if block has been reached, see Listing 6.9).
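The execute-phase figures above can be summarised in a few lines of C++. This is a sketch under stated assumptions: the enumeration and function names are hypothetical, and the extra null-processing cycle is attributed here to the instruction whose condition was true, which is a simplification of when that cycle actually occurs.

```cpp
// Hypothetical instruction kinds; names are illustrative only.
enum class Instruction { While, If, IfElse };

// Estimated execute-phase cycles for the sequential model described above:
// a base of 2 cycles, plus 1 cycle of null processing when the condition of
// a while or if-else instruction is true. An if always costs 2 cycles.
constexpr unsigned executeCycles(Instruction kind, bool conditionTrue) {
    return 2u + ((kind != Instruction::If && conditionTrue) ? 1u : 0u);
}
```

For example, `executeCycles(Instruction::While, true)` evaluates to 3, whereas `executeCycles(Instruction::If, true)` evaluates to 2.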
Instructions are just one consideration for clock cycle simulation; another is the execution of expressions. The clock cycles for expressions must account for the following:
• Cycles for rvalue addressing;
• Cycles for lvalue addressing;
• Cycles for node processing.
rvalue addressing can be:
• Literals - part of the instruction stream and already decoded;
• Address of - part of the instruction stream and already decoded;
• Variable - a read from the Data Memory System is required;
• Pointer Dereference - 2 reads from the Data Memory System are required.
lvalue addressing can be:
• Ignore - no value is written to the Data Memory System, only internal processor flags are updated;
• Variable - a write to the Data Memory System is required;
• Pointer Dereference - 1 read and 1 write to the Data Memory System are required.
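The addressing cases above map directly onto counts of Data Memory System accesses. The sketch below encodes that mapping; all type and function names are hypothetical, and it counts memory accesses only, since the number of clock cycles per access is not fixed at this point in the text.

```cpp
// Hypothetical addressing kinds; names are illustrative only.
enum class RValue { Literal, AddressOf, Variable, PointerDereference };
enum class LValue { Ignore, Variable, PointerDereference };

struct MemoryAccesses {
    unsigned reads;
    unsigned writes;
};

// Literals and address-of operands are already in the decoded instruction
// stream; a variable needs 1 read; a pointer dereference needs 2 reads.
constexpr MemoryAccesses rvalueAccesses(RValue kind) {
    return kind == RValue::Variable           ? MemoryAccesses{1u, 0u}
         : kind == RValue::PointerDereference ? MemoryAccesses{2u, 0u}
                                              : MemoryAccesses{0u, 0u};
}

// Ignore only updates processor flags; a variable needs 1 write;
// a pointer dereference needs 1 read and 1 write.
constexpr MemoryAccesses lvalueAccesses(LValue kind) {
    return kind == LValue::Variable           ? MemoryAccesses{0u, 1u}
         : kind == LValue::PointerDereference ? MemoryAccesses{1u, 1u}
                                              : MemoryAccesses{0u, 0u};
}

// Accesses for a single-operand assignment: one rvalue stored to one lvalue.
constexpr MemoryAccesses assignmentAccesses(RValue r, LValue l) {
    return MemoryAccesses{rvalueAccesses(r).reads + lvalueAccesses(l).reads,
                          rvalueAccesses(r).writes + lvalueAccesses(l).writes};
}
```

For instance, storing a variable through a pointer (rvalue Variable, lvalue Pointer Dereference) gives 2 reads and 1 write under this mapping.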
Arithmetic and logical operations performed by nodes within the Expression Engine are likely to be implemented using techniques similar to those employed by traditional ALUs. Therefore, the clock cycle values for these operations can be taken from the data sheet of an existing processor. The cycle times for node processing could be taken from the MicroBlaze Processor Reference Guide [5]. The MicroBlaze is a soft-core processor intended for use on platforms with an FPGA. Since the next step in the development of the SDLP may be an FPGA rather than an ASIC (Application Specific Integrated Circuit) implementation, the MicroBlaze appears appropriate.
Table 6.2 illustrates the cycle values for an appropriate selection of MicroBlaze instructions. These are the figures for when area optimisation is enabled.
Instruction              Number of clock cycles
ALU
  and, or, xor           1
  add                    1
  cmp                    1
  bs (barrel shift)      2
  mul                    3
Load/Store
  imm (load immediate)   2
  lw (load word)         2
Branch
  br                     3
  beq                    3
Table 6.2: Cycle Times for MicroBlaze Soft-core Processor, adapted from [5]
If the clock cycle values for the MicroBlaze are to be used to derive values for the SDLP simulator, it is important that they are reasonable and within the range of what can be considered typical. To ensure this, clock cycle values for the ARM7TDMI processor were also considered alongside the MicroBlaze values. The ARM7TDMI core is a popular 32-bit RISC processor for embedded systems requiring low power consumption, small size and high performance. The processor is based on the Von Neumann architecture and has a three-stage pipeline comprising fetch, decode and execute. The data sheet for the ARM7TDMI [6] details the number of cycles required for different types of instructions, and these are shown in Table 6.3. However, these must be interpreted with some caution. The data sheet states that these are the incremental number of cycles required by an instruction, rather than the total number of cycles for which the instruction uses part of the processor [6, p.7]. Therefore, it may be assumed that the table illustrates the number of execute cycles only.
Instruction        Cycle Count      Additional
Data Processing    1S               + 1I for SHIFT(Rs)
                                    + 1S+1N if R15 written
MSR, MRS           1S               -
LDR                1S+1N+1I         + 1S+1N if R15 loaded
STR                2N               -
LDM                nS+1N+1I         + 1S+1N if R15 loaded
STM                (n-1)S+2N        -
SWP                1S+2N+1I         -
B, BL              2S+1N            -
SWI                2S+1N            -
MUL, MLA           1S+mI            -
MUL                1S+mI            -
MLA                1S+(m+1)I        -
MULL               1S+(m+1)I        -
MLAL               1S+(m+2)I        -
CDP                1S+bI            -
LDC, STC           (n-1)S+2N+bI     -
MCR                1N+bI+1C         -
MRC                1S+(b+1)I+1C     -
Table 6.3: Cycle Times for ARM7TDMI, taken from [6, p.8]
The variables used in Table 6.3 are defined as follows:
• n is the number of machine words transferred;
• m is 1 if bits [31:8] of the multiplier operand are all zero or all one;
• m is 2 if bits [31:16] of the multiplier operand are all zero or all one;
• m is 3 if bits [31:24] of the multiplier operand are all zero or all one;
• m is 4 otherwise;
• b is the number of cycles spent in the coprocessor busy-wait loop;
• S is a sequential memory cycle. During this cycle, the processor requests a transfer to or from an address that is either one word or one half word greater than the address used in the preceding cycle;
• N is a non-sequential memory cycle. During this cycle, the processor requests a transfer to or from an address that is unrelated to the address used in the preceding cycle;
• I is an internal cycle. During this cycle, the processor does not require a transfer because it is performing an internal function and no useful prefetching can be performed at the same time;
• C is a coprocessor register transfer memory cycle.
For simplicity, constant values for S, N, I and m can be assumed. A value of 1 is used for S, N and I. A value of 3 is used for m.
With this in mind, it can be seen that the MicroBlaze takes 2 cycles to execute a load, whereas the ARM7TDMI requires 3 (1S+1N+1I). An add requires 1 cycle on the MicroBlaze, the same as on the ARM7TDMI. The MicroBlaze requires 3 cycles to execute a multiply, while the ARM7TDMI requires between 2 and 5 cycles depending on the value of the multiplier. Even though the MicroBlaze is a soft-core RISC processor and the ARM7TDMI is an ASIC processor, these figures are not wildly different.
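The ARM7TDMI figures quoted in this comparison follow from evaluating the Table 6.3 formulas with S = N = I = 1. The sketch below shows the arithmetic; the structure and function names are hypothetical and cover only the three instruction classes used in the comparison.

```cpp
// Cost in clock cycles of each ARM7TDMI cycle type (hypothetical helper type).
struct CycleCosts {
    unsigned S;  // sequential memory cycle
    unsigned N;  // non-sequential memory cycle
    unsigned I;  // internal cycle
};

// S = N = I = 1, as assumed in the text for simplicity.
constexpr CycleCosts kUnitCosts{1u, 1u, 1u};

// Data processing (e.g. add): 1S.
constexpr unsigned armDataProcessing(CycleCosts c) { return c.S; }

// LDR: 1S + 1N + 1I.
constexpr unsigned armLoad(CycleCosts c) { return c.S + c.N + c.I; }

// MUL: 1S + mI, where m depends on the multiplier operand.
constexpr unsigned armMultiply(CycleCosts c, unsigned m) { return c.S + m * c.I; }
```

With these unit costs, `armLoad(kUnitCosts)` is 3 and `armDataProcessing(kUnitCosts)` is 1. `armMultiply(kUnitCosts, m)` is 4 for the assumed m = 3, and spans 2 to 5 as m ranges from 1 to 4.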
It was decided not to use either set of clock cycle values (Table 6.2 or Table 6.3), for reasons discussed in Section 6.4.