The fetch-decode loop - Study of the techniques for emulation programming by Victor Moya del Barrio

The fetch-decode loop is the main loop of the CPU emulator. A real CPU works as follows: it reads one or more bytes from memory at the position pointed to by a special register, commonly called the PC or Program Counter. Those bytes are then used to decide which instruction the CPU must execute. Once the CPU has decided what to do, it executes the instruction and reads the next byte or group of bytes.

The byte or group of bytes that defines a single instruction in a CPU is usually called the opcode, or operation code. Sometimes the term opcode is used for all the bytes that form the instruction, including offsets, extended fields and other literal data passed inline with the instruction. In other cases opcode means only the part that actually identifies the instruction to be executed.

i8080 PSW (F):

    S  Z  X  Ac  X  P  X  C

a) Flags stored as a single state word

    i8080Reg AF;    /* Register AF (Accumulator + Flags) */

b) Flags stored as different fields

    UINT32 flagC;   /* Carry flag */
    UINT32 flagZPS; /* Zero, Parity and Negative flags */
    UINT32 flagAc;  /* Auxiliary Carry flag (4-bit carry) */

Figure 8. Examples of CPU flags.

As we have said, when a CPU executes an instruction it goes through a number of phases, usually called the fetch phase, the decode phase and the execution phase. This is the basic scheme; a more complicated CPU has more phases, and different ones. In the fetch phase the CPU reads the code data from the memory location pointed to by the special register PC and stores that data, the opcode, in an internal register.

Fetching can be thought of as reading code, as opposed to a plain read, which reads data. In the decode phase the CPU uses the fetched data to decide which actions (which instruction) must be executed and sends the proper signals to the functional units. In the execution phase the CPU carries out the actions required by the opcode it read. Our CPU emulator will work in a similar manner.

The fetch-decode loop is the part of the CPU emulator that implements the CPU fetch and decode phases. The fetch is implemented by simply reading a given number of bytes from the emulated memory at the position pointed to by the emulated PC. Those bytes, which we will call the opcode, are then decoded.

Our goal is to perform the decoding as fast as possible, because it is a very frequent task (every instruction must be decoded!). Decoding means deciding which function must be executed for the given opcode.

a) i8080 instruction (1 byte).

    1000 0XXX               ADD REG
    27                      DAA
    CC PP QQ                CZ LABEL

b) Z80 instruction (extended opcode).

    ED 43 PP QQ             LD (addr), BC
    DD 0111 0sss YY         LD (IX + disp), reg
    FD 0111 0sss YY         LD (IY + disp), reg

c) m68000 instruction.

    5 bbb1 0 0ddd           SUBQ.B data3, Dn
    5 bbb1 00ff ffff        SUBQ.B data3, dadr

d) MIPS instruction.

    ADD rd, rs, rt
    field width (bits):  6        5    5    5    5      6
    contents:            SPECIAL  rs   rt   rd   00000  ADD
                         000000                         100000

    ADDI rt, rs, immediate
    field width (bits):  6        5    5    16
    contents:            ADDI     rs   rt   Immediate
                         001000

Figure 9. Opcode examples.

The number of bytes to read on each fetch depends on the processor. Some CPUs have fixed-length opcodes, so the size in bytes of an opcode is always the same. That eases the task of fetching, because you already know how many bytes must be read for each instruction. This is usually the case with RISC CPUs.
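As a minimal sketch of the fixed-length case, assume a CPU whose instructions are always one 32-bit big-endian word (as on a big-endian MIPS). The names `memory`, `pc`, `MEM_SIZE` and `fetch32` are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_SIZE 65536u            /* hypothetical size of the emulated memory */

static uint8_t  memory[MEM_SIZE];  /* emulated memory (assumption) */
static uint32_t pc;                /* emulated program counter */

/* Fetch one fixed-length 32-bit instruction word (big-endian) and
 * advance the PC by the constant instruction size. */
static uint32_t fetch32(void)
{
    assert(pc <= MEM_SIZE - 4u);   /* stay inside the emulated memory */
    uint32_t word = ((uint32_t)memory[pc]     << 24) |
                    ((uint32_t)memory[pc + 1] << 16) |
                    ((uint32_t)memory[pc + 2] <<  8) |
                     (uint32_t)memory[pc + 3];
    pc += 4;                       /* every instruction is 4 bytes long */
    return word;
}
```

Because the size never changes, the whole fetch fits in the main loop and no instruction handler needs to touch the PC except for control flow.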

Other CPUs, mostly old CISC ones, have variable-length opcodes. In this case the size in bytes of each opcode depends on the type of instruction: some need more bytes and others fewer.

That means the fetching must be done in two or more steps. First, the emulator reads the number of bytes needed to decode a simple instruction or to differentiate between the different kinds of instructions.

Instructions that need more bytes to be fully decoded or executed perform additional memory reads from the address in the PC. This additional fetching is implemented in the same functions that emulate those extended-size instructions.

Each time a byte is fetched from memory (code is read), the PC, which points to the next byte of code, must be updated. In the first step of the fetching, in the main loop, the PC is incremented by a fixed amount (the basic opcode size). In CPUs with variable-length instructions, the instruction implementation must itself update the PC if it performs more code reads. The PC is also affected by control flow instructions (jump, branch, call and return instructions) and by hardware interrupts and CPU exceptions.
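The two-step fetch for variable-length instructions can be sketched as follows. This is an illustration under stated assumptions, not the author's code: it uses three real i8080 encodings (NOP = 00, MVI A = 3E, JMP = C3), ignores flags and cycle counts, and the names `fetch8`, `fetch16`, `step` and `example` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint8_t  memory[65536];  /* emulated 64 KB address space */
static uint16_t pc;             /* emulated program counter */
static uint8_t  reg_a;          /* emulated accumulator */

/* Read one code byte and advance the PC past it. */
static uint8_t fetch8(void) { return memory[pc++]; }

/* Read a 16-bit little-endian operand (two more code bytes). */
static uint16_t fetch16(void)
{
    uint16_t lo = fetch8();
    uint16_t hi = fetch8();
    return (uint16_t)(lo | (hi << 8));
}

/* Execute one instruction; returns 0 for an opcode this sketch does not know. */
static int step(void)
{
    uint8_t opcode = fetch8();  /* first step: fixed-size fetch in the main loop */
    switch (opcode) {
    case 0x00:                  /* NOP: 1 byte, nothing more to fetch */
        return 1;
    case 0x3E:                  /* MVI A, d8: 2 bytes, handler fetches the literal */
        reg_a = fetch8();
        return 1;
    case 0xC3:                  /* JMP addr: 3 bytes, control flow overwrites the PC */
        pc = fetch16();
        return 1;
    default:
        return 0;
    }
}

/* Usage example: MVI A,0x42 followed by JMP 0x0010. */
static void example(void)
{
    static const uint8_t program[] = { 0x3E, 0x42, 0xC3, 0x10, 0x00 };
    memcpy(memory, program, sizeof program);
    pc = 0;
    assert(step() && reg_a == 0x42 && pc == 2);
    assert(step() && pc == 0x0010);
}
```

Only the handlers of the longer instructions perform the extra code reads, so the one-byte instructions pay nothing for them.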

The decoding can be performed in different ways. It depends on how the instruction is encoded in the opcode of the emulated CPU and also on the capabilities of the language we are using. RISC CPUs, for example, are easier to decode because they have fixed opcode formats with separate fields that define the kind of instruction, the registers to use, literals and so on. As has been said, they also use a fixed-length opcode format.

CISC instructions, on the other hand, do not usually have standardized fields in the opcode for each piece of information. They also come in various sizes, with extensions of the normal opcodes, prefixes and other irregularities.

On RISC CPUs, or on any CPU where the bits of the opcode can be grouped into fields carrying different information (opcode, registers, literals, special opcodes), the first step can be to separate the fields of the opcode, using macros or by copying them to different variables. Then use the field that determines the instruction type to select which instruction to execute. The other fields can be used for further decoding (if that instruction type has several final instructions) or by the instruction implementation (where to get the data, where to store the data, data sizes, etc.).

    while (executed_cycles < cycles_to_execute) {
        opcode = memory[PC++];                    /* fetch */
        instruction = decode(opcode);             /* decode */
        executed_cycles += execute(instruction);  /* execute; assumed to return the cycles consumed */
    }

Figure 10. CPU core main loop.

On CISC machines, where it is hard to isolate the different fields or there is no standardized encoding format for all the instructions, the decoding is usually done on the whole opcode read.

    OPCODE  RD  RS  RT

    OP RD, RS, RT

RISC kind instruction.

    /* fetch */
    opcode = fetch(PC);

    /* first decode */
    opcfield = OPCODE(opcode);
    destreg = DESTREG(opcode);
    sreg = SREG(opcode);
    treg = TREG(opcode);

    /* last decode, execute */
    execute(opcfield, destreg, sreg, treg);

Figure 11. RISC instruction decoding (fixed length).
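The field macros named in Figure 11 could look like the following for the MIPS formats of Figure 9. This is a sketch under assumptions: the bit positions are the standard MIPS ones, `FUNCT`, `IMM16`, `regs` and `example` are names added for the illustration, and the overflow trap of ADD/ADDI and the hard-wired zero register are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Field extraction for a 32-bit MIPS instruction word. */
#define OPCODE(w)  (((w) >> 26) & 0x3Fu)  /* bits 31..26: major opcode */
#define SREG(w)    (((w) >> 21) & 0x1Fu)  /* bits 25..21: rs */
#define TREG(w)    (((w) >> 16) & 0x1Fu)  /* bits 20..16: rt */
#define DESTREG(w) (((w) >> 11) & 0x1Fu)  /* bits 15..11: rd */
#define FUNCT(w)   ((w) & 0x3Fu)          /* bits 5..0: function code */
/* bits 15..0 as a sign-extended 32-bit value */
#define IMM16(w)   ((int32_t)(((w) & 0xFFFFu) ^ 0x8000u) - 0x8000)

static uint32_t regs[32];                 /* emulated general registers */

/* Decode and execute one word; returns 0 if the sketch does not know it. */
static int execute(uint32_t w)
{
    switch (OPCODE(w)) {
    case 0x00:                            /* SPECIAL: decode further by FUNCT */
        if (FUNCT(w) == 0x20) {           /* ADD rd, rs, rt */
            regs[DESTREG(w)] = regs[SREG(w)] + regs[TREG(w)];
            return 1;
        }
        return 0;
    case 0x08:                            /* ADDI rt, rs, immediate */
        regs[TREG(w)] = regs[SREG(w)] + (uint32_t)IMM16(w);
        return 1;
    default:
        return 0;
    }
}

/* Usage example with two hand-assembled words. */
static void example(void)
{
    regs[9] = 3;
    regs[10] = 4;
    assert(execute(0x012A4020u) && regs[8] == 7);  /* add  $8, $9, $10 */
    assert(execute(0x2128FFFFu) && regs[8] == 2);  /* addi $8, $9, -1  */
}
```

The major opcode selects the instruction type first, and the remaining fields are used either for further decoding (FUNCT) or directly by the implementation (register numbers, immediate).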

    /* fetch */
    opcode = fetch(PC);

    /* decode and execute */
    execute(opcode);

Figure 12. CISC instruction decoding (variable length).

The process of determining which function or code must be executed to emulate each instruction can be implemented in different ways. The first we could propose is a chain of if statements, but it is very inefficient: decoding the last opcode in the chain takes N times as long as decoding the first one, where N is the number of opcodes. A better implementation is to use indexed or indirect jumps.

In a high-level language an indirect jump is not straightforward to implement. The first approach is to use a switch statement, with a case for each possible opcode value, and hope that the compiler is intelligent enough to translate it into an indirect jump through a jump table rather than a chain of ifs.

    if (opcode == OPC1) {
        /* emulate the opcode 1 */
    }
    else if (opcode == OPC2) {
        /* emulate the opcode 2 */
    }
    ...
    else if (opcode == OPCi) {
        /* emulate the opcode i */
        if (subopc == OPCiSUBOP1) {
            /* emulate subopcode 1 of opcode i */
        }
        ...
        else if (subopc == OPCiSUBOPN) {
            /* emulate subopcode N of opcode i */
        }
    }
    ...
    else if (opcode == OPCN) {
        /* emulate the opcode N */
    } else {
        /* Wrong opcode. Illegal instruction. */
    }

Mean number of condition tests for decoding an instruction: N/2.

Figure 13. Decoding using conditional (if) statements.

If we know the compiler is not doing its job properly, the next step is to implement the jump table by hand as an array or table of functions. The table is indexed by the opcode value and each entry holds a pointer to a function. The pointer is retrieved and a function call is made through it.
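A hand-built function table might be sketched like this. The `Cpu` structure, the handler names and `init_optable` are hypothetical; the three opcodes are real i8080 encodings (NOP = 00, INR A = 3C, HLT = 76), with flag updates omitted to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  a;       /* accumulator */
    uint16_t pc;      /* program counter */
    int      halted;  /* set when the CPU must stop */
} Cpu;

typedef void (*OpHandler)(Cpu *cpu);

static uint8_t memory[65536];

static void op_nop(Cpu *cpu)     { (void)cpu; }
static void op_inr_a(Cpu *cpu)   { cpu->a++; }        /* 0x3C INR A (flags omitted) */
static void op_hlt(Cpu *cpu)     { cpu->halted = 1; } /* 0x76 HLT */
static void op_illegal(Cpu *cpu) { cpu->halted = 1; } /* unknown opcode: stop */

static OpHandler optable[256];  /* one entry per possible opcode byte */

static void init_optable(void)
{
    for (size_t i = 0; i < 256; i++)
        optable[i] = op_illegal;  /* default for every unassigned opcode */
    optable[0x00] = op_nop;
    optable[0x3C] = op_inr_a;
    optable[0x76] = op_hlt;
}

static void run(Cpu *cpu)
{
    while (!cpu->halted) {
        uint8_t opcode = memory[cpu->pc++];  /* fetch */
        optable[opcode](cpu);                /* decode + dispatch in one indexed call */
    }
}

/* Usage example: INR A, INR A, HLT. */
static void example(void)
{
    Cpu cpu = { 0, 0, 0 };
    memory[0] = 0x3C;
    memory[1] = 0x3C;
    memory[2] = 0x76;
    init_optable();
    run(&cpu);
    assert(cpu.a == 2 && cpu.pc == 3);
}
```

Filling every entry with a default handler first means an illegal opcode never calls through a null pointer.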

Yet another option is to use labels and the goto statement in those languages that permit it (for example some C implementations like GNU C).
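In GNU C this relies on the "labels as values" extension (`&&label` and `goto *ptr`), which is not part of ISO C. The sketch below uses a made-up four-opcode bytecode (0 = halt, 1 = increment, 2 = decrement, 3 = double) purely for illustration; the names `run`, `NEXT` and `example` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Interpreter for a toy bytecode using GNU C computed goto.
 * Compiles with gcc and clang; other compilers may reject it. */
static int run(const uint8_t *code)
{
    /* Table of label addresses, indexed by opcode. */
    static void *const dispatch[4] = { &&op_halt, &&op_inc, &&op_dec, &&op_dbl };
    const uint8_t *pc = code;
    int acc = 0;

/* Fetch the next opcode and jump straight to its label.
 * The mask keeps the index inside the 4-entry table. */
#define NEXT() goto *dispatch[*pc++ & 3]

    NEXT();
op_inc:
    acc++;
    NEXT();
op_dec:
    acc--;
    NEXT();
op_dbl:
    acc *= 2;
    NEXT();
op_halt:
    return acc;

#undef NEXT
}

/* Usage example: inc, inc, double, dec, halt gives ((0+1+1)*2)-1 = 3. */
static void example(void)
{
    static const uint8_t program[] = { 1, 1, 3, 2, 0 };
    assert(run(program) == 3);
}
```

Each instruction body ends with its own indirect jump, so there is no return to a central loop and no function call per instruction.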

The easiest implementation is the switch-case statement, but it may be the slowest because of problems with the code generated by the compiler. The function table can be faster but can suffer from procedure call overhead. The label implementation can hurt the optimizations performed by the compiler, because the program may jump to any place. These advantages and problems should be taken into account when implementing the decoding.

Often, although the decoding could be implemented in several steps (for example when there is an opcode field and a function field), it is better to perform it in a single step. The CPU emulation must be as fast as possible, so we can trade memory space for time and use large function tables or switch statements with a function for each possible opcode. For example, in a CPU with 16-bit opcodes such as the Motorola 68K, the best option could be to use the full 16-bit opcode as the function table index. That would mean 64K table entries and some hundreds or thousands of functions.

The trade-off is usually a good one, but it must be taken into account that such large tables and so many functions can cause problems with cache usage.
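One way such a 64K-entry table could be filled is to expand a short list of (mask, match) bit patterns once at start-up. The sketch below is an assumption-laden illustration: the `Pattern` structure, handler names and return codes are hypothetical, only the 68K ADDQ/SUBQ patterns are covered, and invalid addressing modes are not filtered out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*OpHandler)(uint16_t opcode);

/* Dummy handlers that return an identifier instead of emulating anything. */
static int op_illegal(uint16_t op) { (void)op; return -1; }
static int op_addq(uint16_t op)    { (void)op; return 1; }
static int op_subq(uint16_t op)    { (void)op; return 2; }

typedef struct {
    uint16_t  mask;     /* bits that must be compared */
    uint16_t  match;    /* required value of those bits */
    OpHandler handler;
} Pattern;

/* Later entries override earlier ones, so more specific patterns go last. */
static const Pattern patterns[] = {
    { 0xF100, 0x5000, op_addq },     /* 0101 ddd0 ss eeeeee */
    { 0xF100, 0x5100, op_subq },     /* 0101 ddd1 ss eeeeee */
    { 0xF0C0, 0x50C0, op_illegal },  /* size field 11 is not ADDQ/SUBQ
                                        (Scc/DBcc on a real 68K, not emulated here) */
};

static OpHandler optable[65536];     /* one entry per 16-bit opcode */

static void build_optable(void)
{
    for (uint32_t op = 0; op < 65536; op++) {
        optable[op] = op_illegal;
        for (size_t i = 0; i < sizeof patterns / sizeof patterns[0]; i++)
            if ((op & patterns[i].mask) == patterns[i].match)
                optable[op] = patterns[i].handler;
    }
}

/* Single-step decode: the full opcode is the table index. */
static int dispatch(uint16_t opcode) { return optable[opcode](opcode); }

/* Usage example. */
static void example(void)
{
    build_optable();
    assert(dispatch(0x5300) == 2);   /* SUBQ.B #1, D0 */
    assert(dispatch(0x5240) == 1);   /* ADDQ.W #1, D0 */
    assert(dispatch(0x50C0) == -1);  /* size 11: not handled by this sketch */
}
```

The pattern matching cost is paid once when the table is built; at run time each instruction costs a single indexed load and an indirect call. With 8-byte pointers the table alone occupies 512 KB, which is where the cache concern comes from.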

We will say more about decoding and dispatching instructions, because it is one of the points where much of the effort to optimize CPU emulators is spent. Threaded code and threaded interpreters are used to reduce the decoding overhead.

    switch (opcode) {
    case OPC1:
        /* emulate opcode 1 */
        break;
    case OPC2:
        /* emulate opcode 2 */
        if (subopc == OPC2SUBOPC1) {
        } ...
        break;
    case OPC3:
        /* emulate opcode 3 */
        break;
    ...
    case OPCN:
        /* emulate opcode N */
        break;
    default:
        /* Wrong opcode. Illegal instruction. */
        break;
    }

With a good compiler the decoding is performed through an access to a jump table. Just one memory access is needed instead of a chain of tests.

Figure 14. Decoding using switch-case statement.
