Register usage.
Another advantage of implementing an emulator in assembly language is that you control how the target CPU registers are used. A good compiler analyzes the code and usually generates a good register allocation (the assignment of variables to physical registers) for the most common applications. The problem is that a CPU emulator is not a common application. Most of the time the emulator is jumping all around the code (while emulating each instruction), so the compiler is unlikely to produce a good register allocation.
Nor does the compiler know what the code is doing. We do have this knowledge, however, and we can use it to improve the emulation. We know that the variables most likely to be used are the emulated registers. A good approach is therefore to assign physical registers to the emulated registers or, if the number of registers in the target CPU is limited, to the most frequently used ones.
This optimization alone greatly increases the performance of the emulation, because memory accesses are reduced and the emulation of each instruction is simplified.
Some high level languages have directives to assign variables to registers (for example C, with its register keyword).
These directives can be used, but they tend to work against the compiler optimizer and the generated code can be worse.
Another specific benefit of using the registers directly is that some CPUs allow a register to be accessed in different sizes. For example, x86 allows the same register to be accessed as an 8-bit, 16-bit or 32-bit register. This is useful for emulating other CPUs which have the same capability (Z80, M68000) or CPUs whose registers have a smaller data size (for example 8-bit registers).
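In C the closest equivalent to this sub-register access is a union, as the tC.AF.b.h expression in the C macros of Figure 23 suggests. The following is only a sketch: the RegPair type and its field names are assumptions made for illustration, and the byte order test relies on the __BYTE_ORDER__ macro that GCC and Clang predefine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-bit register pair (for example the Z80 AF, BC, DE or HL)
 * whose two halves can also be accessed as 8-bit registers. */
typedef union {
    uint16_t w;              /* the whole 16-bit pair */
    struct {
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
        uint8_t h, l;        /* big endian host: high byte is stored first */
#else
        uint8_t l, h;        /* little endian host (x86): low byte first */
#endif
    } b;
} RegPair;

/* The overlay only works if the two bytes exactly cover the 16-bit word. */
static_assert(sizeof(RegPair) == 2, "RegPair must be exactly 16 bits wide");
```

Writing af.w = 0x1234 and then reading af.b.h gives 0x12 without any shift or mask, which is roughly what AX, AH and AL provide in hardware. The compiler still decides whether the union lives in a register or in memory, which is the limitation discussed above.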
Optimizations
Most of the optimizations are related, as we have already said, to instructions which are similar in the emulated and the target CPU but cannot be used directly from a high level language. These include flag and condition code calculations, data conversions (big endian to little endian, sign extensions, zero extensions, etc.) and complex or special instructions.
We will now see some examples.
We already said that one of the most expensive tasks a CPU emulator performs is the flag (or condition code) calculation. There are two different reasons for this cost: the abstraction level of a high level language, which hides the CPU flags, and target CPUs which do not implement flags or whose implementation is very different from the emulated flags. In the first case an assembly core is useful because it can use the CPU flags directly. In the second case an assembly implementation may be faster, but the gain will be minimal.
Two examples of how the flag calculation can be improved are Z80 and 68K emulation on x86 CPUs.
All three CPUs share more or less the same flags: zero, carry, parity, sign, overflow and similar flags. Most of these flags (for example the carry flag) are hard to calculate in a high level language, but as they work in a very similar way on all three CPUs they can be emulated with a few assembly instructions.
Moreover, the x86 status word itself (which holds the flags) can be used to store, retrieve and restore the emulated flags. We will see more about how helpful this feature is in the binary translation chapter.
The first example is the translation of a Z80 add instruction to x86. Figure 23 shows the C and the assembly implementations. It is easy to see why the assembly version is much faster: the operation is performed only once and only a few instructions are needed to get the correct flags.
The second example (Figure 24) is the same idea applied to a 68K subx instruction.
Another example, in the x86 architecture, of assembly instructions which can be used to improve emulator performance is ‘bswap’ (for 32 bits) and ‘xchg’ (for 16 bits). We already introduced the problem of byte ordering in memory (little endian and big endian) and the cost of the data format conversion. These two instructions can help to perform this conversion. The BSWAP instruction exchanges the high order 16 bits with the low order 16 bits of the register. At the same time the two bytes
Z80 ADD flag calculation in C.
/* Calculate 8-bit add carry */
#define CalcFlagC(value1, value2) tC.flagC = \
    ((((UINT16) value1 + (UINT16) value2) & 0x100)?1:0);
/* Calculate 4-bit add carry */
#define CalcFlagAc(value1, value2) tC.flagAc = \
    ((((value1 & 0x0f) + (value2 & 0x0f)) & 0x10)?1:0);
/* Calculate Z, P and S flags */
#define CalcFlagZPS(value) tC.flagZPS = ZPSTable[value];
/* Macro for Add instructions */
#define Add(value) { \
    CalcFlagC(tC.AF.b.h, value) \
    CalcFlagAc(tC.AF.b.h, value) \
    tC.AF.b.h = tC.AF.b.h + value; \
    CalcFlagZPS(tC.AF.b.h) \
}
Z80 ADD flag calculation in x86 asm (Neil Bradley’s MZ80)
sahf
add al, ch
lahf
seto dl
and ah, 0fbh ; Knock out parity/overflow
shl dl, 2
or ah, dl
and ah, 0fdh ; No N!
Figure 23. Z80 to x86 flag calculation. C and ASM versions.
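To make explicit what the assembly version of Figure 23 obtains almost for free, the sketch below computes the same Z80 flag byte in portable C. It follows the assembly version (P/V holds the signed overflow, N is cleared) rather than the table based macros. The function name z80_add8 and the FLAG_ constants are assumptions made for this illustration. The bit positions are those of the Z80 F register, which for S, Z, H, P/V and C coincide with the byte that the x86 lahf instruction loads into AH.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions in the Z80 F register. */
enum {
    FLAG_C  = 0x01,  /* carry */
    FLAG_N  = 0x02,  /* add/subtract, always cleared by ADD */
    FLAG_PV = 0x04,  /* parity/overflow, overflow for ADD */
    FLAG_H  = 0x10,  /* half carry (carry out of bit 3) */
    FLAG_Z  = 0x40,  /* zero */
    FLAG_S  = 0x80   /* sign */
};

/* Hypothetical helper: returns a + value and stores the resulting
 * flag byte in *flags_out. Every flag needs its own test in C. */
uint8_t z80_add8(uint8_t a, uint8_t value, uint8_t *flags_out)
{
    unsigned wide = (unsigned)a + value;   /* 9-bit result */
    uint8_t result = (uint8_t)wide;
    uint8_t f = 0;                         /* N stays cleared */

    assert(flags_out != NULL);

    if (wide & 0x100)
        f |= FLAG_C;
    if (((a & 0x0F) + (value & 0x0F)) & 0x10)
        f |= FLAG_H;
    /* Signed overflow: both operands have the same sign and the
     * sign of the result differs from it. */
    if (~(a ^ value) & (a ^ result) & 0x80)
        f |= FLAG_PV;
    if (result == 0)
        f |= FLAG_Z;
    if (result & 0x80)
        f |= FLAG_S;

    *flags_out = f;
    return result;
}
```

For example, adding 0x01 to 0x7F gives 0x80 with S, H and P/V set. Five separate tests are needed here, where the assembly version gets the same information from a single add followed by lahf and seto.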
sbb ebx, edx
mov dl, ah /* keep temporary copy of old flags in DL */
lahf
setc byte [x]
seto al
jnz short .z /* if non-zero, cleared */
and dl, 0x40 /* otherwise, unchanged */
and ah, 0xbf /* (get rid of new unwanted Z) */
or ah, dl /* OR in the old, unchanged Z flag */
.z:
Figure 24. m68000 SUBX instruction in x86 (Bart’s Gen68K).
in each 16-bit subword are swapped. The XCHG instruction can be used to swap the low and high order bytes of a 16-bit register. These instructions can therefore be used to perform a fast data format conversion for 32-bit and 16-bit data.
A data format conversion using common operations (and, or, shifts) is much more expensive than using these instructions. On other architectures other specific instructions can be used to speed up the conversion. These instructions show how useful it can be to have access to CPU-specific instructions which are hidden by the high level language abstractions.
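For comparison, a conversion written with the common operations looks roughly like the following in C. This is a minimal sketch and the function names swap16 and swap32 are assumptions made for illustration. Each function expresses with several shifts, masks and ors what a single xchg or bswap does on x86.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value (what xchg al, ah does on AX). */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Reverse the four bytes of a 32-bit value (what bswap does). */
uint32_t swap32(uint32_t v)
{
    return ((v >> 24) & 0x000000FFu) |
           ((v >>  8) & 0x0000FF00u) |
           ((v <<  8) & 0x00FF0000u) |
           ((v << 24) & 0xFF000000u);
}
```

A modern optimizing compiler may recognize this pattern and emit the single instruction anyway, but that is not guaranteed by the language.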
As an example of a ‘complex’ and non-standard instruction we will look at the instructions used in the Z80 and x86 for BCD adjust. BCD is binary coded decimal: each decimal digit is stored in a 4-bit (hexadecimal) digit, and adjustment instructions are used after arithmetic to remove the forbidden digits (‘a’ to ‘f’). This kind of data was used some decades ago for performing decimal calculations. Implementing such an instruction in a high level language is expensive or requires a large precalculated table, whereas a single x86 instruction can do the entire job.
This is just a simple example of how the implementation of complex instructions can be improved by using assembly rather than a high level language when the source and target CPUs have similar instructions.
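To give an idea of the work hidden behind a decimal adjust, the sketch below adds two packed BCD bytes in C and then applies the usual adjustment for the addition case. It is an illustration only: the function name bcd_add is an assumption, it covers only the adjustment after an addition (not after a subtraction), and it expects both inputs to be valid packed BCD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds two packed BCD bytes (two decimal digits
 * each) and returns the packed BCD result. The decimal carry out of
 * the high digit is stored in *carry_out. */
uint8_t bcd_add(uint8_t x, uint8_t y, int *carry_out)
{
    unsigned sum = (unsigned)x + y;                   /* plain binary add */
    int half_carry = (((x & 0x0F) + (y & 0x0F)) & 0x10) != 0;
    int carry = (sum & 0x100) != 0;
    uint8_t a = (uint8_t)sum;
    uint8_t adjust = 0;

    assert(carry_out != NULL);

    /* Low digit is above 9, or it overflowed into the high digit. */
    if (half_carry || (a & 0x0F) > 9)
        adjust |= 0x06;
    /* High digit is above 9, or the binary add carried out of bit 7. */
    if (carry || a > 0x99) {
        adjust |= 0x60;
        carry = 1;
    }

    *carry_out = carry;
    return (uint8_t)(a + adjust);
}
```

With x = 0x19 and y = 0x28 the binary sum is 0x41 and the adjustment turns it into 0x47, the packed form of 19 + 28 = 47. In assembly the adjustment step is a single daa executed right after the add, because the carry and half carry it needs are already in the CPU flags.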
Another advantage of an assembly emulator is that it is easier to control the flow of execution than in a high level language, where most of the control is in the hands of the compiler. In most architectures (mainly x86) function calls are expensive and should be avoided. In assembly the developer has more freedom to code the internal flow of execution of the emulator.
One of the best enhancements which can be made to an interpreter emulator is to speed up the fetch-decode process. We will see other ways to increase the performance of fetch-decode loops, but here we will talk about inlining the fetch-decode at the end of each instruction. This could be implemented in a high level language, but it needs special features (such as labels and goto instructions that can jump to a computed address) which are not available in all compilers.
An assembly program, and therefore an assembly emulator, gives the programmer more freedom to write the code in whatever way improves performance. We have already seen many examples. The fetch-decode loop we discussed in the first sections of this chapter is a good target for this flexibility. An assembly coded fetch-decode loop is easy to implement and helps to improve performance. You do not have to trust the compiler implementation of switch statements, and it is a clearer implementation than function tables or label based jumps. In assembly it is just a load from memory (the table with the addresses of the emulated instruction handlers) and an indirect jump.
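For comparison, the function table alternative mentioned above can be sketched in C as follows. The toy CPU, its four opcodes and all the names (Cpu, OpHandler, op_table, run) are assumptions made for this illustration and do not belong to any real emulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy CPU: one accumulator and a program counter. */
typedef struct {
    uint8_t a;
    uint16_t pc;
    int halted;
    const uint8_t *mem;   /* emulated memory holding the program */
} Cpu;

typedef void (*OpHandler)(Cpu *);

static void op_nop(Cpu *c)   { (void)c; }
static void op_inc_a(Cpu *c) { c->a++; }
static void op_dec_a(Cpu *c) { c->a--; }
static void op_halt(Cpu *c)  { c->halted = 1; }

/* One entry per opcode, like the table of handler addresses in assembly. */
static OpHandler op_table[256];

static void init_table(void)
{
    for (int i = 0; i < 256; i++)
        op_table[i] = op_halt;          /* unknown opcodes stop the CPU */
    op_table[0x00] = op_nop;
    op_table[0x01] = op_inc_a;
    op_table[0x02] = op_dec_a;
}

/* Main loop: fetch the opcode, load the handler address from the table,
 * make an indirect call, then come back here for the next opcode. */
void run(Cpu *c)
{
    assert(c != NULL && c->mem != NULL);
    init_table();
    while (!c->halted) {
        uint8_t opcode = c->mem[c->pc++];
        op_table[opcode](c);
    }
}
```

Every emulated instruction pays for a call, a return and the jump back to the top of the loop. In assembly the same table can be used with a plain indirect jump and no call overhead.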
But this is just the basic way it can be implemented. A trained assembly programmer can find many others which may be better suited to a specific emulator. In [25] “emu-mech” we can find almost a dozen different ways of implementing it. We will talk about some of them (for example threaded interpreters) in the next section.
mov edi, [cyclesRemaining]
xor edx, edx
sub edi, byte 15
js near noMoreExec
mov dl, byte [esi] ; Get our next instruction
inc esi ; Increment PC
jmp dword [z80regular+edx*4]
(from Neil Bradley’s MZ80 Z80 emulator.)
Figure 25. Timing update and instruction decode inlined at the end of the instruction implementation.
One of the most useful implementations is to inline the fetch-decode at the end of each emulated instruction. This avoids the jump back to the main loop. Each emulated instruction has its own implementation and, after it has been executed, reads the next opcode, looks it up in the decode table and jumps to the code emulating the next instruction. The number of instructions needed for the fetch and decode process can be reduced if some assumptions are made. For example, the size of each instruction emulation could be fixed so that the handler address can be computed from the opcode, avoiding the read from a table.
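The same structure can be approximated in C with the labels-as-values extension of GNU C (supported by GCC and Clang, but not part of standard C), which is the kind of special feature mentioned earlier. The sketch below is an illustration under that assumption. The toy instruction set with four opcodes and the names run_threaded, dispatch and NEXT are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Runs a toy program and returns the final accumulator value.
 * Hypothetical opcodes: 0 = nop, 1 = inc a, 2 = dec a, 3 = halt. */
uint8_t run_threaded(const uint8_t *program)
{
    /* Decode table: the address of the code emulating each opcode. */
    static void *const dispatch[4] = { &&op_nop, &&op_inc, &&op_dec, &&op_halt };
    const uint8_t *pc = program;
    uint8_t a = 0;

    assert(program != NULL);

/* Fetch-decode inlined at the end of every instruction: read the next
 * opcode, look it up in the table and jump straight to its handler.
 * The mask keeps the index inside the four-entry table. */
#define NEXT() goto *dispatch[*pc++ & 3]

    NEXT();

op_nop:
    NEXT();
op_inc:
    a++;
    NEXT();
op_dec:
    a--;
    NEXT();
op_halt:
    return a;

#undef NEXT
}
```

There is no central loop: each handler ends with its own copy of the fetch-decode, so the only jump executed per emulated instruction is the indirect jump to the next handler.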