Cesare Di Mauro
PyCon 2015 – Florence April 2015
Writing an 8086 emulator in Python
The geek experience
Writing your own o.s.:
A few steps for a minimal o.s. example:
- write an 8086 boot loader (MBR for floppy)
- take control of the hardware (clear/set interrupts, etc.)
- write some text on the screen from the “main” code
- loop forever (or “halt” the execution)
The result
Credits to Ben Barbour’s article: http://www.benbarbour.com/write-your-own-hello-world-bootloaderos/
What’s next
Credits to Wikipedia: http://en.wikipedia.org/wiki/HAL_9000
The geeks’ dream!
What’s really next: some graphics!
Credits to Wikipedia: http://en.wikipedia.org/wiki/Mode_13h
A mode 13h (320x200 x 256 colors) example
Beyond 8086: Protected Mode
The old 8086 (Real Mode) was very limited:
- 16-bit code only
- 1MB, segmented address space (64KB segments)
- no paging (MMU) & virtualization
The Protected Mode offers:
- 16/32-bit code or 32/64-bit (with Long Mode)
- 4GB or 256TB (virtual) linear address space
- Paging & virtualization
Using BIOS services: no chance!
Many of them don’t work in protected mode!
Possible solutions:
- Switch back to 8086 (Real) mode
- Use an 8086 Virtual Monitor (vm8086)
Many drawbacks:
- Interrupts served by Real Mode code or… disabled!
- Some 8086 BIOS calls can switch to Protected Mode
- Some 8086 BIOS calls can directly use the hardware
- No vm8086 in Long Mode
Best compromise: 8086 emulator
Pros:
- Works in Protected Mode, Long Mode, and even on different architectures (ARM, MIPS, PowerPC, etc.)
- Simple routing of hardware accesses (“ports” I/O)
- Simple routing of interrupts disable/enable “requests”
- Perfect sandboxing (full control of the emulator)
Cons:
- Slow (not the most important thing; optimizations possible)
- A lot of work (writing and testing)
Why another 8086 emulator?
Existing emulators can be difficult to adapt
Licensing issues (GPL = viral)
Neither weak nor perfect emulation needed: good enough!
Easy to read, maintain, modify/experiment
Reasonable speed for common cases (make them fast!)
Fun!
Planning
8086 emulator prototype + unit test (in Python)
Simple/minimal PC emulator (in Python too)
Final C version (Windows DLL for testing)
Integration into a hobby o.s. (AROS)
AROS: http://www.aros.org/
Why AROS?
A lightweight and small o.s.
Lets you easily experiment with ideas
Derived from/inspired by the Amiga o.s.
Passion!
Fun!
Credits to Eric W. Schwartz: http://aros.sourceforge.net/downloads/kitty/
What AROS needs
Drivers. Drivers. Have I said drivers?
Difficult port: Amiga o.s. APIs/ABI ≠ Windows or Unix/Posix
Primary need: graphics drivers. Few cards supported…
Primary fallback for graphics drivers: handling VESA modes
VESA mode (not modes!)
Problems:
- Only one VESA mode selectable at boot (from GRUB)
- Changing VESA mode requires rebooting the entire o.s.
Solution: calling the VESA BIOS APIs (INT 10h) makes it possible to:
- List available screen modes
- Change current mode
- Set/Get palette colors
- Set screen display inside the framebuffer (virtual screens)
- More…
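As a rough illustration of the calls listed above, the sketch below shows the register values a caller would load before INT 10h. The function numbers and the 0x004F success code follow the VESA BIOS Extension convention. The helper names (set_mode_registers, vbe_succeeded) are hypothetical and not part of the emulator.

```python
# VBE services are selected through AX = 0x4F00 + function number.
VBE_GET_CONTROLLER_INFO = 0x4F00
VBE_GET_MODE_INFO = 0x4F01
VBE_SET_MODE = 0x4F02
VBE_GET_CURRENT_MODE = 0x4F03
VBE_SET_DISPLAY_START = 0x4F07
VBE_PALETTE_DATA = 0x4F09

VBE_SUCCESS = 0x004F         # AL = 0x4F: supported, AH = 0x00: succeeded
LINEAR_FRAMEBUFFER = 0x4000  # bit 14 of BX in a "set mode" request


def set_mode_registers(mode, linear=True):
    """Hypothetical helper: register values for a 'set mode' request."""
    bx = (mode & 0x1FF) | (LINEAR_FRAMEBUFFER if linear else 0)
    return {'AX': VBE_SET_MODE, 'BX': bx}


def vbe_succeeded(ax):
    """True when the value returned in AX reports success."""
    return (ax & 0xFFFF) == VBE_SUCCESS


print(set_mode_registers(0x101))
print(vbe_succeeded(0x004F), vbe_succeeded(0x014F))
```

The driver would write these values into the emulated registers, run the emulator on the INT 10h handler, and read AX back.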
How the trick works
Boot time: AROS driver initializes the 8086 emulator
AROS driver calls emulator’s INT 10h (VESA BIOS services)
Emulator calls AROS driver’s callbacks when needed
AROS driver gets results from emulator call
Emulator overview
Some APIs exposed to set emulator status & callbacks
A couple of APIs to run/stop execution
Execution might hang… by design!
External events & emulation status controlled by caller
It’s an emulator, not a full PC: no hardware emulated!
The 8086 architecture
General purpose Registers
AX = AH | AL
BX = BH | BL
CX = CH | CL
DX = DH | DL
Index Registers
SI  Source Index
DI  Destination Index
SP  Stack Pointer
BP  Base Pointer
Segment Registers
CS  Code Segment
DS  Data Segment
SS  Stack Segment
ES  Extra Segment
Program Counter
IP
Status Register
Flags
All registers are 16-bit
Registers representation
0 AX
1 CX
2 DX
3 BX
4 SP
5 BP
6 SI
7 DI
8 (INTERNAL) ES
9 (INTERNAL) CS
10 (INTERNAL) SS
11 (INTERNAL) DS
12 *NOT USED*
13 *SCRATCH PAD*
14 (INTERNAL) IP
15 (INTERNAL) FLAGS
Array with 16 16-bit values
# General purpose registers
AX, CX, DX, BX, SP, BP, SI, DI = xrange(8)
# Segment registers
# Cannot be written normally! Use proper write_segment function
INTERNAL_ES, INTERNAL_CS, INTERNAL_SS, INTERNAL_DS = xrange(8, 12)
# Special registers
# Cannot be read or written normally!
# Use proper read/write_ip or read/write_flags functions
INTERNAL_TEMP_REG, INTERNAL_IP, INTERNAL_FLAGS = xrange(13, 16)
Registers definition
Registers class
class Registers(object):
    def __init__(self):
        self._pointer = Pointer(16 * 2)

    def __getitem__(self, index):
        return internal_read_word(self._pointer, index * 2)

    def __setitem__(self, index, value):
        internal_write_word(self._pointer, index * 2, value)

    def __len__(self):
        return 16

    def __add__(self, other):
        return self._pointer + other * 2
Accessing registers like in C
registers = Registers()  # The unique/global registers data structure

def registers_as_bytes_pointer():
    return registers + 0

def pointer_to_byte_register(reg):
    return registers_as_bytes_pointer() + register_byte_index_to_offset[reg]

# AL, CL, DL, BL, AH, CH, DH, BH
register_byte_index_to_offset = 0, 2, 4, 6, 1, 3, 5, 7

def inc_reg(reg):
    inc_operand_16(registers + reg)

registers[AX]
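The byte offsets above work because each 16-bit register is stored little-endian: the low byte first, the high byte right after it. A minimal sketch of that overlay, using a plain bytearray instead of the Pointer class (the names storage and byte_offset are made up for this example):

```python
# 16 registers of 16 bits each, stored little-endian in one flat buffer.
storage = bytearray(16 * 2)

AX, CX, DX, BX = range(4)
AL, CL, DL, BL, AH, CH, DH, BH = range(8)
# Low bytes sit at even offsets, high bytes at the following odd offsets.
byte_offset = (0, 2, 4, 6, 1, 3, 5, 7)


def write_word_register(reg, value):
    storage[reg * 2] = value & 0xff
    storage[reg * 2 + 1] = (value >> 8) & 0xff


def read_byte_register(reg):
    return storage[byte_offset[reg]]


write_word_register(AX, 0x1234)
write_word_register(CX, 0xABCD)
print(hex(read_byte_register(AL)), hex(read_byte_register(AH)))  # 0x34 0x12
```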
Registers public interface
def read_register(reg):
    return registers[reg]

def write_register(reg, value):
    # The 0xffff masking can be avoided in C
    registers[reg] = value & 0xffff

def read_byte_register(reg):
    return pointer_to_byte_register(reg)[0]

def write_byte_register(reg, value):
    # The 0xff masking can be avoided in C
    pointer_to_byte_register(reg)[0] = value & 0xff
The 8086 memory model
Credits to Brock University: http://www.cosc.brocku.ca/~bockusd/3p92/Local_Pages/8086_achitecture.htm
Physical address = Segment * 16 + Offset
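A worked example of the formula, as a small sketch. The function name physical_address is made up here. No wrap-around at 1MB is applied, which matches the extra memory the emulator reserves above 1MB.

```python
def physical_address(segment, offset):
    """Linear address of a Real Mode segment:offset pair (no 1MB wrap)."""
    return segment * 16 + offset


print(hex(physical_address(0xFFFF, 0x0000)))  # 0xffff0, the reset address
print(hex(physical_address(0xFFFF, 0xFFFF)))  # 0x10ffef, just above 1MB
# Different segment:offset pairs can alias the same physical byte.
print(physical_address(0x1234, 0x0010) == physical_address(0x1235, 0x0000))
```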
Segments public interface
def read_segment(segment):
    return read_register(segment + INTERNAL_ES)

def write_segment(segment, value):
    write_register(segment + INTERNAL_ES, value)
    cache_segment(segment)
… and private!
def cache_segment(segment):
    segments_addresses[segment] = memory + \
        read_register(segment + INTERNAL_ES) * 16
Pointer class – part 1
class Pointer(object):
    def __init__(self, size=0, buffer=None, position=0):
        self._buffer = buffer or bytearray(size)
        self._position = position

    def __getitem__(self, address):
        return self._buffer[self._position + address]

    def __setitem__(self, address, value):
        self._buffer[self._position + address] = value

    def __add__(self, other):
        return Pointer(buffer=self._buffer, position=self._position + other)

    def __sub__(self, other):
        if isinstance(other, Pointer):
            return self._position - other._position
        else:
            return Pointer(buffer=self._buffer,
                           position=self._position - other)
Pointer class – part 2
class Pointer(object):
    […]
    def __iadd__(self, other):
        self._position += other
        return self

    def __isub__(self, other):
        self._position -= other
        return self

    def __int__(self):
        return self._position
Memory data structures
# 1MB + 128KB to protect the upper memory access
memory = Pointer((1024 + 128) * 1024)
# 64KB I/O space + 2 bytes to protect the upper I/O access
ports = Pointer(64 * 1024 + 2)
# Caches the linear address for every segment
# The current instruction is cached as segment #6 == IP
segments_addresses = [memory + 0 for i in xrange(8)]
Memory public interface
def fill_memory(start_address, length, value):
    for address in xrange(start_address, start_address + length):
        memory[address] = value

def read_byte(address):
    return memory[address]

def write_byte(address, value):
    memory[address] = value

def read_word(address):
    return internal_read_word(memory, address)

def write_word(address, value):
    internal_write_word(memory, address, value)
Accessing words (16-bits data)
def internal_read_word(pointer, address):
    # WARNING: pointer + address are treated as a linear address,
    # so it can cross the 64KB segment limit, like 80286+
    return pointer[address] + (pointer[address + 1] << 8)

def internal_write_word(pointer, address, value):
    # The 0xff masking can be avoided in C
    pointer[address] = value & 0xff
    # WARNING: 64KB segment cross
    # The 0xff masking can be avoided in C
    pointer[address + 1] = (value >> 8) & 0xff
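A quick round trip shows the little-endian layout these two helpers produce. This is a sketch on a plain bytearray rather than the Pointer class; the cross-check uses the standard struct module with the '<H' (little-endian unsigned 16-bit) format.

```python
import struct


def read_word(buffer, address):
    return buffer[address] + (buffer[address + 1] << 8)


def write_word(buffer, address, value):
    buffer[address] = value & 0xff             # low byte first
    buffer[address + 1] = (value >> 8) & 0xff  # high byte second


memory = bytearray(16)
write_word(memory, 4, 0xBEEF)
print(list(memory[4:6]), hex(read_word(memory, 4)))  # [239, 190] 0xbeef
# Same value when decoded by the standard library.
print(struct.unpack_from('<H', memory, 4)[0] == 0xBEEF)
```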
Resetting the emulator
def reset_8086():
    for i in xrange(len(registers)):
        write_register(i, 0)
    write_register(INTERNAL_CS, 0xffff)
    write_register(INTERNAL_FLAGS, 0xf002)
    for i in xrange(4):
        cache_segment(i)  # For all four segments. See below
    segments_addresses[IP] = memory + read_register(INTERNAL_CS) * 16
ES, CS, SS, DS = xrange(4)
TEMP_REG, IP, FLAGS = xrange(5, 8)
CS=0xffff, IP=0x0000 -> first instruction at 0xffff0
8086 Instructions
Instructions may be preceded by one or more prefixes:
- LOCK
- Data segment override (ES:, CS:, SS:, DS:)
- String repeat (REP/REPE, REPNE)
- WAIT (for FPU instructions)
Instruction decoding & execution
By design, LOCK and WAIT prefixes are ignored (NOPs)
Segment and String prefixes must be stored
Prefixes must be cleared after execution
Opcodes are grouped to simplify decoding and execution
MOV to SS register should be atomic with next instruction
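The prefix rules above can be sketched as a small scanner. This is an illustration only: the emulator in these slides handles prefixes inside its main dispatch loop, while this hypothetical scan_prefixes function walks the bytes up front. The prefix byte values are the standard 8086 encodings; mapping REP/REPE to REPEAT_ZERO is an assumption about the naming.

```python
ES, CS, SS, DS = range(4)
NO_REPEAT, REPEAT_ZERO, REPEAT_NOT_ZERO = range(3)

SEGMENT_PREFIXES = {0x26: ES, 0x2E: CS, 0x36: SS, 0x3E: DS}
REPEAT_PREFIXES = {0xF3: REPEAT_ZERO, 0xF2: REPEAT_NOT_ZERO}
IGNORED_PREFIXES = {0xF0, 0x9B}  # LOCK and WAIT, treated as NOPs


def scan_prefixes(code, position=0):
    """Return (segment_override, rep_prefix, position of the opcode)."""
    segment_override, rep_prefix = None, NO_REPEAT
    while True:
        byte = code[position]
        if byte in SEGMENT_PREFIXES:
            segment_override = SEGMENT_PREFIXES[byte]
        elif byte in REPEAT_PREFIXES:
            rep_prefix = REPEAT_PREFIXES[byte]
        elif byte not in IGNORED_PREFIXES:
            return segment_override, rep_prefix, position
        position += 1


# REP ES: MOVSW  ->  F3 26 A5
print(scan_prefixes(bytes([0xF3, 0x26, 0xA5])))  # (0, 1, 2)
```

Because the state is returned per call, nothing is left over for the next instruction, which is the "cleared after execution" requirement.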
The main loop!
def run_8086():
    global running, segment_override, rep_prefix
    running = True
    segment_override = NO_SEGMENT_OVERRIDE
    rep_prefix = NO_REPEAT
    while running:
        macro_opcode, parameter = split_opcode[get_byte()]
        macro_opcode_execute[macro_opcode](parameter)
NO_SEGMENT_OVERRIDE = 0
SEGMENT_OVERRIDE_ENABLED = 8
NO_REPEAT, REPEAT_ZERO, REPEAT_NOT_ZERO = xrange(3)
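A toy version of the same table-driven loop, reduced to two macro opcodes so it runs on its own. Everything here is a simplified stand-in: the sparse dictionary replaces the 256-entry table, the code is passed in as bytes instead of being fetched with get_byte(), and flags are not updated.

```python
AX, CX, DX, BX, SP, BP, SI, DI = range(8)
registers = [0] * 8
running = False

INC_REG, INSTRUCTION = range(2)  # macro opcodes
HLT = 0                          # parameter for INSTRUCTION


def inc_reg(reg):
    registers[reg] = (registers[reg] + 1) & 0xffff  # flags omitted


def instruction(code):
    global running
    if code == HLT:
        running = False


macro_opcode_execute = (inc_reg, instruction)

# 0x40..0x47 are INC r16; 0xF4 is HLT.
split_opcode = {0x40 + reg: (INC_REG, reg) for reg in range(8)}
split_opcode[0xF4] = (INSTRUCTION, HLT)


def run_8086(code):
    global running
    running = True
    position = 0
    while running:
        macro_opcode, parameter = split_opcode[code[position]]
        position += 1
        macro_opcode_execute[macro_opcode](parameter)
    return position


run_8086(bytes([0x40, 0x40, 0x41, 0xF4]))  # INC AX; INC AX; INC CX; HLT
print(registers[AX], registers[CX])  # 2 1
```

Each opcode byte costs one table lookup and one indirect call; the parameter carries whatever distinguishes the opcodes that share a handler.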
(Macro)Grouping opcodes
split_opcode = (
    # 0x
    (BINARY_MEM_REG_8, ADD), (BINARY_MEM_REG_16, ADD),
    (BINARY_REG_MEM_8, ADD), (BINARY_REG_MEM_16, ADD),
    (BINARY_AL_IMM_8, ADD), (BINARY_AX_IMM_16, ADD),
    (PUSH_REG, INTERNAL_ES), (POP_SEG, ES),
    (BINARY_MEM_REG_8, OR), (BINARY_MEM_REG_16, OR),
    (BINARY_REG_MEM_8, OR), (BINARY_REG_MEM_16, OR),
    (BINARY_AL_IMM_8, OR), (BINARY_AX_IMM_16, OR),
    (PUSH_REG, INTERNAL_CS), (POP_REG, INTERNAL_CS),
    # 1x
    (BINARY_MEM_REG_8, ADC), (BINARY_MEM_REG_16, ADC),
    […]
    # 9x
    (XCHG_REG, AX),  # NOP!
    (XCHG_REG, CX), (XCHG_REG, DX), (XCHG_REG, BX), (XCHG_REG, SP),
    (XCHG_REG, BP), (XCHG_REG, SI), (XCHG_REG, DI),
    (INSTRUCTION, CBW), (INSTRUCTION, CWD),
    (INSTRUCTION, CALL_FAR_IMM16_IMM16), (INSTRUCTION, WAIT),
    (INSTRUCTION, PUSHF), (INSTRUCTION, POPF),
    (INSTRUCTION, SAHF), (INSTRUCTION, LAHF),
    # Ax
    (INSTRUCTION_IMM_16, MOV_AL_FROM_DIRECT),
    (INSTRUCTION_IMM_16, MOV_AX_FROM_DIRECT),
    […]
Executing a (macro)instruction
def add_16bit(source, target, result):
    global flags_first_operand, flags_second_operand, \
        flags_result, flags_operation
    flags_first_operand = source
    flags_second_operand = target
    flags_result = source + target
    write_word_to_location(result, flags_result)
    flags_operation = FLAGS_ADD16

binary_16_execute = (
    add_16bit, or_16bit, adc_16bit, sbb_16bit, and_16bit,
    sub_16bit, xor_16bit, cmp_16bit, mov_16bit, test_16bit,
)

def binary_mem_reg_16(code):
    address, register_pointer = decode_modrm_16bit()
    binary_16_execute[code](read_word_from_location(address),
                            read_word_from_location(register_pointer),
                            address)
The Mod/RM byte
Bits:   7 6 | 5 4 3 | 2 1 0
Field:  MOD |  REG  |  R/M
REG -> 8 or 16-bit register
MOD
00 -> Memory, no displacement*
01 -> Memory, 8-bit displacement
10 -> Memory, 16-bit displacement
11 -> 8 or 16-bit register
*MOD=00 with R/M=110 -> [Direct Address]
R/M
000 -> [SI+BX+Displacement]
001 -> [DI+BX+Displacement]
010 -> [SI+BP+Displacement]
011 -> [DI+BP+Displacement]
100 -> [SI+Displacement]
101 -> [DI+Displacement]
110 -> [BP+Displacement]*
111 -> [BX+Displacement]
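The field layout can be checked with a few shifts and masks. A small sketch, assuming 16-bit register operands; split_modrm and describe_modrm are names made up for this example and are not the emulator's decoder.

```python
def split_modrm(modrm):
    """Return the (mod, reg, rm) fields of a ModR/M byte."""
    return modrm >> 6, (modrm >> 3) & 7, modrm & 7


RM_NAMES = ('[SI+BX', '[DI+BX', '[SI+BP', '[DI+BP',
            '[SI', '[DI', '[BP', '[BX')
REG16_NAMES = ('AX', 'CX', 'DX', 'BX', 'SP', 'BP', 'SI', 'DI')


def describe_modrm(modrm):
    """Return (register operand, register-or-memory operand) as text."""
    mod, reg, rm = split_modrm(modrm)
    if mod == 3:
        operand = REG16_NAMES[rm]
    elif mod == 0 and rm == 6:
        operand = '[disp16]'  # direct address, BP is not used
    else:
        operand = RM_NAMES[rm] + ('', '+disp8', '+disp16')[mod] + ']'
    return REG16_NAMES[reg], operand


print(describe_modrm(0x46))  # ('AX', '[BP+disp8]')
print(describe_modrm(0x06))  # ('AX', '[disp16]')
print(describe_modrm(0xD8))  # ('BX', 'AX')
```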
Decoding the ModR/M
def decode_modrm_16bit():
    modrm = get_byte()
    rm = modrm & 7
    mod_ = modrm >> 6
    offset = modrm_offset[rm + segment_override]()
    address = modrm_address_16bit[mod_](offset, rm)
    reg = (modrm >> 3) & 7
    register_pointer = registers + reg
    return address, register_pointer

modrm_address_16bit = (
    mod0_no_displacement, mod1_8bit_displacement,
    mod2_16bit_displacement, mod3_16bit_register,
)

modrm_offset = (
    rm0_bx_si_ds, rm1_bx_di_ds, rm2_bp_si_ss, rm3_bp_di_ss,
    rm4_si_ds, rm5_di_ds, rm6_bp_ss, rm7_bx_ds,
    rm0_bx_si, rm1_bx_di, rm2_bp_si, rm3_bp_di,
    rm4_si, rm5_di, rm6_bp, rm7_bx,
)
The FLAGS register
Bit:   15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
Flag:   1  1  1  1  O  D  I  T  S  Z  0  A  0  P  1  C
O Overflow
D Direction
I Interrupt
T Trace
S Sign
Z Zero
A Auxiliary Carry
P Parity
C Carry
1 RESERVED – Always 1
0 RESERVED – Always 0
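The bit positions in the layout above translate directly into constants. A short sketch: ZERO_FLAG and OVERFLOW_FLAG are used as bit numbers elsewhere in these slides, the remaining constant names are chosen by analogy, and describe_flags is a hypothetical helper.

```python
CARRY_FLAG, PARITY_FLAG, AUXILIARY_FLAG = 0, 2, 4
ZERO_FLAG, SIGN_FLAG, TRACE_FLAG = 6, 7, 8
INTERRUPT_FLAG, DIRECTION_FLAG, OVERFLOW_FLAG = 9, 10, 11

FLAG_LETTERS = {
    OVERFLOW_FLAG: 'O', DIRECTION_FLAG: 'D', INTERRUPT_FLAG: 'I',
    TRACE_FLAG: 'T', SIGN_FLAG: 'S', ZERO_FLAG: 'Z',
    AUXILIARY_FLAG: 'A', PARITY_FLAG: 'P', CARRY_FLAG: 'C',
}


def describe_flags(flags):
    """Letters of the flags that are set, from bit 11 down to bit 0."""
    return ''.join(FLAG_LETTERS[bit]
                   for bit in sorted(FLAG_LETTERS, reverse=True)
                   if flags & (1 << bit))


# 0xf002 is the reset value: only the "always 1" reserved bits are set.
print(repr(describe_flags(0xf002)))  # ''
print(describe_flags(0xf002 | (1 << ZERO_FLAG) | (1 << CARRY_FLAG)))  # ZC
```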
Flags update
Arithmetic operations usually update 6 flags (O, S, Z, A, P, C)
Exceptions: INC/DEC don’t update the Carry!
Logical instructions update S, Z, P; clear O, C; A is undefined
Rotates update only O and C!
Luckily, some flags are often left undefined
Updating flags has a HUGE impact on performance!
Calculating flags: THE nightmare!
def common_read_flags_8bit(auxiliar):
    global flags_operation
    flags_operation = FLAGS_NORMAL
    # The 0xff masking can be avoided in C
    result = flags_result & 0xff
    carry = (flags_result >> 8) & CARRY_MASK
    parity = parity_table[result & 0xff]
    zero = (result == 0) << ZERO_FLAG
    sign = result & SIGN_MASK
    overflow = ((flags_first_operand ^ flags_second_operand ^ result)
                << (OVERFLOW_FLAG - 7)) & OVERFLOW_MASK
    flags = (read_register(INTERNAL_FLAGS) &
             (~(CARRY_MASK | PARITY_MASK | AUXILIARY_MASK | ZERO_MASK |
                SIGN_MASK | OVERFLOW_MASK |
                RESERVED3_MASK | RESERVED5_MASK))) | \
            (carry | parity | auxiliar | zero | sign | overflow)
    write_register(INTERNAL_FLAGS, flags)
    return flags

def read_auxiliary_add():
    return ((flags_first_operand & 0x0f) + \
            (flags_second_operand & 0x0f)) & \
        AUXILIARY_MASK
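The code above uses a parity_table that is not shown on the slides. One plausible way to build it, as a sketch: the 8086 sets the Parity flag when the low byte of the result has an even number of 1 bits, so each entry holds either the flag mask or 0. The auxiliary_for_add function mirrors read_auxiliary_add with explicit arguments; both names and the exact table layout are assumptions.

```python
PARITY_MASK = 1 << 2     # Parity is bit 2 of FLAGS
AUXILIARY_MASK = 1 << 4  # Auxiliary Carry is bit 4 of FLAGS

# Entry i is PARITY_MASK when byte i has an even number of 1 bits, else 0.
parity_table = tuple(
    0 if bin(value).count('1') & 1 else PARITY_MASK
    for value in range(256)
)


def auxiliary_for_add(first, second):
    """Carry out of the low nibble, already placed at bit 4."""
    return ((first & 0x0f) + (second & 0x0f)) & AUXILIARY_MASK


print(parity_table[0x00], parity_table[0x01], parity_table[0x03])  # 4 0 4
print(auxiliary_for_add(0x0f, 0x01), auxiliary_for_add(0x08, 0x07))  # 16 0
```

Storing the mask instead of 0/1 lets the result be OR-ed straight into the flags word, with no shift.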
A “quantum” approach to flags
Operations do NOT calculate flags every time
Operands, result, and “rough” operation saved
Flags status “collapses” only when needed
If few flags needed, calculate ONLY them!
If more flags needed, full calculation made
Rotates always calculate flags
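The idea can be shown in miniature. This is a sketch of lazy flag evaluation in general, not the emulator's code: the LazyFlags class and its evaluations counter are invented for the example, and only Zero and Carry are derived.

```python
class LazyFlags(object):
    """Remembers the last operation; derives a flag only when asked."""

    def __init__(self):
        self.first = self.second = self.result = 0
        self.evaluations = 0  # how many times a flag was really computed

    def record_sub16(self, first, second):
        # Cheap part, done on every SUB/CMP: just store the values.
        self.first, self.second = first, second
        self.result = first - second  # unmasked: the borrow is in bit 16

    def zero(self):
        self.evaluations += 1
        return (self.result & 0xffff) == 0

    def carry(self):
        self.evaluations += 1
        return bool(self.result & 0x10000)


flags = LazyFlags()
for value in range(1000):
    flags.record_sub16(value, 999)  # CMP value, 999: no flag work here
print(flags.zero(), flags.carry(), flags.evaluations)  # True False 2
```

A thousand compares cost a thousand stores, and only the final conditional jump pays for a flag.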
An example: CMP + JNZ
def cmp_16bit(source, target, result):
    global flags_first_operand, flags_second_operand, \
        flags_result, flags_operation
    flags_first_operand = source
    flags_second_operand = target
    flags_result = source - target
    flags_operation = FLAGS_SUB16

jump_short_execute = (
    cc_o, cc_no, cc_c, cc_nc, cc_z, cc_nz, cc_be, cc_nbe, […]

def jump_short(jump_type):
    if jump_short_execute[jump_type](): […]

def cc_nz():
    return not read_zero_for_logical_from_operation[flags_operation]()

read_zero_for_logical_from_operation = ( […]
    read_zero_for_logical_generic_16bit,

def read_zero_for_logical_generic_16bit():
    return not(flags_result & 0xffff)
Testing the beast
Unit tests developed alongside the regular code!
One feature or opcode -> one or multiple tests written
All public APIs tested as well
Exercise as many scenarios as possible
Tests make it safer (enough!) to experiment
Standard lib unittest module used (supported by PTVS)
Tests in action
C 8086 emulator compiled as DLL
DLL imported by Python wrapper (ctypes)
Python callbacks provided to the DLL
Tests transparently run with the regular test suite
Testing the final C version
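How Python callbacks can be handed to C code through ctypes, as a hedged sketch. The callback signatures, the DLL file name, and the exported function names are assumptions for illustration only. No DLL is loaded here, so the wrappers are called directly from Python, which still goes through the C-callable thunks.

```python
import ctypes

# Assumed signatures: input takes a port and returns a byte,
# output takes a port and a byte and returns nothing.
BYTE_INPUT = ctypes.CFUNCTYPE(ctypes.c_uint8, ctypes.c_uint16)
BYTE_OUTPUT = ctypes.CFUNCTYPE(None, ctypes.c_uint16, ctypes.c_uint8)

written = []


def on_byte_input(port):
    return 0xAB if port == 0x3DA else 0xFF


def on_byte_output(port, value):
    written.append((port, value))


# Keep references to the wrappers for as long as the C side may call them.
c_on_byte_input = BYTE_INPUT(on_byte_input)
c_on_byte_output = BYTE_OUTPUT(on_byte_output)

# With a real DLL the wrappers would be handed over, for example:
#   emulator = ctypes.CDLL('emu8086.dll')         # hypothetical file name
#   emulator.set_on_byte_input(c_on_byte_input)   # hypothetical export
print(hex(c_on_byte_input(0x3DA)))  # 0xab
c_on_byte_output(0x3C8, 1)
print(written)  # [(968, 1)]
```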
The callbacks
def set_on_disable_interrupts(handler):
    global on_disable_interrupts
    on_disable_interrupts = handler

def set_on_enable_interrupts(handler):
    global on_enable_interrupts
    on_enable_interrupts = handler

def set_on_byte_input(handler):
    global on_byte_input
    on_byte_input = handler

def set_on_byte_output(handler):
    global on_byte_output
    on_byte_output = handler
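A sketch of how such a handler might be used from the emulator side: CLI and STI do not touch any real interrupt controller, they only forward a request to whatever the caller registered. The execute_cli and execute_sti functions are hypothetical, as is the choice to ignore the request when no handler is set.

```python
on_disable_interrupts = None
on_enable_interrupts = None


def set_on_disable_interrupts(handler):
    global on_disable_interrupts
    on_disable_interrupts = handler


def set_on_enable_interrupts(handler):
    global on_enable_interrupts
    on_enable_interrupts = handler


def execute_cli():
    """Hypothetical CLI handler: forwards the request to the caller."""
    if on_disable_interrupts is not None:
        on_disable_interrupts()


def execute_sti():
    """Hypothetical STI handler: forwards the request to the caller."""
    if on_enable_interrupts is not None:
        on_enable_interrupts()


requests = []
set_on_disable_interrupts(lambda: requests.append('disable'))
set_on_enable_interrupts(lambda: requests.append('enable'))
execute_cli()
execute_sti()
print(requests)  # ['disable', 'enable']
```

The host (here, the AROS driver) decides what a "disable" request really means, which is what keeps the emulator sandboxed.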
What’s missing
Unary operations (NOT, NEG, shifts, etc.)
String operations (REP MOVS, REP STOS, etc.)
Tracing instructions (not needed; easy to implement)
8086 specific behaviors (too much effort; almost zero return)
Much more test coverage
Thanks to
My family
For putting up with me...