Writing an 8086 emulator in Python


(1)

Cesare Di Mauro

PyCon 2015 – Florence April 2015

Writing an 8086 emulator in Python

(2)

The geek experience

Writing your own o.s.:

A few steps for a minimal o.s. example:

- write an 8086 boot loader (MBR for floppy)

- take control of the hardware (clear/set interrupts, etc.)
- write some text on the screen from the “main” code
- loop forever (or “halt” the execution)

(3)

The result

Credits to Ben Barbour’s article: http://www.benbarbour.com/write-your-own-hello-world-bootloaderos/

(4)

What’s next

Credits to Wikipedia: http://en.wikipedia.org/wiki/HAL_9000

The geeks’ dream!

(5)

What’s really next: some graphics!

Credits to Wikipedia: http://en.wikipedia.org/wiki/Mode_13h

A mode 13h (320x200 x 256 colors) example

(6)

Beyond 8086: Protected Mode

The old 8086 (Real Mode) was very limited:

- 16-bit code only

- 1MB, segmented address space (64KB segments)
- no paging (MMU) & virtualization

The Protected Mode offers:

- 16/32-bit code or 32/64-bit (with Long Mode)
- 4GB or 256TB (virtual) linear address space
- Paging & virtualization

(7)

Using BIOS services: no chance!

Many of them don’t work in protected mode!

Possible solutions:

- Switch back to 8086 (Real) mode

- Use an 8086 Virtual Monitor (vm8086)

Many drawbacks:

- Interrupts served by Real Mode code or… disabled!

- Some 8086 BIOS calls can switch to Protected Mode
- Some 8086 BIOS calls can directly use the hardware
- No vm8086 in Long Mode

(8)

Best compromise: 8086 emulator

Pros:

- Works in Protected Mode, Long Mode, and even on different architectures (ARM, MIPS, PowerPC, etc.)
- Simple routing of hardware accesses (“ports” I/O)

- Simple routing of interrupts disable/enable “requests”

- Perfect sandboxing (full control of the emulator)

Cons:

- Slow (not the most important thing; optimizations possible)
- A lot of work (writing and testing)

(9)

Why another 8086 emulator?

Existing emulators can be difficult to adapt

Licensing issues (GPL = viral)

Neither weak nor perfect emulation needed: good enough!

Easy to read, maintain, modify/experiment

Reasonable speed for common cases (make them fast!)

Fun! 

(10)

Planning

8086 emulator prototype + unit test (in Python)

Simple/minimal PC emulator (in Python too)

Final C version (Windows DLL for testing)

Integration into a hobby o.s. (AROS)

AROS: http://www.aros.org/

(11)

Why AROS?

A lightweight and small o.s.

Lets you easily experiment with ideas

Derived from/inspired by the Amiga o.s.

Passion! 

Fun! 

Credits to Eric W. Schwartz: http://aros.sourceforge.net/downloads/kitty/

(12)

What AROS needs

Drivers. Drivers. Have I said drivers?

Difficult port: Amiga o.s. APIs/ABI ≠ Windows or Unix/Posix

Primary need: graphics drivers. Few cards supported…

Primary fallback for graphics drivers: handling VESA modes

(13)

VESA mode (not modes!)

Problems:

- Only one VESA mode selectable at boot (from GRUB)
- Changing VESA mode requires an entire o.s. reboot

Solution: calling the VESA BIOS APIs (INT 10h) lets you:

- List available screen modes
- Change current mode
- Set/Get palette colors
- Set screen display inside the framebuffer (virtual screens)
- More…

(14)

How the trick works

Boot time: AROS driver initializes the 8086 emulator

AROS driver calls emulator’s INT 10h (VESA BIOS services)

Emulator calls AROS driver’s callbacks when needed

AROS driver gets results from emulator call

AROS: http://www.aros.org/

(15)

Emulator overview

Some APIs exposed to set emulator status & callbacks

A couple of APIs to run/stop execution

Execution might hang… by design!

External events & emulation status controlled by caller

It’s an emulator, not a full PC: no hardware emulated!

(16)

The 8086 architecture

All registers are 16-bit

General purpose Registers

AX = AH : AL
BX = BH : BL
CX = CH : CL
DX = DH : DL

Index Registers

SI  Source Index
DI  Destination Index
SP  Stack Pointer
BP  Base Pointer

Segment Registers

CS  Code Segment
DS  Data Segment
SS  Stack Segment
ES  Extra Segment

Program Counter

IP

Status Register

Flags

(17)

Registers representation

0 AX

1 CX

2 DX

3 BX

4 SP

5 BP

6 SI

7 DI

8 (INTERNAL) ES

9 (INTERNAL) CS

10 (INTERNAL) SS

11 (INTERNAL) DS

12 *NOT USED*

13 *SCRATCH PAD*

14 (INTERNAL) IP

15 (INTERNAL) FLAGS

Array with 16 16-bit values

(18)

Registers definition

# General purpose registers
AX, CX, DX, BX, SP, BP, SI, DI = xrange(8)

# Segment registers
# Cannot be written normally! Use proper write_segment function
INTERNAL_ES, INTERNAL_CS, INTERNAL_SS, INTERNAL_DS = xrange(8, 12)

# Special registers
# Cannot be read or written normally!
# Use proper read/write_ip or read/write_flags functions
INTERNAL_TEMP_REG, INTERNAL_IP, INTERNAL_FLAGS = xrange(13, 16)

(19)

Registers class

class Registers(object):

    def __init__(self):
        self._pointer = Pointer(16 * 2)

    def __getitem__(self, index):
        return internal_read_word(self._pointer, index * 2)

    def __setitem__(self, index, value):
        internal_write_word(self._pointer, index * 2, value)

    def __len__(self):
        return 16

    def __add__(self, other):
        return self._pointer + other * 2

(20)

Accessing registers like in C

registers = Registers()  # The unique/global registers data structure

def registers_as_bytes_pointer():
    return registers + 0

def pointer_to_byte_register(reg):
    return registers_as_bytes_pointer() + register_byte_index_to_offset[reg]

# AL, CL, DL, BL, AH, CH, DH, BH
register_byte_index_to_offset = 0, 2, 4, 6, 1, 3, 5, 7

def inc_reg(reg):
    inc_operand_16(registers + reg)

registers[AX]

(21)

Registers public interface

def read_register(reg):
    return registers[reg]

def write_register(reg, value):
    # The 0xffff masking can be avoided in C
    registers[reg] = value & 0xffff

def read_byte_register(reg):
    return pointer_to_byte_register(reg)[0]

def write_byte_register(reg, value):
    # The 0xff masking can be avoided in C
    pointer_to_byte_register(reg)[0] = value & 0xff
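
The byte-register aliasing can be tried in isolation with a self-contained sketch. This is not the talk's code: it stores the registers in a plain bytearray instead of the Pointer class, the helper names (write_reg16, read_reg16, write_reg8, read_reg8) are made up for the example, and it uses Python 3's range where the slides use Python 2's xrange. Only the offset table is taken from the slides.

```python
# Sketch: sixteen 16-bit registers stored little-endian in one 32-byte buffer.
AX, CX, DX, BX, SP, BP, SI, DI = range(8)
AL, CL, DL, BL, AH, CH, DH, BH = range(8)

# Byte offset of each 8-bit register inside the buffer (table from the slides).
BYTE_REGISTER_OFFSET = (0, 2, 4, 6, 1, 3, 5, 7)

register_file = bytearray(16 * 2)

def write_reg16(reg, value):
    value &= 0xffff                              # keep 16 bits only
    register_file[reg * 2] = value & 0xff        # low byte first
    register_file[reg * 2 + 1] = value >> 8      # then high byte

def read_reg16(reg):
    return register_file[reg * 2] | (register_file[reg * 2 + 1] << 8)

def write_reg8(reg, value):
    register_file[BYTE_REGISTER_OFFSET[reg]] = value & 0xff

def read_reg8(reg):
    return register_file[BYTE_REGISTER_OFFSET[reg]]

write_reg16(AX, 0x1234)
print(hex(read_reg8(AL)), hex(read_reg8(AH)))    # 0x34 0x12
write_reg8(BH, 0xab)
print(hex(read_reg16(BX)))                       # 0xab00
```

Because AL and AH are just bytes 0 and 1 of the same buffer, writing AX updates both without any extra bookkeeping.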

(22)

The 8086 memory model

Credits to Brock University: http://www.cosc.brocku.ca/~bockusd/3p92/Local_Pages/8086_achitecture.htm

Physical address = Segment * 16 + Offset
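
A short worked example of the formula, as a sketch. The function name physical_address is an assumption made for this illustration; it does not appear in the talk.

```python
def physical_address(segment, offset):
    """Real Mode linear address: segment * 16 + offset (both 16-bit values)."""
    return (segment & 0xffff) * 16 + (offset & 0xffff)

# The reset vector: CS=0xffff, IP=0x0000
print(hex(physical_address(0xffff, 0x0000)))     # 0xffff0

# Different segment:offset pairs can name the same byte
print(hex(physical_address(0x1234, 0x0010)))     # 0x12350
print(physical_address(0x1234, 0x0010) == physical_address(0x1235, 0x0000))  # True

# The highest reachable address lies slightly above 1MB
print(hex(physical_address(0xffff, 0xffff)))     # 0x10ffef
```

The last line shows that a segment near the top of memory can address bytes past the 1MB boundary, so a flat emulator buffer needs some headroom beyond 1MB.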

(23)

Segments public interface

def read_segment(segment):
    return read_register(segment + INTERNAL_ES)

def write_segment(segment, value):
    write_register(segment + INTERNAL_ES, value)
    cache_segment(segment)

… and private!

def cache_segment(segment):
    segments_addresses[segment] = memory + \
        read_register(segment + INTERNAL_ES) * 16

(24)

Pointer class – part 1

class Pointer(object):

    def __init__(self, size=0, buffer=None, position=0):
        self._buffer = buffer or bytearray(size)
        self._position = position

    def __getitem__(self, address):
        return self._buffer[self._position + address]

    def __setitem__(self, address, value):
        self._buffer[self._position + address] = value

    def __add__(self, other):
        return Pointer(buffer=self._buffer, position=self._position + other)

    def __sub__(self, other):
        if isinstance(other, Pointer):
            return self._position - other._position
        else:
            return Pointer(buffer=self._buffer,
                           position=self._position - other)

(25)

Pointer class – part 2

class Pointer(object):

    […]

    def __iadd__(self, other):
        self._position += other
        return self

    def __isub__(self, other):
        self._position -= other
        return self

    def __int__(self):
        return self._position

(26)

Memory data structures

# 1MB + 128KB to protect the upper memory access
memory = Pointer((1024 + 128) * 1024)

# 64KB I/O space + 2 bytes to protect the upper I/O access
ports = Pointer(64 * 1024 + 2)

# Caches the linear address for every segment
# The current instruction is cached as segment #6 == IP
segments_addresses = [memory + 0 for i in xrange(8)]

(27)

Memory public interface

def fill_memory(start_address, length, value):
    for address in xrange(start_address, start_address + length):
        memory[address] = value

def read_byte(address):
    return memory[address]

def write_byte(address, value):
    memory[address] = value

def read_word(address):
    return internal_read_word(memory, address)

def write_word(address, value):
    internal_write_word(memory, address, value)

(28)

Accessing words (16-bit data)

def internal_read_word(pointer, address):
    # WARNING: pointer + address are treated as a linear address,
    # so it can cross the 64KB segment limit, like 80286+
    return pointer[address] + (pointer[address + 1] << 8)

def internal_write_word(pointer, address, value):
    # The 0xff masking can be avoided in C
    pointer[address] = value & 0xff
    # WARNING: 64KB segment cross
    # The 0xff masking can be avoided in C
    pointer[address + 1] = (value >> 8) & 0xff

(29)

Resetting the emulator

def reset_8086():
    for i in xrange(len(registers)):
        write_register(i, 0)
    write_register(INTERNAL_CS, 0xffff)
    write_register(INTERNAL_FLAGS, 0xf002)
    for i in xrange(4):
        cache_segment(i)  # For all four segments. See below
    segments_addresses[IP] = memory + read_register(INTERNAL_CS) * 16

ES, CS, SS, DS = xrange(4)
TEMP_REG, IP, FLAGS = xrange(5, 8)

CS=0xffff, IP=0x0000 -> first instruction at 0xffff0

(30)

8086 Instructions

Instructions may be preceded by one or more prefixes:

- LOCK

- Data segment override (ES:, CS:, SS:, DS:)
- String repeat (REP/REPE, REPNE)

- WAIT (for FPU instructions)

(31)

Instruction decoding & execution

By design, LOCK and WAIT prefixes are ignored (NOPs)

Segment and String prefixes must be stored

Prefixes must be cleared after execution

Opcodes are grouped to simplify decoding and execution

MOV to SS register should be atomic with next instruction
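
The prefix rules above can be sketched as a small scanner. This is an illustration only: the function scan_prefixes and the three dictionaries are hypothetical names, and a separate scanning loop is a simplification (the talk's emulator stores the prefix state in globals and dispatches every byte through its main loop). The byte values are the standard 8086 encodings.

```python
# Prefix byte values from the 8086 instruction set.
SEGMENT_PREFIXES = {0x26: "ES", 0x2e: "CS", 0x36: "SS", 0x3e: "DS"}
REPEAT_PREFIXES = {0xf2: "REPNE", 0xf3: "REP/REPE"}
IGNORED_PREFIXES = {0xf0: "LOCK", 0x9b: "WAIT"}   # treated as NOPs

def scan_prefixes(code, position=0):
    """Consume prefix bytes; return (segment, repeat, position of the opcode)."""
    segment = None
    repeat = None
    while position < len(code):
        byte = code[position]
        if byte in SEGMENT_PREFIXES:
            segment = SEGMENT_PREFIXES[byte]      # stored for the instruction
        elif byte in REPEAT_PREFIXES:
            repeat = REPEAT_PREFIXES[byte]        # stored for the instruction
        elif byte not in IGNORED_PREFIXES:
            break                                 # first real opcode byte
        position += 1
    return segment, repeat, position

# F3 26 A4 = REP ES: MOVSB
print(scan_prefixes(bytes([0xf3, 0x26, 0xa4])))   # ('ES', 'REP/REPE', 2)
# F0 90 = LOCK NOP
print(scan_prefixes(bytes([0xf0, 0x90])))         # (None, None, 1)
```

Each call starts from a clean state, which models the requirement that prefixes are cleared once the instruction has executed.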

(32)

The main loop!

def run_8086():
    global running, segment_override, rep_prefix
    running = True
    segment_override = NO_SEGMENT_OVERRIDE
    rep_prefix = NO_REPEAT
    while running:
        macro_opcode, parameter = split_opcode[get_byte()]
        macro_opcode_execute[macro_opcode](parameter)

NO_SEGMENT_OVERRIDE = 0
SEGMENT_OVERRIDE_ENABLED = 8
NO_REPEAT, REPEAT_ZERO, REPEAT_NOT_ZERO = xrange(3)

(33)

(Macro)Grouping opcodes

split_opcode = (
    # 0x
    (BINARY_MEM_REG_8, ADD), (BINARY_MEM_REG_16, ADD),
    (BINARY_REG_MEM_8, ADD), (BINARY_REG_MEM_16, ADD),
    (BINARY_AL_IMM_8, ADD), (BINARY_AX_IMM_16, ADD),
    (PUSH_REG, INTERNAL_ES), (POP_SEG, ES),
    (BINARY_MEM_REG_8, OR), (BINARY_MEM_REG_16, OR),
    (BINARY_REG_MEM_8, OR), (BINARY_REG_MEM_16, OR),
    (BINARY_AL_IMM_8, OR), (BINARY_AX_IMM_16, OR),
    (PUSH_REG, INTERNAL_CS), (POP_REG, INTERNAL_CS),
    # 1x
    (BINARY_MEM_REG_8, ADC), (BINARY_MEM_REG_16, ADC),
    […]
    # 9x
    (XCHG_REG, AX),  # NOP!
    (XCHG_REG, CX), (XCHG_REG, DX), (XCHG_REG, BX), (XCHG_REG, SP),
    (XCHG_REG, BP), (XCHG_REG, SI), (XCHG_REG, DI),
    (INSTRUCTION, CBW), (INSTRUCTION, CWD),
    (INSTRUCTION, CALL_FAR_IMM16_IMM16), (INSTRUCTION, WAIT),
    (INSTRUCTION, PUSHF), (INSTRUCTION, POPF),
    (INSTRUCTION, SAHF), (INSTRUCTION, LAHF),
    # Ax
    (INSTRUCTION_IMM_16, MOV_AL_FROM_DIRECT),
    (INSTRUCTION_IMM_16, MOV_AX_FROM_DIRECT),
    […]
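
To see the table-driven dispatch run end to end, here is a minimal self-contained sketch. It reuses the names split_opcode and macro_opcode_execute from the slides, but the table contents, the handlers, and the run function are assumptions written for this example: only INC r16, XCHG AX, r16 and HLT are covered, flags are ignored, and the registers are a plain list.

```python
AX, CX, DX, BX, SP, BP, SI, DI = range(8)
INC_REG, XCHG_REG, HALT, UNSUPPORTED = range(4)

registers = [0] * 8
running = False

def inc_reg(reg):
    registers[reg] = (registers[reg] + 1) & 0xffff    # flags ignored here

def xchg_reg(reg):
    registers[AX], registers[reg] = registers[reg], registers[AX]

def halt(_parameter):
    global running
    running = False

def unsupported(opcode):
    raise NotImplementedError("opcode 0x%02x" % opcode)

# Indexed by macro opcode, like macro_opcode_execute in the main loop.
macro_opcode_execute = (inc_reg, xchg_reg, halt, unsupported)

# One (macro opcode, parameter) pair for each of the 256 opcode bytes.
split_opcode = [(UNSUPPORTED, opcode) for opcode in range(256)]
for reg in range(8):
    split_opcode[0x40 + reg] = (INC_REG, reg)         # INC r16
    split_opcode[0x90 + reg] = (XCHG_REG, reg)        # XCHG AX, r16 (0x90 = NOP)
split_opcode[0xf4] = (HALT, 0)                        # HLT

def run(code):
    global running
    running = True
    position = 0
    while running:
        macro_opcode, parameter = split_opcode[code[position]]
        position += 1
        macro_opcode_execute[macro_opcode](parameter)

# INC CX; INC CX; XCHG AX, CX; NOP; HLT
run(bytes([0x41, 0x41, 0x91, 0x90, 0xf4]))
print(registers[AX], registers[CX])                   # 2 0
```

Grouping works because many opcodes differ only in a small parameter (here the register number), so one handler serves a whole row of the opcode map.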

(34)

Executing a (macro)instruction

def add_16bit(source, target, result):
    global flags_first_operand, flags_second_operand, \
        flags_result, flags_operation
    flags_first_operand = source
    flags_second_operand = target
    flags_result = source + target
    write_word_to_location(result, flags_result)
    flags_operation = FLAGS_ADD16

binary_16_execute = (
    add_16bit, or_16bit, adc_16bit, sbb_16bit, and_16bit,
    sub_16bit, xor_16bit, cmp_16bit, mov_16bit, test_16bit,
)

def binary_mem_reg_16(code):
    address, register_pointer = decode_modrm_16bit()
    binary_16_execute[code](read_word_from_location(address),
                            read_word_from_location(register_pointer),
                            address)

(35)

The ModR/M byte

Bits 7-6: MOD    Bits 5-3: REG    Bits 2-0: R/M

REG -> 8 or 16-bit register

MOD

00 -> Memory, no displacement*
01 -> Memory, 8-bit displacement
10 -> Memory, 16-bit displacement
11 -> 8 or 16-bit register

R/M

000 -> [SI+BX+Displacement]
001 -> [DI+BX+Displacement]
010 -> [SI+BP+Displacement]
011 -> [DI+BP+Displacement]
100 -> [SI+Displacement]
101 -> [DI+Displacement]
110 -> [BP+Displacement]*
111 -> [BX+Displacement]

* MOD=00 with R/M=110 -> [Direct Address] (16-bit) instead of [BP]
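
The field layout can be checked with a few lines of Python. This is a sketch for illustration: split_modrm and describe_modrm are hypothetical helpers, not functions from the emulator, and they only describe the 16-bit operand forms as text.

```python
RM_FORMS = ("BX+SI", "BX+DI", "BP+SI", "BP+DI", "SI", "DI", "BP", "BX")
REG16 = ("AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI")

def split_modrm(modrm):
    """Return the (mod, reg, rm) fields of a ModR/M byte."""
    return modrm >> 6, (modrm >> 3) & 7, modrm & 7

def describe_modrm(modrm):
    """Return (register operand, register-or-memory operand) as text."""
    mod, reg, rm = split_modrm(modrm)
    if mod == 3:
        operand = REG16[rm]                      # register, no memory access
    elif mod == 0 and rm == 6:
        operand = "[disp16]"                     # direct address special case
    else:
        displacement = ("", "+disp8", "+disp16")[mod]
        operand = "[%s%s]" % (RM_FORMS[rm], displacement)
    return REG16[reg], operand

print(split_modrm(0x47))      # (1, 0, 7)
print(describe_modrm(0x47))   # ('AX', '[BX+disp8]')
print(describe_modrm(0x0e))   # ('CX', '[disp16]')
print(describe_modrm(0xd8))   # ('BX', 'AX')
```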

(36)

Decoding the ModR/M

def decode_modrm_16bit():
    modrm = get_byte()
    rm = modrm & 7
    mod_ = modrm >> 6
    offset = modrm_offset[rm + segment_override]()
    address = modrm_address_16bit[mod_](offset, rm)
    reg = (modrm >> 3) & 7
    register_pointer = registers + reg
    return address, register_pointer

modrm_address_16bit = (
    mod0_no_displacement, mod1_8bit_displacement,
    mod2_16bit_displacement, mod3_16bit_register,
)

modrm_offset = (
    rm0_bx_si_ds, rm1_bx_di_ds, rm2_bp_si_ss, rm3_bp_di_ss,
    rm4_si_ds, rm5_di_ds, rm6_bp_ss, rm7_bx_ds,
    rm0_bx_si, rm1_bx_di, rm2_bp_si, rm3_bp_di,
    rm4_si, rm5_di, rm6_bp, rm7_bx,
)

(37)

The FLAGS register

Bit:   15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
Flag:   1  1  1  1  O  D  I  T  S  Z  0  A  0  P  1  C

O Overflow
D Direction
I Interrupt
T Trace
S Sign
Z Zero
A Auxiliary Carry
P Parity
C Carry

1 RESERVED – Always 1
0 RESERVED – Always 0
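
The layout translates directly into bit masks. The sketch below is illustrative: the names FLAG_BITS, ALWAYS_ONE_MASK, ALWAYS_ZERO_MASK, normalize_flags and flags_to_string are assumptions for this example, with values read off the bit layout above.

```python
# (flag letter, bit position) pairs, from the most significant flag down.
FLAG_BITS = (
    ("O", 11), ("D", 10), ("I", 9), ("T", 8), ("S", 7),
    ("Z", 6), ("A", 4), ("P", 2), ("C", 0),
)

ALWAYS_ONE_MASK = 0xf002    # bits 15-12 and bit 1
ALWAYS_ZERO_MASK = 0x0028   # bits 5 and 3

def normalize_flags(value):
    """Force the reserved bits to their fixed 8086 values."""
    return ((value & 0xffff) | ALWAYS_ONE_MASK) & ~ALWAYS_ZERO_MASK

def flags_to_string(value):
    """List the letters of the flags that are set."""
    return "".join(name for name, bit in FLAG_BITS if value & (1 << bit))

print(hex(normalize_flags(0)))        # 0xf002
print(hex(normalize_flags(0xffff)))   # 0xffd7
print(flags_to_string(0xf246))        # IZP
```

With every real flag cleared the register reads 0xf002, which matches the value written by reset_8086.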

(38)

Flags update

Arithmetic operations usually update 6 flags (O, S, Z, A, P, C)

Exceptions: INC/DEC don’t update the Carry!

Logical instructions update S, Z, P; clear O, C; A is undefined

Rotates only update O and C!

Luckily, some flags are often left undefined

Updating flags has a HUGE impact on performance!

(39)

Calculating flags: THE nightmare!

def common_read_flags_8bit(auxiliar):
    global flags_operation
    flags_operation = FLAGS_NORMAL
    # The 0xff masking can be avoided in C
    result = flags_result & 0xff
    carry = (flags_result >> 8) & CARRY_MASK
    parity = parity_table[result & 0xff]
    zero = (result == 0) << ZERO_FLAG
    sign = result & SIGN_MASK
    overflow = ((flags_first_operand ^ flags_second_operand ^ result)
                << (OVERFLOW_FLAG - 7)) & OVERFLOW_MASK
    flags = (read_register(INTERNAL_FLAGS) &
             (~(CARRY_MASK | PARITY_MASK | AUXILIARY_MASK |
                ZERO_MASK | SIGN_MASK | OVERFLOW_MASK |
                RESERVED3_MASK | RESERVED5_MASK))) | \
            (carry | parity | auxiliar | zero | sign | overflow)
    write_register(INTERNAL_FLAGS, flags)
    return flags

def read_auxiliary_add():
    return ((flags_first_operand & 0x0f) +
            (flags_second_operand & 0x0f)) & AUXILIARY_MASK

(40)

A “quantum” approach to flags

Operations do NOT calculate flags every time

Operands, result, and “rough” operation saved

Flags status “collapses” only when needed

If few flags needed, calculate ONLY them!

If more flags needed, full calculation made

Rotates always calculate flags
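
A minimal sketch of the lazy approach for a single operation, 8-bit ADD. The global names follow the slides, but add_8bit, read_zero, read_all_flags and the overflow formula are assumptions written for this illustration, and the Auxiliary and Parity flags are left out for brevity.

```python
FLAGS_NONE, FLAGS_ADD8 = range(2)
CARRY_MASK, ZERO_MASK, SIGN_MASK, OVERFLOW_MASK = 0x0001, 0x0040, 0x0080, 0x0800

flags_first_operand = flags_second_operand = flags_result = 0
flags_operation = FLAGS_NONE

def add_8bit(first, second):
    """Do the addition; only remember what is needed to derive flags later."""
    global flags_first_operand, flags_second_operand, flags_result, flags_operation
    flags_first_operand, flags_second_operand = first, second
    flags_result = first + second            # kept unmasked: bit 8 is the carry
    flags_operation = FLAGS_ADD8
    return flags_result & 0xff

def read_zero():
    """Cheap path: a JZ/JNZ needs nothing more than this."""
    return (flags_result & 0xff) == 0

def read_overflow_none():
    return 0

def read_overflow_add8():
    # Signed overflow on ADD: operands share a sign that the result lacks.
    result = flags_result & 0xff
    return ((flags_first_operand ^ result) &
            (flags_second_operand ^ result) & 0x80) << 4

# Indexed by the saved "rough" operation.
read_overflow_from_operation = (read_overflow_none, read_overflow_add8)

def read_all_flags():
    """Full path: collapse the saved state into C, Z, S and O."""
    result = flags_result & 0xff
    carry = (flags_result >> 8) & CARRY_MASK
    zero = ZERO_MASK if result == 0 else 0
    sign = result & SIGN_MASK
    overflow = read_overflow_from_operation[flags_operation]()
    return carry | zero | sign | overflow

print(hex(add_8bit(0x80, 0x80)))   # 0x0
print(read_zero())                 # True
print(hex(read_all_flags()))       # 0x841
```

The addition itself costs three assignments; the expensive bit manipulation happens only if some later instruction actually asks for the flags.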

(41)

An example: CMP + JNZ

def cmp_16bit(source, target, result):
    global flags_first_operand, flags_second_operand, \
        flags_result, flags_operation
    flags_first_operand = source
    flags_second_operand = target
    flags_result = source - target
    flags_operation = FLAGS_SUB16

jump_short_execute = (
    cc_o, cc_no, cc_c, cc_nc, cc_z, cc_nz, cc_be, cc_nbe, […]

def jump_short(jump_type):
    if jump_short_execute[jump_type](): […]

def cc_nz():
    return not read_zero_for_logical_from_operation[flags_operation]()

read_zero_for_logical_from_operation = ( […]
    read_zero_for_logical_generic_16bit,

def read_zero_for_logical_generic_16bit():
    return not(flags_result & 0xffff)

(42)

Testing the beast

Unit tests developed together with the regular code!

One feature or opcode -> one or multiple tests written

All public APIs tested as well

Exercise as many scenarios as possible

Tests make it safe (enough!) to experiment

Standard lib unittest module used (supported by PTVS)
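
As an idea of what one such test case could look like, here is a small sketch using unittest. The test class and test names are made up for this example, and the two word helpers are re-declared on a plain bytearray so that the block runs on its own; the real suite is not shown in the slides.

```python
import unittest

def internal_read_word(buffer, address):
    # Little-endian 16-bit read
    return buffer[address] + (buffer[address + 1] << 8)

def internal_write_word(buffer, address, value):
    # Little-endian 16-bit write, truncated to 16 bits
    buffer[address] = value & 0xff
    buffer[address + 1] = (value >> 8) & 0xff

class WordAccessTest(unittest.TestCase):

    def setUp(self):
        self.memory = bytearray(16)

    def test_write_is_little_endian(self):
        internal_write_word(self.memory, 4, 0x1234)
        self.assertEqual(self.memory[4], 0x34)
        self.assertEqual(self.memory[5], 0x12)

    def test_round_trip(self):
        internal_write_word(self.memory, 0, 0xbeef)
        self.assertEqual(internal_read_word(self.memory, 0), 0xbeef)

    def test_value_is_truncated_to_16_bits(self):
        internal_write_word(self.memory, 2, 0x12345)
        self.assertEqual(internal_read_word(self.memory, 2), 0x2345)

# Run the case explicitly so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordAccessTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```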

(43)

Tests in action

(44)

Testing the final C version

C 8086 emulator compiled as DLL

DLL imported by Python wrapper (ctypes)

Python callbacks provided to the DLL

Tests transparently run with the regular test suite

(45)

The callbacks

def set_on_disable_interrupts(handler):
    global on_disable_interrupts
    on_disable_interrupts = handler

def set_on_enable_interrupts(handler):
    global on_enable_interrupts
    on_enable_interrupts = handler

def set_on_byte_input(handler):
    global on_byte_input
    on_byte_input = handler

def set_on_byte_output(handler):
    global on_byte_output
    on_byte_output = handler
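
A sketch of how a caller might use the port I/O callbacks. The handler signatures (a port number for input, a port number and a value for output) are assumptions, since the slides do not show them; in_byte, out_byte and the fake device are likewise invented for this example.

```python
on_byte_input = None
on_byte_output = None

def set_on_byte_input(handler):
    global on_byte_input
    on_byte_input = handler

def set_on_byte_output(handler):
    global on_byte_output
    on_byte_output = handler

def in_byte(port):
    """What an emulated IN could do: ask the caller for the value."""
    return on_byte_input(port) & 0xff

def out_byte(port, value):
    """What an emulated OUT could do: hand the value to the caller."""
    on_byte_output(port, value & 0xff)

# Caller side: a fake device that logs writes and answers reads.
written = []

def fake_input(port):
    return 0x5a if port == 0x3da else 0xff

def fake_output(port, value):
    written.append((port, value))

set_on_byte_input(fake_input)
set_on_byte_output(fake_output)

out_byte(0x3c8, 0x110)              # value is truncated to 8 bits
print(written)                      # [(968, 16)]
print(hex(in_byte(0x3da)))          # 0x5a
```

Since the emulator never touches hardware itself, the caller decides whether a port access reaches a real device, a log, or nothing at all.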

(46)

What’s missing

Unary operations (NOT, NEG, shifts, etc.)

String operations (REP MOVS, REP STOS, etc.)

Tracing instructions (not needed; easy to implement)

8086 specific behaviors (too much effort; almost zero return)

Much more test coverage

(47)

Thanks to

My family

For putting up with me...

(48)

Q & A
