In this chapter we will start practicing assembly language by gradually writing more complex programs for Linux.
We will observe some architecture details that impact the writing of all kinds of programs (e.g., endianness).
We have chosen a *nix system for this book because programming in assembly is much easier there than in Windows.
2.1 Setting Up the Environment
It is impossible to learn programming without trying to program. So we are going to start programming in assembly right now.
We are using the following setup in order to complete assembler and C assignments:
• Debian GNU/Linux 8.0 as an operating system.
• NASM 2.11.05 as an assembly language compiler.
• GCC 4.9.2 as a C language compiler. This exact version is used to produce assembly from C programs. The Clang compiler can be used as well.
• GNU Make 4.0 as a build system.
• GDB 7.7.1 as a debugger.
• The text editor you like (preferably with syntax highlighting). We advocate ViM usage.
If you want to set up your own system, install any Linux distribution you like and make sure you install the programs just listed. To our knowledge, Windows Subsystem for Linux is also well suited to do all the assignments. You can install it and then install necessary packages using apt-get. Refer to the official guide located at: https://msdn.microsoft.com/en-us/commandline/wsl/install_guide.
On the Apress web site for this book, http://www.apress.com/us/book/9781484224021, you can find the following:
• Two preconfigured virtual machines with the whole toolchain installed. One of them has a desktop environment; the other one is just the minimal system that can be accessed through SSH (Secure Shell). The installation instructions and other usage information are located in the README.txt file in the downloaded archive.
• A link to the GitHub page with all the book’s listings, answers to the questions, and solutions.
2.1.1 Working with Code Examples
Throughout this chapter, you will see numerous code examples. Compile them and if you have difficulty grasping their logic, try to execute them step by step using gdb. It is a great help in studying code. See Appendix A for a quick tutorial on gdb.
Appendix D provides more information about the system used for performance tests.
2.2 Writing “Hello, world”
2.2.1 Basic Input and Output
Unix ideology postulates that “everything is a file.” A file, in a large sense, is anything that looks like a stream of bytes. Through files one can abstract such things as
• data access on a hard drive/SSD;
• data exchange between programs; and
• interaction with external devices.
We will follow the tradition of writing a simple “Hello, world!” program for a start. It displays a welcome message on screen and terminates. However, such a program must show characters on screen, which cannot be done directly if a program is not running on bare metal, without an operating system babysitting its activity. An operating system’s purpose is, among other things, to abstract and manage resources, and display is surely one of them. It provides a set of routines to handle communication with external devices, other programs, file systems, and so on. A program usually cannot bypass the operating system and interact directly with the resources it controls. It is limited to system calls, which are routines provided by an operating system to user applications.
Unix identifies a file with its descriptor as soon as it is opened by a program. A descriptor is nothing more than an integer value (like 42 or 999). A file is opened explicitly by invoking the open system call; however, three important files are opened as soon as a program starts and thus should not be managed manually. These are stdin, stdout, and stderr. Their descriptors are 0, 1, and 2, respectively. stdin is used to handle input, stdout to handle output, and stderr is used to output information about the program execution process but not its results (e.g., errors and diagnostics).
By default, keyboard input is linked to stdin and terminal output is linked to stdout. It means that “Hello, world!” should write into stdout.
Thus we need to invoke the write system call. It writes a given number of bytes from memory, starting at a given address, to a file with a given descriptor (in our case, 1). The bytes will encode string characters using a predefined table (the ASCII table). Each entry is a character; its index in the table is its code. ASCII itself defines codes from 0 to 127, while a byte can hold any value from 0 to 255.
See Listing 2-1 for our first complete example of an assembly program.
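Listing 2-1. hello.asm
global _start

section .data
message: db 'hello, world!', 10

section .text
_start:
    mov rax, 1 ; system call number should be stored in rax
    mov rdi, 1 ; argument #1 in rdi: where to write (descriptor)?
    mov rsi, message ; argument #2 in rsi: where does the string start?
    mov rdx, 14 ; argument #3 in rdx: how many bytes to write?
    syscall ; this instruction invokes a system call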
This program sets up the arguments of the write system call and then invokes it with the syscall instruction. It is really the only thing it does. The next sections will explain this sample program in greater detail.
2.2.2 Program Structure
As we remember from the von Neumann machine description, there is only one memory, for both code and data; those are indistinguishable. However, a programmer wants to separate them. An assembly program is usually divided into sections. Each section has its use: for example, .text holds instructions, and .data is for global variables (data available at every moment of the program execution). One can switch back and forth between sections; in the resulting program all data belonging to a given section will be gathered in one place.
To get rid of numeric address values programmers use labels. They are just readable names for addresses. A label can precede any command and is usually separated from it by a colon. In this program, _start is a label that precedes the first instruction.
A notion of variable is typical for higher-level languages. In assembly language, in fact, notions of variables and procedures are quite subtle. It is more convenient to speak about labels (or addresses).
An assembly program can be divided into multiple files. One of them should contain the _start label. It is the entry point; it marks the first instruction to be executed.
This label should be declared global (this is what the global _start directive does). The meaning of it will be evident later.
Comments start with a semicolon and last until the end of the line.
Assembly language consists of commands, which are directly mapped into machine code. However, not all language constructs are commands. Others control the translation process and are usually called directives.¹
In the “Hello, world!” example there are three directives: global, section, and db.
■ Note
Assembly language is, in general, case insensitive, but label names are not! mov, mOV, and Mov are all the same thing, but global _start and global _START are not! Section names are case sensitive too: section .DATA and section .data differ!
The db directive is used to create byte data. Usually data is defined using one of these directives, which differ by data format:
• db—bytes;
• dw—so-called words, equal to 2 bytes each;
• dd—double words, equal to 4 bytes; and
• dq—quad words, equal to 8 bytes.
Let’s see an example, in Listing 2-2.
Listing 2-2. data_decl.asm
section .data
example1: db 5, 16, 8, 4, 2, 1
example2: times 999 db 42
example3: dw 999
¹ The NASM manual also uses the name “pseudo instruction” for a specific subset of directives.
times n cmd is a directive that repeats cmd n times in the program code, as if you had copy-pasted it n times. It also works with central processing unit (CPU) instructions.
Note that you can create data inside any section, including .text. As we told you earlier, for a CPU data and instructions are all alike and the CPU will try to interpret data as encoded instructions when asked to.
These directives allow you to define several data objects one by one, as in Listing 2-3, where a sequence of characters is followed by a single byte equal to 10.
Listing 2-3. hello.asm
message: db 'hello, world!', 10
Letters, digits, and other characters are encoded in ASCII. Programmers have agreed upon a table, where each character is assigned a unique number, its ASCII code. We start at the address corresponding to the label message. We store the ASCII codes for all characters of the string "hello, world!", then we add a byte equal to 10. Why 10? By convention, to start a new line we output a special character with code 10.
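To make the encoding concrete, the following sketch defines the same bytes twice: once as a quoted string and once as explicit ASCII codes. The labels as_text and as_codes and the file name ascii_bytes.asm are made up for this illustration, and the shorter string 'hi!' is used only to keep the list of codes small.

```nasm
; ascii_bytes.asm (hypothetical file name, not one of the book's listings)
section .data
as_text:  db 'hi!', 10              ; three characters followed by the new line code
as_codes: db 0x68, 0x69, 0x21, 10   ; the same four bytes: 'h', 'i', '!', new line
```

The file contains only data, so it can be assembled with nasm -felf64 but not run. Dumping the .data section of the resulting object file (for example with objdump -s -j .data) should show the byte sequence 68 69 21 0a twice.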
■ Terminological chaos
It is quite common to refer to the integer format most native to the computer as the machine word. As we are programming a 64-bit computer, where addresses are 64-bit and general purpose registers are 64-bit, it is pretty convenient to take the machine word size as 64 bits or 8 bytes.
In assembly programming for Intel architecture the term word was indeed used to describe a 16-bit data entry, because on the older machines it was exactly the machine word. Unfortunately, for legacy reasons, it is still used as in old times. That’s why 32-bit data is called double words and 64-bit data is referred to as quad words.
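The size names can be lined up against the data directives in a short sketch. The label names and the file name sizes.asm are assumptions made for this illustration; the values themselves are arbitrary.

```nasm
; sizes.asm (hypothetical file name, not one of the book's listings)
section .data
one_byte:  db 0x11                  ; 1 byte
one_word:  dw 0x1122                ; 2 bytes: a "word" in Intel terminology
one_dword: dd 0x11223344            ; 4 bytes: a double word
one_qword: dq 0x1122334455667788    ; 8 bytes: a quad word, our 64-bit machine word
```

Taken together these four definitions occupy 1 + 2 + 4 + 8 = 15 bytes of the .data section.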
2.2.3 Basic Instructions
The mov instruction is used to write a value into either a register or memory. The value can be taken from another register or from memory, or it can be an immediate one. However,
1. mov cannot copy data from memory to memory;
2. the source and the destination operands must be of the same size.
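The permitted and forbidden forms of mov can be sketched as follows. This is only an illustration under a few assumptions: the labels src and dst and the file name mov_forms.asm are made up, and the square brackets are NASM's notation for a memory operand, which this section has not introduced. The last three instructions end the program through the exit system call, which is discussed together with Listing 2-4.

```nasm
; mov_forms.asm (hypothetical file name, not one of the book's listings)
global _start

section .data
src: dq 42
dst: dq 0

section .text
_start:
    mov rax, 7            ; immediate -> register
    mov rbx, rax          ; register  -> register
    mov rax, [src]        ; memory    -> register: copies the 8 bytes stored at src
    mov [dst], rax        ; register  -> memory: stores them at dst
    ; mov [dst], [src]    ; rejected by the assembler: memory -> memory
    ; mov rax, ebx        ; rejected by the assembler: 64-bit and 32-bit operands

    mov rax, 60           ; 'exit' syscall number
    xor rdi, rdi          ; exit code 0
    syscall
```

Copying a value from one memory cell to another therefore takes two mov instructions with a register in between, as the pair in the middle of the sketch does.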
The syscall instruction is used to perform system calls in *nix systems. The input/output operations depend on hardware (which can be also used by multiple programs at the same time), so programmers are not allowed to control them directly, bypassing the operating system.
Each system call has a unique number. To perform it:
1. The rax register has to hold the system call’s number;
2. The following registers should hold its arguments: rdi, rsi, rdx, r10, r8, and r9. A system call cannot accept more than six arguments.
3. Execute the syscall instruction.
It does not matter in which order the registers are initialized.
Note that the syscall instruction changes rcx and r11! We will explain the cause later.
When we wrote the “Hello, world!” program we used a simple write syscall. It accepts:
1. File descriptor;
2. The buffer address. We start taking consecutive bytes for writing from here;
3. The number of bytes to write.
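As a side sketch of these three arguments, the variation below sends a message to stderr (descriptor 2) instead of stdout and fills the registers in a different order, which makes no difference to the call. The file name hello_stderr.asm and the message text are assumptions made for this illustration. The program ends with the exit system call, which is discussed together with Listing 2-4.

```nasm
; hello_stderr.asm (hypothetical file name, not one of the book's listings)
global _start

section .data
message: db 'hello, stderr!', 10

section .text
_start:
    mov rdx, 15           ; argument #3: number of bytes (14 characters + new line)
    mov rsi, message      ; argument #2: address of the first byte to write
    mov rdi, 2            ; argument #1: descriptor 2 is stderr
    mov rax, 1            ; 'write' syscall number
    syscall

    mov rax, 60           ; 'exit' syscall number
    xor rdi, rdi          ; exit code 0
    syscall
```

Once built, running it as ./hello_stderr > /dev/null still shows the message, because only stdout is redirected, while ./hello_stderr 2> /dev/null shows nothing.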
To compile our first program, save the code in the file hello.asm,² then launch these commands in the shell:
> nasm -felf64 hello.asm -o hello.o
> ld -o hello hello.o
> chmod u+x hello
The details of the compilation process, along with the compilation stages, will be discussed in Chapter 5. Let’s launch “Hello, world!”:
> ./hello
hello, world!
Segmentation fault
We have clearly output what we wanted. However, the program seems to have caused an error. What did we do wrong? After executing a system call, the program continues its work. We did not write any instructions after syscall, but the memory cells that follow it still hold some values.
■ Note
If you did not put anything at some memory address, do not expect it to hold zeroes or any kind of valid instructions: it will most likely hold some kind of garbage.
A processor has no idea whether these values were intended to encode instructions or not. So, following its very nature, it tries to interpret them, because the rip register points at them. It is highly unlikely these values encode correct instructions, so an interrupt with code 6 will occur (invalid instruction).³
So what do we do? We have to use the exit system call, which terminates the program in a correct way, as shown in Listing 2-4.
Listing 2-4. hello_proper_exit.asm
section .data
message: db 'hello, world!', 10

section .text
global _start
_start:
    mov rax, 1 ; 'write' syscall number
    mov rdi, 1 ; stdout descriptor
    mov rsi, message ; string address
    mov rdx, 14 ; string length in bytes
    syscall

    mov rax, 60 ; 'exit' syscall number
    xor rdi, rdi
    syscall
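The single argument of exit, passed in rdi, is the exit code that the program reports to whoever launched it; xor rdi, rdi sets it to zero. A minimal sketch with a nonzero code follows. The file name exit_status.asm and the value 3 are arbitrary choices made for this illustration.

```nasm
; exit_status.asm (hypothetical file name, not one of the book's listings)
global _start

section .text
_start:
    mov rax, 60           ; 'exit' syscall number
    mov rdi, 3            ; argument #1: the exit code, chosen arbitrarily
    syscall
```

After building it with the same nasm and ld commands as before, launching ./exit_status and then typing echo $? in the shell prints 3.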
² Remember: all source code, including listings, can be found on www.apress.com/us/book/9781484224021 and is also stored in the home directory of the preconfigured virtual machine!
³ Even if not, soon the sequential execution will lead the processor to the end of the allocated virtual addresses (see section 4.2). In the end, the operating system will terminate the program because it is unlikely that the latter will recover from it.
■