Let's Learn x86-64 Assembly! Part 2 - We're Writing a Virtual Machine

This post is a part of a series on x86-64 assembly programming that I'm writing. Check out part 0 and part 1!.

In the previous parts we've covered some basics, examined the Windows x64 calling convention, and demonstrated the use of Flat Assembler's powerful macro system. It is now time for us to do something interesting with the tools that we've acquired.

I really like the idea of building emulators for weird fantasy hardware. Back when Notch published docs for DCPU-16 (RIP), I wrote one of the first emulators for it in C++. And in the unlikely event that you're a returning reader of this blog, you may have seen one of my old posts, "My Most Important Project Was a Bytecode Interpreter", where I talk about my experiments with a stack-based virtual machine.

So I thought, why not build an emulator for a fantasy CPU for this series? It would give us a pretty nice overview to the x86 instructiuon set basics, and it's a simple enough task that we won't have to get too far into the weeds.

In this post, I'll describe the instruction set of the fantasy CPU we'll be emulating, lay the foundation for our emulator and implement a few of the instructions.

Introducing QBX

For this project, I've chosen to emulate a 16-bit, register-based machine (not too far from DCPU-16, actually). There will be no floating point operations. I've decided to name it QBX, a recursive acronym for "QBX Bytecode eXecutor".

QBX will have 4 16-bit general-purpose registers (GPRs): q0, q1, q2 and q3.

In addition to the 4 GPRs, we'll have a few special registers:

  • qip -the instruction pointer. This is the moral equivalent of the x64 rip register, it contains the address of the next instruction to be executed.
  • qsp - the stack pointer (QBX is going to have stack manipulation instructions). The equivalent of rsp in x64.
  • qflags - the "flags" register, contains a set of bits that depend on the outcome of prior instructions. Some instructions (like arithmetic operations) affect the flags register, while others (such as a conditional branch) depend on the values of certain bits in the flags register. The x64 counterpart register is rflags.

In our implementation, rather than storing the contents of the "virtual" registers in some block of memory, each QBX register will map directly onto a real x86-64 register. This should eliminate a good number of redundant memory load/store operations.

Here is the mapping of QBX to x64 registers that I'm going to use:


Note that this requirement has a very important implication: things like "write value from q0 to q1" and "write value from q2 to q3" have to be treated as entirely separate instructions, and not simply as different instances of the same instruction. Why is that? Why not have a generic implementation of an instruction, and give it parameters which determine the registers to use?

If the contents of QBX registers were stored in memory, we indeed could do that, and simply pass the memory address corresponding to the virtual register into the instruction implementation. But remember that one of our goals is to maintain direct correspondence between virtual QBX registers and actual x86-64 registers. Since there is no indirect way to refer to a register (there's no such thing as "pointer to register"!), a "generic" implementation for an instruction isn't possible. In this case, we're making a tradeoff: speeding up emulated programs at the expense of increasing the VM's executable size.

Thus, most QBX instructions will be encoded with a single 2-byte value, called the "instruction code" (there will be some exceptions, which we'll call out explicitly). Things like "write value from q0 to q1" and "write value from q2 to q3" shall be represented with distinct instruction codes.

QBX will follow the Von Neumann model, storing both instructions and data in the same memory. Since we're not going to implement any kind of protection measures, QBX programs can actually be self-modifying.

Let's now give a brief overview of the QBX instruction set. I'll be providing the exact details on what each instruction does as we go through implementing them. For now we'll just list the general categories:

  • Memory manipulation instructions: load for reading data from memory into a register, stor for writing data from a register into memory, and instructions for moving data between registers.
  • Stack manipulation - push, pop
  • Flow control - unconditional jump, jump if zero flag is set, etc.
  • Arithmetic - add, subtract, multiply, divide
  • Logic - bitwise and, or, not, as well as bit shift operations

We will also eventually be adding the capability to draw characters to screen and read the keyboard status by means of memory-mapped I/O.

The source code of QBX will be posted to github as this series progresses.

Execution Loop Overview

Let's briefly examine how our implementation will be reading and executing instructions for the virtual machine.

We'll have a single readable-writeable chunk of memory (referred to as "QBX memory") in the data segment, for storing instructions and data belonging to the program being executed by QBX. Another chunk of memory, which we'll call the "jump table", will store a mapping from QBX instruction codes to their corresponding implementations.

On every iteration, we fetch the next instruction by reading the QBX memory at the offset specified by the value in the qip register. The value in the qip register is then changed to point to the next instruction to execute.

After that, the current instruction's code is used to find the address of the implementation for that instruction, using the jump table. The execution is transferred to that address.

The instruction implementation is a tiny piece of code that modifies parts of QBX state (registers, memory, flags) according to the instruction's definition, and transfers execution back to the instruction fetch step. This continues until the process either exits or crashes.

Helper Macros

We'll start by setting up a few helper macros, to manage some mundane and repetitive tasks. One of such tasks is invoking the Windows API functions - we will need them for exiting the process cleanly, as well as for input and output later. The file win64_helpers.inc contains all the relevant macros for creating import tables and invoking a function. Since I already described them in great detail in the previous part, I will not be spending much time on them here.

Next, we'll create a special macro that will receive the name of another macro as an argument and apply it to the list of all QBX instructions. This will help us avoid having to list all QBX instructions multiple times. Since that list will eventually get pretty long, we'll put it into its own file, qbx_instructions.inc. For now, its contents will be:

; applies a macro to the list of all QBX instruction names.
macro qbx_insns m, [arg*] {
      m arg, \
      noop, \

We will need a few more macros for dealing with instructions. Those will go into qbx_insn_helpers.inc. The first one, given a list of instruction names, will define an instruction code constant for each of them:

; define constants for each instruction code.
macro define_icodes base, [insn*] {
        local next_icode
        next_icode = base
        insn = next_icode
        next_icode = next_icode + 1

The next helper macro, given a list of instruction names, will define a jump table mapping instruction codes to their implementations:

macro define_jmp_table jmp_table_name*, [op*] {
        next_opcode = 0
        ; define constants for each instruction code.
        op = next_opcode
        next_opcode = next_opcode + 1
		; jump table name
        ; IMPL_* are defined by the `insn` macro.
        dw IMPL_#op

Reminder - if you're unfamiliar with the FASM macro system, I recommend reviewing part 1.

Finally, the last two helper macros are:

; marks the beginning of the QBX instruction implementation.
macro insn name {
      IMPL_#name = $ - insn_base

; marks the end of a QBX instruction implementation.
macro endinsn {
      ; jump back to the instruction fetch step
      jmp advance

; a QBX instruction is defined like so:
; insn instruction_name
; ... implementation goes here...
; endinsn

The insn macro defines a new assembly-time variable equal to the offset of the given instruction implementation from the insn_base label (incidentally $ in FASM stands for "current address"). define_jmp_table creates a table of those offsets. The point of storing offsets rather than absolute addresses is to simply conserve space - an absolute address requires a quad word (8 bytes) per table entry on a 64-bit architecture, but to store offsets to all instructions we only actually need 2 bytes per entry.

In qbx_registers.inc, we'll define a few aliases for mapping QBX registers onto x86-64 ones:

; general-purpose registers
q0 equ r12w
q1 equ r13w
q2 equ r14w
q3 equ r15w
qaddr equ r12 ; needed for load/store insns

q0b equ r12b
q1b equ r13b
q2b equ r14b
q3b equ r15b

; stack ptr
qsp  equ rbx
qspw equ bx
qspb equ bl

; insn ptr
qip equ rsi

; flags
qflags equ r11

Implementing the Execution Loop and Our First Instructions

Let's move on to implementing our main execution loop. It's fairly small, but first we need to get some bootstrapping out of the way, like including files with our helper macros and defining import tables:

format PE64 NX GUI 6.0
entry start

include 'win64_helpers.inc'
include 'qbx_instructions.inc'
include 'qbx_insn_helpers.inc'
include 'qbx_registers.inc'

section '.idata' import readable writeable
        import_directory_table KERNEL32, USER32
        import_functions KERNEL32, \
                         AllocConsole, \
                         WriteConsoleOutputA, \
                         GetStdHandle, \
        import_functions USER32, MessageBoxA 

As a next step, let's define the constants for all instruction codes by applying the qbx_insns macro to the define_icodes macro:

; define constants for all instruction codes
qbx_insns define_icodes, 0

Next comes our data section:

section '.data' data readable writeable
        ; define the jump table.
        qbx_insns define_jmp_table, qbx_jmp_table
		; QBX memory
        qbx_mem dw noop
                dw halt
                db 1024 dup ?

Here, we're first defining our jump table using the helper macro from earlier. Next comes the chunk of QBX memory, labelled qbx_mem. As you can see, I have hardcoded a program into it - one that does nothing for one cycle and then immediately exits. Later, we'll have better ways to place executable code into QBX memory.

Now, let's examine the part of the code that implements the execution loop. I've added comments inline:

section '.code' code readable executable
        start:               ; entry point - program starts here
                int3         ; breakpoint for the debugger
                xor qip, qip ; zero out instruction pointer
                xor rdi, rdi ; rdi will hold the next instruction code

                mov di, word [qbx_mem + qip]               ; read the next instruction
                add qip, 2                                 ; advance instruction pointer
                movzx r10, word [qbx_jmp_table + rdi * 2]  ; read offset from jump table
                add r10, insn_base                         ; compute address of insn implementation
                jmp r10                                    ; jump to insn implementation  

We've seen some of this stuff, like int3 and xor, before. Let's pick apart the loop line-by-line:

  • mov di, word [qbx_mem + qip] - qip always holds the address of the next QBX instruction to execute. At the beginning of every loop we use that value to read the next instruction code into the x86-64 register di. Every time you see square brackets in some instruction's operands, it means using the memory at the address given by the bracketed expression - essentially pointer dereferencing. In this case, we're reading from the memory pointed to by the qbx_mem label, using an offset equal to qip. Note that we have to specify the size of the data that we intend to read (word, or two bytes, in this case, since all QBX instruction codes are 2 bytes long).
  • add qip, 2 - this advances the instruction pointer and makes it point to the next instruction to execute.
  • movzx r10, word [qbx_jmp_table + rdi * 2] - here we're reading the offset of the instruction implementation from the jump table. movzx is a special form of mov that writes data from a smaller register (or region of memory) into a larger register, and zeroes out the unused bits in the target.
  • add r10, insn_base - since at this point r10 contains the offset of the instruction implementation from the insn_base label, all we need to do is add the address denoted by insn_base to it, and we have the address of the instruction implementation.
  • jmp r10 - this jump transfers execution to the address of the instruction implementation that we just calculated.

And finally, here are the implementations for our two first instructions:


    insn noop

    insn halt
     	call64 [ExitProcess], 0

noop corresponds to actual x64 instruction nop, and halt simply calls ExitProcess to cease execution.

I recommend building this program and stepping through it with WinDbg (check out part 0 if you're unfamiliar with WinDbg) to see it in action now.

Implementing the Register and Memory Manipulation Instructions

Now that we have the basic machinery for fetching and processing QBX instructions set up, we can get to implementing more of them. We'll start with instructions that read/write data from/to memory and move data between registers.

On x86-64 processors, the mov instruction is responsible for doing that. You can see the detailed documentation for mov on FĂ©lix Coutlier's excellent online x86 reference, which lists the many possible forms of the instruction. As you can see, mov is very flexible: it can move differently sized pieces of data between memory and registers. However, there are some limitations, e.g. you can't mov from memory to memory: you always have to go through a register. You also can't mov between differently sized registers. There are special instructions for that: movzx (MOVe with Zero Extend) and movzx (MOVe with Sign Extend). These two copy a value from a smaller register or piece of memory to a larger register, and set the extra bits in the target register to zero (for movzx) or the sign bit of the value (for movsx). Note that none of these instructions affect any flags.

Let's start by implementing the QBX instructions that place a value into a register. For reasons I mentioned earlier in this post, we have to implement separate instructions for placing a value into registers q0, q1, q2 and q3 respectively. Luckily, with the help of FASM's macro system it won't be too much hassle.

We'll first add the new instructions to the list of all QBX instructions in qbx_instructions.inc:

macro qbx_insns m, [arg*] {
      m arg, \
      noop,  \
      halt,  \
      moviwq0, \
      moviwq1, \
      moviwq2, \

The mnemonic is "MOVe Immediate Word to ". These instructions read the word from memory at the address specified by qip into the appropriate QBX register, and advance qip by 2. Let's take a look at the implementation:

    ; move immediate word-sized value into register.
     rept 4 reg:0 {
          insn moviwq#reg
               mov q#reg, word [qbx_mem + qip]
               add qip, 2

Here, we're seeing a new feature of the FASM macro system - the rept directive. rept instructs the assembler to repeat a block of code a fixed number of times. The number of times to repeat is passed in as the first parameter of the directive. The second parameter of the directive is the name of the counter, which can be referred to from within the code block. By default the counter is 1-based, but here we're making it 0-based.

We can now add the "MOVe Immediate Byte" group of instructions, that operates in a similar fashion (don't forget to add them to qbx_instructions.inc as well!):

        ; move immediate byte-sized value into register.
         rept 4 reg:0 {
              insn movibq#reg
                   mov q#reg#b, byte [qbx_mem + qip]
                   add qip, 1

Slightly more involved is the group of instructions for moving values between registers. It requires two nested repts and a check to make sure we're not generating code for redundant moves (i.e. q0 to q0).

        ; move word-sized value between registers.
        rept 4 tgt:0 {
             rept 4 src:0 \{
                  if ~(tgt eq \src)
                       insn movwq#tgt#q\#src
                            mov q#tgt, q\#src
                  end if

        ; move byte-sized value between registers.
        rept 4 tgt:0 {
             rept 4 src:0 \{
                  if ~(tgt eq \src)
                       insn movbq#tgt#q\#src
                            mov q#tgt#b, q\#src\#b
                  end if

Now that we've covered moving data between registers, let's turn our attention to reading and writing memory. The load group of instructions reads memory, while the stor group of instructions writes it. Every memory-related instruction requires the address that is being read or written, the register that is being loaded into or stored, as well as the size of data being read or written. The way the address is specified is determined by the instruction's addressing mode. For simplicity, QBX has only two straightforward addressing modes:

  • Direct - the absolute address at which to read or write memory is specified directly after the instruction code.
  • Indirect - the absolute address at which to read or write memory is specified in the register q0.
Thus, our memory instructions will look like "load a byte-sized value into register q3 at address specified in q0".

Below is the implementation for stores (loads are analogous). Note that qaddr maps to r12 which is the 64-bit extension of the q0 register - we need 64-bit registers to calculate addresses.

        ; store word-sized value to direct address.
        rept 4 reg:0 {
             insn storwdq#reg
                movzx rcx, word [qbx_mem + qip]
                add qip, 2 ; note that we need to modify qip to skip the address.
                mov word [qbx_mem +  rcx], q#reg

        ; store byte-sized value to direct address.
        rept 4 reg:0 {
             insn storbdq#reg
                movzx rcx, word [qbx_mem + qip]
                add qip, 2
                mov byte [qbx_mem +  rcx], q#reg#b

        ; store word-sized value to address in q0.
        rept 3 reg {
             insn storwiq#reg
                mov word [qbx_mem + qaddr], q#reg

        ; store byte-sized value to address in q0.
        rept 3 reg {
             insn storbiq#reg
                mov byte [qbx_mem + qaddr], q#reg#b

Implementing the Stack Manipulation Instructions

The stack is an area of memory where programs can put some intermediate data. As we've seen before, function return addresses, some local variables and function arguments live on the stack. The QBX virtual stack is distinct from our thread's actual stack.

The stack is characterized by its size, and a stack pointer. There are two operations we can do with it - push and pop.

Pushing data from a register onto the stack amounts to:

  • decreasing the stack pointer by the size of the written data;
  • writing the data to the stack.
Popping data from the stack into a register is:
  • reading the required amount from the stack;
  • increasing the stack pointer by the required amount.
Incidentally, the x86-64 stack manipulation instructions follow the same rules as well.

These operations are somewhat counter-intuitive because the stack pointer decreases when the stack grows. I've tried making some visual explanations for them . Here's one for push:

And one for pop:
On these diagrams, each little square represents a byte. The stack begins at the end of the address space and grows towards the start of the address space.

We'll implement push and pop for both word- and byte-sized values. Here's what the implementation looks like for word-sized values:

    ; push word-sized value onto the stack
    rept 4 reg:0 {
         insn pushwq#reg
              sub qsp, 2
              mov word [qbx_mem + qsp + 1], q#reg

    ; pop word-sized value from the stack.
    rept 4 reg:0 {
         insn popwq#reg
            mov q#reg, word [qbx_mem + qsp + 1]
            add qsp, 2            

Writing a Test Program for QBX

It's time for us to finally write a program for QBX to test out the instructions we just implemented. We hardcoded our first program into the machine's memory by putting the corresponding machine code directly into our data section. We'll be hardcoding our programs into memory for a while, but we can make things a bit more convenient.

Create a file called input.asm with the following contents:

include 'qbx_insn_helpers.inc'
include 'qbx_instructions.inc'

qbx_insns define_icodes, 0

dw moviwq0, 0xbeef, \
   moviwq1, 0xfeed, \
   movbq2q1,        \
   movwq2q0,        \
   storbdq1, 1023,  \
   loadbdq2, 1023,  \
   moviwq0, 1022,   \
   storbiq1,        \
   pushwq0,         \
   pushwq1,         \
   popwq2,          \
   popwq3,          \

What's happening here is we're using some of our helper macros to introduce the instruction code constants, and then we're using FASM's data definition directives to write QBX bytecode in a human-friendly way. The syntax is a bit awkward, but the cool thing is that labels, macros and all other FASM features still totally work!

If you compile this file now, it will produce an input.bin that can be directly loaded into QBX memory. For now, since we can't load the file dynamically, we'll include it at assembly time. Here's what our updated data section looks like:

section '.data' data readable writeable
        ; jump table mapping insn codes to insn implementations.
        qbx_insns define_jmp_table, qbx_jmp_table
        ; QBX memory.
        QBX_MEM_SIZE = 1024
        qbx_mem file 'input.bin'
                CODE_SIZE = $ - qbx_mem
                dw QBX_MEM_SIZE - CODE_SIZE dup ?

(Incidentally, FASM can be really helpful if you need to quickly generate any sort of binary file!)

Since our VM has no I/O capabilities yet, the only way we can observe the effects of this program is through the debugger. You can load up your executable in WinDbg, run it to our hardcoded breakpoint, and then single-step it. You should be able to see the changes to the registers in the corresponding window (keep in mind the mapping between QBX registers and x86-64 ones).

Monitoring changes to memory is a bit more tricky. Let's see it on an example. The illustration below shows the state of the debugger just after the instruction pushwq0 was executed.

The command db can be used to examine the raw bytes of memory at a particular virtual address. Note the mov instruction just above the highlighted line in the Disassembly window - this is what writes to memory. The effective address that is used for the write is calculated from a base (which happens to be 0x40209d) and an offset stored in rbx (recall that x86-64 rbx register corresponds to QBX stack pointer!). By adding those two values together, we get the address of the top of QBX stack. If we look at it, we'll see that it contains exactly the bytes from the register that we just pushed (QBX q0 is x64 r12w). The apparently reversed order of bytes is actually due to x86-64 little-endianness.

That's it for this part. In the next installment, we'll learn about arithmetic and logic operations, so stay tuned!

Like this post? Follow this blog on Twitter for more!