My Most Important Project Was a Bytecode Interpreter

In online forums, I often see questions from people new to C++ asking for recommended reading. I usually respond with a certain set of trusted books, with the caveat that no amount of reading makes up for practice: you have to actually work on something. But what would be a good project to work on? Something that could teach you a lot, yet simple and interesting enough to keep you motivated? I've thought about it recently, and I think I got it. You should definitely write a bytecode interpreter. For me, doing it was a decisive moment that helped shape my subsequent career.

How It Started

The year 200X was my second year at the university. By then I've already had a fair bit of exposure to programming. I was able to use the abstractions provided by C++, but didn't really understand how it all worked. To me, the compiler and the OS were just black boxes that followed my magical incantations, and I sort of accepted it.

My lack of knowledge didn't deter me from being a vocal participant in programming language wars. One of such many-page threads led me to reading some stuff about the Java Virtual Machine, and discovering stack-based architecture.

My understanding of the x86 architeture was super vague, and I had no exposure to any other architectures at all. The idea of a machine with no registers seemed pretty interesting and novel. I thought about it so much that one day I decided to sit down and write my own simple stack-based VM.

What It Looked Like

The Stupid Virtual Machine (or SVM for short) took the most straightforward approach possible. The machine's word size was 32 bits, and the memory was word-addressable (you couldn't address individual bytes). The areas of memory for the program code and data were completely separate from each other (later I learned that it is the defining characteristic of Harvard architecture). Even the stack was in its own separate area of memory.

The instruction set was also pretty simple. It had your standard arithmetic and logic, memory, stack manipulation and jump instructions, which functioned in the way one would expect. For example, the ADD instruction would take the top 2 32-bit values from the stack, add them as signed integers and push the result onto the stack.

The input/output was primitive, wired to stdin/stdout. There were IN and OUT instructions. The former would push the result of the read onto the stack, the latter would print the top of the stack. For convenience, there was a special flag that controlled whether the input was treated as raw bytes or as a string representation of a signed integer.

What Programming It Was Like

Initially, I wrote all programs for SVM in raw machine code, with a hex editor. That got old fast, so I wrote an assembler with support for labels and string literals. For example, here's what "Hello, World" looked like.

"Hello, World!\n"

When the assembler encounters a string literal, it pushes each byte onto the stack. PRINT isn't an instruction, it's really an assembler macro that generates a loop that prints each character on the stack until a 0 is encountered.

Writing and reading stack machine code feels weird. For a slightly more advanced example, here's what calculating the greatest common divisor looked like:

IN ; read integer "A" from standard input
IN ; read integer "B" from standard input

:GCD ; this is a label
DUP ; if B is 0 then A is the gcd
0 ; (immediate values get pushed on the stack)
@END ; (this is how you put the address of a label onto the stack)
JE ; (this will jump to the address at top of stack if the preceding two values are equal)
SWP ; if B is not 0 then the result is gcd(B, A modulo B)
JMP ; recursion!

POP ; remove 0 from top of stack 
OUT ; now the result is at the top, print it.

This demonstrates the use of labels and conditional jumps. The inline comments explain what exactly is going on.

If you want an even more advanced example, you're welcome to read assembly code for insertion sort, though I wouldn't blame you if you decide to skip that - it's a bit long, which is why I'm not including it directly in the post.

If you're interested, you can take a gander at the code of the VM and the assembler that I dug up from old files. It's nothing special though, and, if anything, it should be an example of how NOT to write bytecode interpreters. Stack-based VMs need some tricks to perform on par with their register-based counterparts. Of course I didn't know any of that, and took the most naive approach which didn't do any favors for performance.

Why You Should Do It Too

Writing even a naive a bytecode interpreter and a bunch of programs for it makes you think about how things that you usually take for granted actually operate under the hood.

For example, when I started thinking about implementing procedures in SVM, I realized that a function call is nothing more than a jump with an additional set of rules that the caller and callee functions implicitly agree on. This made me understand the concept of calling convention and magic incantations like _cdecl and WINAPI started making sense.

Sure, doing this type of project won't teach you all the intricacies of C++ or C, but it helps you get in the right mindset. Back then, I used to have a certain mental barrier that made me look at unfamiliar things as black boxes full of magic that couldn't be touched. Shattering that barrier is important if you're going to be doing any kind of programming, and especially if you want to use low-ish level languages like C++ or C. This controlled exposure to "simulated low-level" programming trained me to not be afraid of segfaults and looking at disassembly. It has done me a huge service in the years that followed, and that's why I consider it one of the most important projects I did.

Some More Ideas

  • To kick things up a notch, you could implement some graphical capabilities in your VM. You could even write some simple games for it!
  • Try writing a compiler from a simple high-level language into your bytecode. Unfortunately, I never got around to writing a compiler for SVM (instead, I ended up writing an x86 assembly generating compiler later for a compilers course at the university). That would be a great exercise that requires a bit more theoretical knowledge.
  • Making your interpreter run as fast as possible is another interesting, but challenging task. It's not that easy.

Like this post? Follow this blog on Twitter for more!