Emulating Command Buffers in OpenGL - Part 2

A few months ago I published a post with some musings on emulating command buffers on top of APIs that don't directly support them. The post sparked some interesting comments on Twitter. It turned out that command buffer emulation has been done successfully in the past, and moreover, it has been known to improve performance, because there is less jumping between driver and application code when the commands in the emulated buffer are actually executed.

This discussion gave me more confidence in the emulation approach, so I went ahead with the implementation. However, back then there was no good workload to test it against, so I just left it at "seems to work correctly".

Recently, I started thinking about it again, and decided it was a good time to do some poking to understand the performance implications better.

Implementation Details

The design I had in mind was briefly described in the previous post, and it has remained largely the same. A command (such as draw, bind vertex buffer, etc.) is represented by a tagged union. Currently, the command structure is 24 bytes in size.
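To make that concrete, here is a rough sketch of what such a command might look like. The specific command types and field names are invented for illustration; the actual command set in my implementation differs, but the overall shape is the same.

```cpp
#include <cstdint>

// Illustrative sketch of a tagged-union command.
enum class CmdType : uint32_t {
    BindVertexBuffer,
    BindIndexBuffer,
    BindTexture,
    DrawIndexed,
    // ...
};

struct Command {
    CmdType type;  // the tag
    union {
        struct { uint32_t buffer; uint32_t offset; } bind_vertex_buffer;
        struct { uint32_t buffer; } bind_index_buffer;
        struct { uint32_t texture; uint32_t slot; } bind_texture;
        struct { uint32_t index_count; uint32_t first_index;
                 int32_t base_vertex; } draw_indexed;
    } args;
};

static_assert(sizeof(Command) <= 24, "keep commands small");
```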

There are choices to make when it comes to how granular the emulated commands should be. For example, we could go for a "fat" draw command, which specifies not only the usual draw parameters, but also pretty much all the state required for the draw, including the resources that need to be bound. This potentially leaves more room for optimization, since the emulation layer could sort draw calls into an optimal order. I was not aiming to do that, however: my goal is a relatively thin abstraction over low-level APIs, so I decided to leave sorting to a higher-level entity and go with more granular commands.
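For contrast, a "fat" draw command - the road not taken - might look something like the following. This is purely illustrative; nothing like it exists in my implementation.

```cpp
// Hypothetical "fat" draw command: all the state needed for the draw travels
// with the command, so the emulation layer could reorder draws by a sort key.
// I went with small, granular commands instead.
struct FatDrawCommand {
    uint64_t sort_key;        // would let the emulation layer reorder draws
    uint32_t program;
    uint32_t vertex_buffer;
    uint32_t index_buffer;
    uint32_t textures[8];
    uint32_t index_count;
    uint32_t first_index;
};
```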

As we record a command buffer, we build up a sequence of commands, and when the buffer is "submitted", the entire sequence is interpreted by calling the relevant API functions (e.g. glDrawElements).
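In code, "interpreting" boils down to a switch over the command tag. Here is a minimal sketch, building on the illustrative Command structure above; the specific GL calls are just examples, and a real interpreter covers many more command types.

```cpp
#include <cstddef>
#include <cstdint>
// Assumes a GL loader (glad, GLEW, ...) provides the GL declarations.

// Walk the recorded commands and translate each one into the matching GL call.
void ExecuteCommands(const Command* cmds, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        const Command& c = cmds[i];
        switch (c.type) {
        case CmdType::BindVertexBuffer:
            glBindBuffer(GL_ARRAY_BUFFER, c.args.bind_vertex_buffer.buffer);
            break;
        case CmdType::BindIndexBuffer:
            glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, c.args.bind_index_buffer.buffer);
            break;
        case CmdType::BindTexture:
            glActiveTexture(GL_TEXTURE0 + c.args.bind_texture.slot);
            glBindTexture(GL_TEXTURE_2D, c.args.bind_texture.texture);
            break;
        case CmdType::DrawIndexed:
            glDrawElementsBaseVertex(
                GL_TRIANGLES,
                c.args.draw_indexed.index_count,
                GL_UNSIGNED_INT,
                (const void*)(uintptr_t)(c.args.draw_indexed.first_index * sizeof(uint32_t)),
                c.args.draw_indexed.base_vertex);
            break;
        }
    }
}
```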

An important aspect of this is memory management. We do not want to incur the full cost of a dynamic memory allocation every time a command is recorded into a buffer. To avoid it, a simple block allocator is used: we preallocate a certain amount of memory, then dole it out in blocks as needed. Each block may contain a sequence of commands of up to a certain fixed length. This helps improve locality, since we won't be jumping around memory too much when interpreting commands. If a command buffer needs more commands than can fit into a block, we simply chain blocks together. Once a command buffer is submitted and the commands have been executed, all memory used by the command buffer gets released back into the pool.
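The block-chaining idea looks roughly like this sketch; the names and sizes are invented, and the real allocator presumably does a bit more bookkeeping, but the acquire/chain/release flow is the same.

```cpp
#include <cstddef>
#include <vector>

constexpr size_t kCommandsPerBlock = 256;  // illustrative size

// Each block holds a fixed number of commands and can be chained when full.
struct CommandBlock {
    Command commands[kCommandsPerBlock];
    size_t count = 0;
    CommandBlock* next = nullptr;
};

class BlockPool {
public:
    explicit BlockPool(size_t block_count) {
        storage_.resize(block_count);          // preallocate everything up front
        for (CommandBlock& b : storage_) free_.push_back(&b);
    }

    CommandBlock* Acquire() {
        if (free_.empty()) return nullptr;     // or grow/assert, depending on policy
        CommandBlock* b = free_.back();
        free_.pop_back();
        b->count = 0;
        b->next = nullptr;
        return b;
    }

    // After a command buffer is submitted, its whole block chain comes back.
    void Release(CommandBlock* head) {
        while (head) {
            CommandBlock* next = head->next;
            free_.push_back(head);
            head = next;
        }
    }

private:
    std::vector<CommandBlock> storage_;  // the preallocated memory
    std::vector<CommandBlock*> free_;    // blocks currently available
};
```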

Stress Testing

I wanted to see how this solution would fare in a scenario where many commands are recorded into the buffer in a tight loop. Primarily I was paranoid about emulation adding too much CPU overhead on top of what the OpenGL driver does.

The workload used in this test was simple: render forty-eight thousand textured cubes, using a separate draw call and binding vertex/index buffers and textures for each of them.

I want to emphasize that this is an absolutely terrible way to render forty-eight thousand cubes. Instancing would give vastly better performance, but that's not the point of the test. The point is to strain a certain path in both the command buffer emulation layer and the OpenGL driver, simulating a workload with lots and lots of draw calls that can't necessarily be instanced.
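For reference, the per-frame work of the stress test boils down to something like the sketch below, reusing the illustrative Command and ExecuteCommands from earlier (and skipping the block allocator for brevity). The real test goes through the command buffer API, but the shape is the same: bind, bind, bind, draw, repeated for every cube.

```cpp
#include <vector>

// Hypothetical per-cube GL object handles created at load time.
struct CubeInstance {
    uint32_t vertex_buffer;
    uint32_t index_buffer;
    uint32_t texture;
};

void RecordAndSubmitFrame(std::vector<Command>& cmds,
                          const std::vector<CubeInstance>& cubes) {
    cmds.clear();
    for (const CubeInstance& cube : cubes) {       // 48,000 cubes in the test
        Command c{};

        c.type = CmdType::BindVertexBuffer;
        c.args.bind_vertex_buffer = {cube.vertex_buffer, 0};
        cmds.push_back(c);

        c.type = CmdType::BindIndexBuffer;
        c.args.bind_index_buffer = {cube.index_buffer};
        cmds.push_back(c);

        c.type = CmdType::BindTexture;
        c.args.bind_texture = {cube.texture, 0};
        cmds.push_back(c);

        c.type = CmdType::DrawIndexed;
        c.args.draw_indexed = {36, 0, 0};          // 36 indices per cube
        cmds.push_back(c);
    }
    ExecuteCommands(cmds.data(), cmds.size());     // "submit": interpret everything
}
```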

Even when built in release mode, the test application ran at around 15 frames per second on my system (GTX 970M, Core i7-4720HQ). That was about what I expected; the real question was where the slowness was coming from.

One of the CPU cores was experiencing a high load:

But that by itself doesn't mean much. Nsight Graphics reveals a more interesting picture:

The yellow line indicates the percentage of time the graphics engine of the GPU has been idle. Note how it's above 70 percent - the GPU is barely doing any work yet the application is limping along at 15 FPS. This is a telltale sign that the application is heavily CPU bound - the graphics hardware is doing work much faster than the application can submit it, so most of the time the GPU sits idle.

This is exactly the scenario I was trying to set up - lots of work being done on the CPU. Now we can look at how it's split between command buffer emulation and the OpenGL driver. Visual Studio has a really nice built-in CPU profiler that can help us do that (if you decide to try it and profile a release build, don't forget to build your application in the "release with debug info" configuration!).

Right from the start we can see that the majority of time is being spent in the function that "interprets" the command buffer - that's a bit worrisome.

However, if we drill down, we'll find that by far the most time in that function is spent in "external code", and the only external code it calls consists of OpenGL functions. This leads me to conclude that the overwhelming majority of the CPU time is tied up in the OpenGL driver:

Around 7 percent of profiler samples still end up in emulation code - the submit routine itself and the routine that releases resources back to the pool. Honestly, I was worried about emulation taking a far bigger share of CPU time than that, so I'm satisfied with these numbers for now. I don't want to spend too much time optimizing the OpenGL path anyway: when your render loop is this CPU-bound, you'd be better off using a different API that supports recording real command buffers in parallel on different threads (remember the image above with one CPU core doing all the work? Yeah, that's what all those newfangled APIs are for). Building a friendlier interface to those APIs is what I've been working on for the past few months, and the OpenGL backend was really more like "training wheels" for me.


Like this post? Follow this blog on Twitter for more!