Prematurely hand-optimizing C++ code for shits and giggles

Left in the void that comes after abandoning a project, I decided to port a previous JavaScript / WebGL project to C++. Changing languages meant a complete overhaul of some core systems. I was looking at a post by Austin Morlan about ECS architectures for inspiration.

Here I saw this procedure:

EntityManager()
{
  // Initialize the queue with all possible entity IDs
  for (Entity entity = 0; entity < MAX_ENTITIES; ++entity)
  {
    mAvailableEntities.push(entity);
  }
}

This isn’t bad code. Nonetheless, I did not like it. And not just because I have a distaste for CamelCase.

Initializing mAvailableEntities can potentially take a long time. In the example, MAX_ENTITIES was set to 5k, so it’s not so bad, but I may require far more entities than that.

Had this been a regular project with a clear goal, the correct approach would be to leave the code as is, since it’s certainly readable. Then, if we ran into performance issues in practice, we’d find the procedures taking up the most time, optimize those, and use statistical analysis to determine whether our changes actually mattered. That is the correct way to optimize performance. That’s not what I’m going to do here.

In this particular architecture, mAvailableEntities is a pool of all available entities. That is why it must be filled. We cannot use a simple incrementing counter because entities can be destroyed as well, in which case we’d want to recycle them.
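
The recycling side of such a pool boils down to pushing the ID back when an entity is destroyed, something like this (a sketch of the idea, not a quote from Morlan’s post):

void DestroyEntity(Entity entity)
{
  // Hand the ID back to the pool so a later create call can reuse it
  mAvailableEntities.push(entity);
}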

Instead, we can combine the incrementing counter and the recycling pool. If the pool is not empty, grab an element from there, otherwise, use the counter. Unfortunately, that sentence contains the word “if”.

A naive implementation of the above is:

#include <queue>
#include <cstdint>

using entity = std::uint32_t;
entity counter = 0;
std::queue<entity> recycling_pool;

entity create_entity() {
  if (recycling_pool.empty()) {
    return counter++;
  } else {
    auto e = recycling_pool.front();
    recycling_pool.pop();
    return e;
  }
}

(I’ve left out the check to prevent going over MAX_ENTITIES)

and as we can see on Compiler Explorer, this compiles to conditional jumps (jne):

create_entity():
        movq    recycling_pool+16(%rip), %rax
        cmpq    recycling_pool+48(%rip), %rax
        pushq   %r12
        jne     .L9
...

Conditional jumps can be a performance problem.

Modern CPUs have deep pipelines. Rather than fully executing one instruction and then moving on to the next one, they execute every instruction in multiple steps, like an assembly line. While instruction n is going through step x, instruction n-1 (the instruction that came before n) is going through step x+1, and instruction n+1 is going through step x-1. A pipeline of 20 steps is essentially doing 20 things at the same time and is therefore ~20x as fast as executing instructions one by one.

These pipelines require knowing ahead of time which instructions are coming up, which is uncertain with conditional jumps. Branch prediction works in some cases, such as loops, but not always.
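
As a toy illustration (mine, not from the post): the branch below depends on the data, so with random input the predictor guesses wrong a lot, while with sorted input it is almost always right (assuming the compiler actually emits a branch here rather than a conditional move):

#include <cstdint>
#include <vector>

// Sums only the elements above 127. Whether this loop is fast depends less on
// the arithmetic than on how predictable the data-dependent branch is.
std::uint64_t sum_large(const std::vector<std::uint8_t>& data)
{
  std::uint64_t sum = 0;
  for (auto x : data) {
    if (x > 127) {  // hard to predict on random data, trivial on sorted data
      sum += x;
    }
  }
  return sum;
}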

So how do we get rid of branching?

The concept is conditional write-back. Rather than pick one branch or another, we do all branches, but don’t write back the result in some cases.

C++ does not support conditional write-back directly, so we improvise a bit.
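
To make that concrete before applying it here: a generic branchless select can be faked with a mask derived from the condition (my own sketch, not code from the post):

#include <cstdint>

// Returns a if condition is true, b otherwise, without a conditional jump:
// both inputs are already computed, the mask just decides which one survives.
std::uint32_t select(bool condition, std::uint32_t a, std::uint32_t b)
{
  std::uint32_t mask = -static_cast<std::uint32_t>(condition); // all 1s or all 0s
  return (a & mask) | (b & ~mask);
}

(A plain condition ? a : b may well compile to a conditional move and achieve the same thing; the explicit mask is just to show the trick we’re about to use.)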

First, replace std::queue with a stack, and just use a primitive array entity recycling_pool[MAX_ENTITIES]. This will be a bit easier to hack.

Then, we specify entity 0 to be invalid. We initialize recycling_pool with a 0 in the first element. Now we can take the head of recycling_pool and it will function both as condition and value. This is rather like Maybe Entity in e.g. Haskell, but more efficient and far less safe.

entity create_entity()
{
  entity new_entity = recycling_pool[unused_idx];
  bool b_stack_empty = !new_entity;
...
}

Now we have divergence: if not b_stack_empty, we want to pop the stack, otherwise, we want to increment the counter.

b_stack_empty is a boolean with the value 0 or 1, since it was generated by the ! operator (anything that isn’t 0 counts as true in C++, but true converts to 1 as an integer). We can abuse this:

  unused_idx -= !b_stack_empty;
  gen_counter += b_stack_empty;

(NOTE: gen_counter starts at 0, but valid values start at 1, so it’s increment-then-use)

This is a specialized form of conditional write-back.

Then, if b_stack_empty, we want to replace new_entity, which will be an all-0 bits invalid value, with the current value of gen_counter:

  new_entity += -b_stack_empty & gen_counter;

Since b_stack_empty can have the values 0 or 1, a non-empty stack will evaluate to 0 & gen_counter, which equals 0, whereas an empty stack will give -1 & gen_counter. Since we’re dealing with unsigned integers, -1 is all 1s in binary, so -1 & gen_counter equals gen_counter. In the latter case we also know new_entity = 0, and 0 + gen_counter = gen_counter.
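
A quick sanity check of that mask trick, using an arbitrary counter value of 7 (my own, not from the post):

#include <cstdint>

constexpr std::uint32_t example_counter = 7; // stand-in for gen_counter

// empty stack: b_stack_empty == 1, the mask is all 1s, we get the counter back
static_assert((-std::uint32_t{1} & example_counter) == example_counter, "");

// non-empty stack: b_stack_empty == 0, the mask is all 0s, we get 0
static_assert((-std::uint32_t{0} & example_counter) == 0, "");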

And so the full code (minus the MAX_ENTITIES check) is:

#include <cstdint>

using entity = std::uint32_t;
entity unused_idx;
entity gen_counter;
const entity MAX_ENTITIES = 5000;
entity recycling_pool[MAX_ENTITIES];

entity create_entity()
{
  entity new_entity = recycling_pool[unused_idx];
  bool b_stack_empty = !new_entity;
  unused_idx -= !b_stack_empty;
  gen_counter += b_stack_empty;
  new_entity += -b_stack_empty & gen_counter;
  return new_entity;
}

Compiler Explorer confirms there are no conditional jumps.

The resulting assembly is also a lot smaller (24 lines vs 103), but I expect that is mostly due to the omission of std::queue.
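
For completeness, the matching destroy on the array stack is also branch-free; a sketch of my own (again ignoring the MAX_ENTITIES check, and not handling double-destroys):

void destroy_entity(entity e)
{
  // Push the ID onto the stack; slot 0 stays reserved as the "empty" sentinel
  recycling_pool[++unused_idx] = e;
}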

I want to reiterate that this is dumb. The code is likely to have gotten slower in practice, even though it is more consistent, because branch prediction is probably right more often than not, and now we’re executing all the branches. Code maintainability has also suffered.

This is merely an interesting learning experience.