Prematurely hand-optimizing C++ code for shits and giggles
Left in the void that comes after abandoning a project, I decided to port a previous JavaScript / WebGL project to C++. Changing languages meant a complete overhaul of some core systems. I was looking at a post by Austin Morlan about ECS architectures for inspiration.
Here I saw this procedure:
EntityManager()
{
    // Initialize the queue with all possible entity IDs
    for (Entity entity = 0; entity < MAX_ENTITIES; ++entity)
    {
        mAvailableEntities.push(entity);
    }
}
This isn’t bad code. Nonetheless, I did not like it. And not just because I have a distaste for CamelCase.
Initializing mAvailableEntities potentially takes a long time. In the example, MAX_ENTITIES was set to 5k, so it’s not so bad, but I may require far more entities than that.
Had this been a regular project with a clear goal, the correct approach would be to leave the code as is, since it’s certainly readable. Then, if we ran into performance issues in practice, we’d find the procedures taking up the most time, optimize those, and use statistical analysis to determine whether our changes actually mattered. That is the correct way to optimize performance. That’s not what I’m going to do here.
In this particular architecture, mAvailableEntities is a pool of all available entities. That is why it must be filled. We cannot use a simple incrementing counter because entities can be destroyed as well, in which case we’d want to recycle them.
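To make the recycling concrete: when an entity is destroyed, its ID simply goes back into the pool so a later create can hand it out again. A minimal sketch (my own naming, not the exact code from the post):

#include <queue>
#include <cstdint>

using entity = std::uint32_t;

std::queue<entity> recycling_pool;

// Recycle: the destroyed entity's ID becomes available again.
void destroy_entity(entity e) {
    recycling_pool.push(e);
}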
Instead, we can combine the incrementing counter and the recycling pool. If the pool is not empty, grab an element from there; otherwise, use the counter. Unfortunately, that sentence contains the word “if”.
A naive implementation of the above is:
#include <queue>
#include <cstdint>

using entity = std::uint32_t;

entity counter = 0;
std::queue<entity> recycling_pool;

entity create_entity() {
    if (recycling_pool.empty()) {
        return counter++;
    } else {
        auto e = recycling_pool.front();
        recycling_pool.pop();
        return e;
    }
}
(I’ve left out the check to prevent going over MAX_ENTITIES.)
As we can see on Compiler Explorer, this compiles to conditional jumps (jne):
create_entity():
        movq    recyclingPool+16(%rip), %rax
        cmpq    recyclingPool+48(%rip), %rax
        pushq   %r12
        jne     .L9
        ...
Conditional jumps can be a performance problem.
Modern CPUs have deep pipelines. Rather than fully executing one instruction and then moving on to the next one, they execute every instruction in multiple steps, like an assembly line. While instruction n is going through step x, instruction n-1 (the instruction that came before n) is going through step x+1, and n+1 is going through step x-1. A pipeline of 20 steps is essentially doing 20 things at the same time and is therefore ~20x as fast as it would be executing instructions one-by-one.
These pipelines require knowing ahead of time which instructions are coming up, which is uncertain with conditional jumps. Branch prediction works in some cases, such as loops, but not always.
So how do we get rid of branching?
The concept is conditional write-back. Rather than picking one branch or another, we execute all branches, but don’t write back the result in some cases. C++ does not support conditional write-back directly, so we improvise a bit.
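The improvisation is the standard masking trick: turn the condition into an all-ones or all-zeros bitmask and let arithmetic do the selection. A generic sketch of the idea, not yet specific to the entity code (select is just an illustrative name):

#include <cstdint>

// Branchless "cond ? a : b" for unsigned integers.
// -cond is all 1s when cond is true and all 0s when it is false,
// so the AND keeps either the whole difference or nothing.
std::uint32_t select(bool cond, std::uint32_t a, std::uint32_t b) {
    std::uint32_t mask = -static_cast<std::uint32_t>(cond); // 0xFFFFFFFF or 0
    return b + (mask & (a - b)); // b when cond is false, a when cond is true
}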
First, replace std::queue with a stack, and just use a primitive array entity recycling_pool[MAX_ENTITIES]. This will be a bit easier to hack.
Then, we specify entity 0 to be invalid. We initialize recycling_pool with a 0 in the first element.
Now we can take the head of recycling_pool and it will function both as condition and value. This is rather like Maybe Entity in e.g. Haskell, but more efficient and far less safe.
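For contrast, a safe version of “one value that doubles as the condition” would carry the emptiness flag explicitly, e.g. via std::optional; the sentinel trick collapses the two into a plain integer (head_or and fallback are just illustrative names):

#include <optional>
#include <cstdint>

using entity = std::uint32_t;

// The safe shape of the idea: emptiness and value are separate,
// and the check is explicit.
entity head_or(std::optional<entity> head, entity fallback) {
    return head.has_value() ? *head : fallback;
}

// The sentinel shape used below: 0 doubles as "nothing here",
// so a single integer carries both the condition and the value.
entity head_or(entity head, entity fallback) {
    return head ? head : fallback;
}

With the zero sentinel in place, create_entity starts out like this: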
entity create_entity()
{
    entity new_entity = recycling_pool[unused_idx];
    bool b_stack_empty = !new_entity;
    ...
}
Now we have divergence: if not b_stack_empty, we want to pop the stack; otherwise, we want to increment the counter.
b_stack_empty is a boolean with the value 0 or 1, since it was generated from the ! operation (anything that isn’t 0 counts as true in C++, but true as output is 1). We can abuse this:
unused_idx -= !b_stack_empty;
gen_counter += b_stack_empty;
(NOTE: gen_counter starts at 0, but valid values start at 1, so it’s increment-then-use.)
This is a specialized form of conditional write-back.
Then, if b_stack_empty, we want to replace new_entity, which will be an all-0-bits invalid value, with the current value of gen_counter:
new_entity += -b_stack_empty & gen_counter;
Since b_stack_empty can have the value 0 or 1, a non-empty stack will evaluate to 0 & gen_counter, which equals 0, whereas an empty stack will give -1 & gen_counter, and since we’re dealing with unsigned integers, -1 is all 1’s in binary, so -1 & gen_counter equals gen_counter. In the latter case we also know new_entity = 0, and 0 + gen_counter = gen_counter.
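To sanity-check both cases with concrete numbers, here is a standalone trace of the arithmetic that produces new_entity (assuming gen_counter was 41 before the call):

#include <cassert>
#include <cstdint>

using entity = std::uint32_t;

int main() {
    // Case 1: stack not empty, new_entity = 7 was read from the pool.
    {
        entity gen_counter = 41, new_entity = 7;
        bool b_stack_empty = !new_entity;            // false
        gen_counter += b_stack_empty;                // stays 41
        new_entity  += -b_stack_empty & gen_counter; // += 0, stays 7
        assert(new_entity == 7);
    }
    // Case 2: stack empty, the 0 sentinel was read.
    {
        entity gen_counter = 41, new_entity = 0;
        bool b_stack_empty = !new_entity;            // true
        gen_counter += b_stack_empty;                // now 42
        new_entity  += -b_stack_empty & gen_counter; // += 0xFFFFFFFF & 42 = 42
        assert(new_entity == 42);
    }
}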
And so the full code (minus the check for exceeding MAX_ENTITIES) is:
#include <cstdint>

using entity = std::uint32_t;

entity unused_idx;
entity gen_counter;
const entity MAX_ENTITIES = 5000;
entity recycling_pool[MAX_ENTITIES];

entity create_entity()
{
    entity new_entity = recycling_pool[unused_idx];
    bool b_stack_empty = !new_entity;
    unused_idx -= !b_stack_empty;
    gen_counter += b_stack_empty;
    new_entity += -b_stack_empty & gen_counter;
    return new_entity;
}
Compiler Explorer confirms there are no conditional jumps. The resulting assembly is also a lot smaller (24 lines vs 103), but I expect that is mostly due to the omission of std::queue.
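For completeness: the listing never pushes anything onto recycling_pool, so the pool would stay empty forever. A matching destroy, reusing the globals above (destroy_entity is my own naming and omits double-destroy and bounds checks), could look like this:

// Push the destroyed ID onto the stack so create_entity() can reuse it.
// Slot 0 stays reserved for the 0 sentinel, hence the pre-increment.
void destroy_entity(entity e)
{
    recycling_pool[++unused_idx] = e;
}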
I want to reiterate that this is dumb. The code is likely to have gotten slower in practice, even though it is more consistent, because branch prediction is probably right more often than not, and now we’re executing all the branches. Code maintainability has also suffered.
This is merely an interesting learning experience.