Hash Tables: In A Cache-Sensitive World? (FIXED)
(FIXED: my "NoMatched" bench was bogus, and actually the same as "Matched" case. Graphs changed, but conclusions the same.)
Everyone loves hash tables; certainly the networking code in the kernel uses them heavily and I do too. But with more work on TDB (which is basically a hash table in a file), I've been re-measuring some of the fundamentals.
We all know that hash tables are cache-unfriendly; a good hash means you're hitting a "random" bucket every time. Back in my university days this was never mentioned (caches were new and fancy things); the optimal hash table was seen as a prime-sized array of buckets, using a primary hash to pick a bucket, then a secondary hash to jump to another bucket if that one was full. This gives fewer probes than the naive linear fallback ("if this bucket is full, try the next one"), but does that outweigh the cache-unfriendliness of the approach?
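Here's a rough sketch of the two probe sequences (illustrative code only, not my test harness; the hash functions are placeholders):

    /* Sketch: the two old-school open-addressing probe sequences. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_SIZE 100003    /* prime, as the old scheme wants */

    struct entry {
        uint64_t key, value;
        bool used;
    };
    static struct entry table[TABLE_SIZE];

    static uint64_t hash1(uint64_t key)
    {
        return key * 0x9E3779B97F4A7C15ULL;    /* placeholder hash */
    }

    static uint64_t hash2(uint64_t key)
    {
        return key >> 32;    /* placeholder secondary hash */
    }

    /* Secondary hashing: jump by a key-dependent stride. */
    static struct entry *lookup_secondary(uint64_t key)
    {
        size_t i = hash1(key) % TABLE_SIZE;
        size_t stride = 1 + hash2(key) % (TABLE_SIZE - 1);

        while (table[i].used) {
            if (table[i].key == key)
                return &table[i];
            i = (i + stride) % TABLE_SIZE;    /* far away: likely cache miss */
        }
        return NULL;
    }

    /* Naive linear fallback: just try the next bucket. */
    static struct entry *lookup_linear(uint64_t key)
    {
        size_t i = hash1(key) % TABLE_SIZE;

        while (table[i].used) {
            if (table[i].key == key)
                return &table[i];
            i = (i + 1) % TABLE_SIZE;    /* adjacent: often the same cache line */
        }
        return NULL;
    }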
These days it's normal to use a fast universal hash and a power-of-two table size (Perl 5 and the Linux kernel do this, though Perl uses another, weaker hash also by Jenkins). We also deal with hash collisions by chaining entries off the bucket rather than spilling into other hash buckets. This has two main advantages: (1) it doesn't barf when the hash table fills up, and (2) you can easily delete an entry from the hash table. Unfortunately, this is pretty cache-unfriendly too, compared to a linear bucket fallback, but can we measure it?
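For comparison, the chained scheme looks something like this (again just a sketch with placeholder names); note how the power-of-two size turns the modulo into a cheap mask, and how delete is a trivial unlink:

    /* Sketch: chained hash buckets with a power-of-two table size. */
    #include <stddef.h>
    #include <stdint.h>

    #define HASH_BITS 20
    #define HASH_SIZE (1UL << HASH_BITS)    /* power of two: mask, not modulo */

    struct chain_entry {
        uint64_t key, value;
        struct chain_entry *next;    /* one pointer of overhead per object */
    };
    static struct chain_entry *buckets[HASH_SIZE];

    static uint64_t chain_hash(uint64_t key)
    {
        return key * 0x9E3779B97F4A7C15ULL;    /* placeholder hash */
    }

    static struct chain_entry *chain_lookup(uint64_t key)
    {
        struct chain_entry *e;

        /* Every ->next we follow is another probable cache miss. */
        for (e = buckets[chain_hash(key) & (HASH_SIZE - 1)]; e; e = e->next)
            if (e->key == key)
                return e;
        return NULL;
    }

    static void chain_insert(struct chain_entry *e)
    {
        struct chain_entry **head = &buckets[chain_hash(e->key) & (HASH_SIZE - 1)];

        e->next = *head;    /* advantage 1: the table never fills up */
        *head = e;
    }

    static void chain_delete(struct chain_entry *e)
    {
        struct chain_entry **p = &buckets[chain_hash(e->key) & (HASH_SIZE - 1)];

        while (*p != e)    /* advantage 2: delete is a simple unlink */
            p = &(*p)->next;
        *p = e->next;
    }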
To this end, I wrote a simple example program which filled each kind of hash table (linear bucket fallback, secondary-hash bucket fallback, and chained hash buckets) to 50%, then ran successful lookups, then failed lookups. This is on a 64-bit machine, and I chose 2^27 buckets, giving a 1GB hash table. While most hashes are not this large, it's useful to approximate a situation where other things are also using the cache.
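The skeleton of the benchmark is roughly this (a sketch; insert(), lookup() and random64() are assumed stand-ins for each table variant's routines and a 64-bit PRNG):

    /* Sketch of the benchmark shape; insert()/lookup() stand in for
     * whichever table variant is being measured. */
    #include <stdint.h>
    #include <stdlib.h>

    #define HASH_BITS   27
    #define NUM_BUCKETS (1UL << HASH_BITS)    /* 2^27 8-byte buckets = 1GB */
    #define NUM_OBJS    (NUM_BUCKETS / 2)     /* fill to 50% */

    void insert(uint64_t key);    /* assumed: per-variant insert */
    void *lookup(uint64_t key);   /* assumed: per-variant lookup */
    uint64_t random64(void);      /* assumed: 64-bit PRNG */

    void bench(void)
    {
        uint64_t *keys = malloc(NUM_OBJS * sizeof(*keys));
        size_t i;

        for (i = 0; i < NUM_OBJS; i++) {    /* 1: fill to 50% */
            keys[i] = random64();
            insert(keys[i]);
        }
        for (i = 0; i < NUM_OBJS; i++)      /* 2: successful lookups */
            lookup(keys[i]);
        for (i = 0; i < NUM_OBJS; i++)      /* 3: failed lookups */
            lookup(random64());             /* fresh keys: almost surely absent */
        free(keys);
    }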
These results surprised me, so I fiddled a bit, and the basic shape remained: the differences vanished if the table was (say) only 10% full, but even at 25% full we see chaining win. Here it's about 15% faster on matched lookups (and 5% faster on misses).
What does this mean? It looks like linear edges over secondary for insertion, but for lookup it's a wash. Both are overshadowed by chaining, which doesn't penalize other buckets when there's a hash collision. Even though objects are one pointer larger in the chaining case, it still wins (the test code uses two 64-bit values as the basic object).
But Now For The Twist...
The reason I was looking at old-school non-chained hashing schemes was because I have a new trick for them: stashing some hash bits in the bottom bits of the pointer. Since 64-bit-aligned pointers always have three zero bits at the bottom, we can store three bits of the key's hash there and compare them before following the pointer; for an entry that doesn't match, those three bits differ 7/8 of the time, so we skip 7/8 of the dereferences.
So, we add that trick to the linear case and re-run the benchmark.
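In rough sketch form, the tagged lookup looks something like this (placeholder names, not my actual test code; the stashed bits come from the top of the hash, since the bucket index already consumes the bottom bits):

    /* Sketch: linear probing over tagged pointers.  The low 3 bits of
     * each 8-byte-aligned pointer hold the top 3 bits of the key's
     * hash, so a mismatching entry is rejected without a dereference
     * 7/8 of the time. */
    #include <stddef.h>
    #include <stdint.h>

    #define TAG_BITS 3
    #define TAG_MASK ((uintptr_t)((1 << TAG_BITS) - 1))

    struct obj {
        uint64_t key, value;
    };

    static uint64_t obj_hash(uint64_t key)
    {
        return key * 0x9E3779B97F4A7C15ULL;    /* placeholder hash */
    }

    static uintptr_t tag_ptr(const struct obj *p, uint64_t h)
    {
        /* Stash the top hash bits: the bucket index uses the bottom ones. */
        return (uintptr_t)p | (uintptr_t)(h >> (64 - TAG_BITS));
    }

    static struct obj *untag_ptr(uintptr_t e)
    {
        return (struct obj *)(e & ~TAG_MASK);
    }

    static struct obj *lookup_tagged(uintptr_t *table, size_t size_mask,
                                     uint64_t key)
    {
        uint64_t h = obj_hash(key);
        uintptr_t want = (uintptr_t)(h >> (64 - TAG_BITS));
        size_t i = h & size_mask;

        while (table[i]) {
            /* Cheap in-table check: usually no dereference needed. */
            if ((table[i] & TAG_MASK) == want) {
                struct obj *o = untag_ptr(table[i]);
                if (o->key == key)
                    return o;
            }
            i = (i + 1) & size_mask;
        }
        return NULL;
    }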
Now we can see the potential of not chaining through hash entries! It's a full 25% faster than the normal linear case, and 10% faster than chaining (almost 20% faster in the missed case). Of course, for real life we need to handle delete and hash table filling. And that's the subject of my next post, once I've finished coding it!