So my previous post showed a trick with stuffing extra bits into the pointers of a simple hash table.Â In most real situations, we want the hash to still be useful as the number of elements increases: the simple hash I used will slow down dramatically as we pass 75% utilization.Â In addition, we want to be able to delete entries, which can be slightly painful in a linear hash.
Since then, I’ve been trying a few different approaches: in fact, I spent quite a while down this rabbit hole before digging myself out.Â Here are the approaches I benchmarked (on a 64-bit system):
- Doubling: Linear hash with extra hash bits in the pointers, doubling and rehashing everything when it hits > 5/8 full.
- Cacheline Tree: Divide the hash into 128-byte groups.Â Do linear collision resolution within each group, and when a group fills explode it into 16 pointers to separate hash tables of N 128-byte groups.Â When that fills, explode again, etc.Â In practice, N was 2.
- Cacheline Tree Then Linked:Â Explode the top levelÂ like the above, but allocate a single cacheline (N=1) for the second level and just fill the second level from first bucket onwards.Â If that fills, simply chain on another cacheline.Â This actually freed up on delete as chains were emptied.
- Cacheline Linked: Even simpler: when the toplevel cacheline fills, just chain another cacheline off it.Â Each cacheline is simply filled in order, but we only put 15 pointers in each cacheline and use the extra 4 bits per entry for additional hash bits (to reduce comparisons).Â This also freed them up on delete.
- Chaining: Double linked list through each object: when a bucket fills we simply append the new entry to the existing ones.
Then I benchmarked a 2M-bucket (actually 2097152) hashtable of each type, increasing the number of entries by a factor of 2 at a time from 1M.Â The good thing about visiting the Canberra office is I could use one of theÂ older Power machines here with gobs of memory.
The first three tests were very similar to the last post, except I used an array of scrambled pointers so I wasn’t accessing the objects in memory order.Â The last two are new:
- Insert all the objects into the hash.
- Find all the objects in the hash.
- Do the same number of missed searches in the hash.
- Delete half the objects, then time reinserting them (steady-state test which avoids any allocation penalties for expansion).
- Delete every object from the hash.
Our simple link of cachelines (CLT Linked) gets so slow I took it out of the benchmark after 8x overload. Chained is the winner: we expect it to modify the inserted entry, the hash bucket and the current top entry for a total of three cache misses (and the object being inserted would generally be hot). It also never needs to (re)alloc or rehash.
We can see the CLT-then-linked case is getting worse as we fall into the chain and have to linear search for the entry to place into; there are 16 times as many of these chains as in the dumb “link of cache lines” case, because we have a link off each pointer off each toplevel hash entry, rather than one off each cacheline.
If we look at reinserts, we can remove the realloc and rehash overhead:
In steady-state we can see the others are more competitive, though even our flat linear hash loses a little as it scans for the next NULL bucket.
The optimal reasonable case would be one pointer per object; 8 bytes. The two linked cases quickly approach this (at which point they suck). The chaining cases uses two pointers; doubling uses 2 pointers for these values too (it can do slightly better just before it doubles, and worse just afterwards).
The real story is the cacheline tree. It hits a tipping point as buckets fill: that 512-byte second level off each top level pointer means we can spend 131 bytes per object just after we fill the top level. At 8 million objects in our 1 million buckets we are in fact spending 131 bytes per object (vs 17 for the chained case).
This is why I reduced the second level hash size to two cachelines; and we still get that factor of 16 explosion on each expansion. In fact, the chaining hashes were an attempt to find a “good enough” variant which didn’t explode like this. The chaining variant (CLT then linked) explodes once, to 65 bytes per entry, but then expands linearly.
This is where the cache damage of chaining becomes apparent. Doubling performs best, and the tree is close behind. Even a linked list of cachelines beats chaining: with 15 entries per cacheline on average 13 of those will not need to be examined (since the lower pointer bits where we’ve stashed another 3 hash bits will only accidentally match for two of them). This means 1/5 of the cache misses of the chained case for really large overload values.
By 64 entries per bucket (128M objects) chaining is 16 times slower than doubling for a successful lookup, and 20 times slower for an unsuccessful one. Even at 8 per bucket, it matches our terrible chained case at being 3 times slower than the linear case.
Since I assume most hashes are lookup heavy, this result should be weighted more than add or delete.
Chained delete is cheap: you might get three cache misses (one for the object you’re deleting, and one for the next and prev objects in the linked list) but you don’t need to do a hash lookup at all. The linear hash has a cache miss to cache the object, then a cache miss to do the delete, but then it has to rehash any objects in immediately following hash buckets. It’s 2.2 times slower at 64x overload. We could use a special value to avoid this.
Cacheline tree does OK, but still needs to hash and similarly reseat any “runs” within the bottom cacheline. At three levels, add cache misses for the second level and third levels. 3.1 times slower than chained at 64x overload.
I’d love to spend more time refining these results. In particular, there are ways to steal more bits to store extra hashing, ways to avoid reseating entries on linear deletes, ways to avoid rehashing to double the linear case. Compression of pointers is also possible, but it gets messy since there are a no longer a fixed number of entries per cacheline. I might get time for some of this, but at some point you have to get on with less abstract work!
But the takeaway is this: it’s hard to beat the simplicity of a big hash array which doubles and stashes a few extra hash bits. If your workload is add/delete heavy, use chaining. If you are worried about unknown work sizes, use doubling. Or do what Perl does, and do both (though you can’t use the extra hash bit stowing with chaining).
The cache tree idea is not best at anything, unless you can’t double for some reason (the Linux kernel probably has this issue, and TDB may or may not).
Here’s another variant for you, if you’ve still got the interest: instead of calling the hash function every now and then, store the 32-bit hash value per slot (provided by the caller). This comes out to 8 bytes on a 32-bit architecture, 8 slots per cacheline. Even on a 64-bit machine it’s possible to mitigate the increase in data by issuing prefetches for the next two cachelines in the table after the one that’s going to be accessed anyhow.
Lookup becomes a “keyhash == t[k].hash && (*equalfn)(key, t[k].key)” test in a loop. Also, using a “this slot was deleted” marker, in the key pointer field, in linear collision handling is a good idea; if you also use a one-bit-distinct magic value for “this slot has not been used” marker, you can test whether a slot is free with a mask and a compare.
Thanks! I used a variant of this for the (otherwise fatally flawed) naive chaining case. 15 pointers per cacheline, then 4 bits each at the end. That gives a total of 7 hash bits for each pointer, implying < 1% false compare rate.
I'll see what this does to the results when integrated in the linear case; same with a deleted marker. The results will feed into ccan, if nothing else…
Very impressive results.
Why do you use a doubly-linked list for chaining, rather than a singly-linked list? That would probably improve chaining in several of your use cases.
Yep, but you lose O(1) delete with a single-linked list. Perl and the kernel use a doubly-linked list, so it seemed fair. We’d save a pointer per object, at cost of delete being worse than the flat case (1 + objects-per-bucket / 2 cache misses).
Why would delete from single-linked list be slower than O(1)?
Compared to double linked list one has to just keep track of the previous node on the stack while scanning for the node to delete.
Or to reuse the find code in delete the core find method must return always the previous entry of the match. Then delete can do it’s link manipulation directly and the api find just returns always the next entry that the common find method returns.
Sorry, perhaps I was unclear. The delete time is measured assuming one already has the object via some other method; it could be find() then del(), but not necessarily. ie. as you suggest, we would do a find() inside del() without the double-linked list. That’s O(length-of-list).
Comments are closed.