After my previous conclusion that doubling linear hash tables were good enough, it turns out that they don't work for TDB: enlarging or moving the hash table in the presence of tdb_chainlock is not possible.  So, I ended up implementing expanding hash tables, and ran into a familiar set of problems.

The design is simple; we start with a simple hash table.  When it gets too full, we expand a bucket to point to a sub-hashtable, which we similarly expand to sub-tables when it fills.  Seems like a nice logarithmic expanding data structure that means if we use 8 bit hash tables (ie. 256  entries) we'll fit well over a billion entries in a 4 level deep set of tables.

Yet as we discovered previously, the hash density is horrible.  To see why, imagine our entries hit the hash buckets in order: the first entry belongs in hash bucket 0, the next in hash bucket 1... it's all good for the first 256 entries, and our single top-level hash table ends up with 100% occupancy.  However, the 257th entry collides in bucket 0, causing it to expand into a 256-entry subtable which only has two entries in it.  By the time we have 512 entries, we've got a top-level hash table full of pointers to 256 sub-hash tables; that's a total of 68352 hash buckets holding a mere 512 entries.  That's 0.75% density, or an overhead just over 1k per entry!

In practice our arrangement is not perfectly even, so we don't get as low as 0.75%, but we certainly don't get up to 100% either.  Our density wavers between a minimum of 1.3% and a maximum of 4.5% (I simulated this; if someone has pointers to a good "Hash Statistics For Beginners" course please tell me!).  Our average density sits roughly halfway between those two, so I'll be quoting min and max values from here in.

So, how do we improve this?  The first thing I tried is to organize our hash into groups: we only expand when we fill an entire group.[1].  Say we use 8-entry groups; if your bucket is full, you try the next one in the group.  If all 8 are full, then you expand all the buckets into 8 new sub-tables.  At expense of a slightly longer search time (we now have to search up to 8 buckets at the bottom level, rather than just 1), this means we tend to be fuller before we expand.  This raises the maximum density to a respectable 25.3%, though the minimum drops to 0.6%.  Still, our average density has jumped from 3% to 13%.

So, what if we don't expand all 8 buckets at once, but expand the most popular one in the group?  Our minimum density rises to 1.5%, and our maximum to 33%; closer to a 17% average.  Unfortunately calculating the most popular bucket means re-calculating the hash of all entries in the group.  With the expand-all approach, we needed to recalculate hashes for 8 entries to decide into which subtable each entry should be inserted.  With the expand-one approach, we calculate hashes for 8 entries, then 7 next time, then 6... 36 calculations (ie. O(N^2)).

We do have the hash value of the current entry: what if we simply expand the bucket that belongs in?  It may not be the fullest bucket, but we know  it's not going to be empty.   This gives a close-to-optimal result, with densities ranging from 1.2% to 31%.

Now note that with all these approaches, expanding a bucket to point to a subtable means messing with a full group: some entries may go down into the subtable, others may move within group.  So we need to re-calculate the hashes anyway unless we store some extra information in the hash bucket somewhere.  For TDB2, we don't really need 64 bits for entries: 64 exabytes should be enough for anyone, so the top 8 bits are free, and all records are 8-byte aligned so the bottom 3 bits are free.

There are two approaches which seem useful: store the "home bucket" in unused bits in the offset (3 bits) which means we never have to rehash when we expand a bucket (unless moving a hash entry down a level), or store a single "happy where I am" bit (around half of entries in full groups are in exactly the right location, so knowing this saves us half our rehash calculations).  It depends on how many bits we want to spend.

But 15% average density still sucks.  If we increase the group size it slows down searches a little and requires more rehashing on expansion or more bits to indicate the home bucket.  For example, 32-entry groups using "create subtable where we want to insert the new entry" gets us 1.2% to 60% density.  Yet that means storing the "home bucket" would take 5 bits.

That minimum density is still a worry.  The obvious change is to reduce our subtable size: say 64, not 256 entries for sub-levels[2].  Of course, our tree becomes deeper, making us less efficient, so we don't really want to do that.

Is there anything else we can do to increase that minimum density?  The answer is yes, but it's not free: we can share subtables.  For example, when a group fills, we expand all group entries to point at the same (new) subtable.  When a group in that subtable fills, we split the subtable.  In effect, we expand a table sideways before we expand down a level.

In the simplest case, we split the shared subtable back into one-subtable-per-bucket immediately when any group fills.  Going back to our 256-entry hash tables with groups of 32, this gives a density range of 2.6% to 61%. This has raised the minimum density by a factor of 2.  But that expansion from a single shared subtable to 32 subtables is costing us, so I tried an alternative where instead  of unsharing immediately, we split into two.  So we start with all 32 buckets in the group pointing at the same subtable, then split so 16 buckets are pointing at each of two subtables, until eventually each one has a unique subtable, which then expands down.  The result is a density range from 13% to 61%.  This last result is ten times the minimum on our unshared 32-entry groups.

Of course, unsharing has costs as well: again, we need to rehash to figure out which of the subtables each entry belong in.  And again, doing it all at once is cheaper than doing it five times.  And once more, we can use 5 additional bits to say which entry of the group above this entry belongs to.  Once we're in an unshared subtable however, these bits are useless (and on average we will be in an unshared subtable half the time).  That's a fair cost for something which has marginal effect on our average density; it just helps our worst-case.

This entire decision comes down to expansion cost. This is not the same as insertion cost; not all insertions cause expansion, and if the system is in steady state (we don't ever shrink the hash) expansion is rare.  We need a model for our usage to know how often we expand; we also need to know how often we delete, insert and search (both successful and failing searches).

To wrap up, here is how we can use any extra bits in our hash entries:

Num Bits Name Benefits
1 Subtable Bit Bucket points to subtable, not an actual entry.
Any Extra Hash Bits Avoids lookups on hash collisions: each bit halves number of lookups. May avoid hash recalculation when pushing entries down into subtable.
1 Hash Truncated Bit Indicates fewer/no Extra Hash Bits are valid. Otherwise we always need to rehash when we push entries down a sublevel, to fill in the Extra Hash bits. This way it can be done as required.
log2(groupsize) Home Bucket Avoid rehashing on expanding a bucket to a subtable,
for entries not going down to that subtable.
Helps searching on collisions.
1 In Home Bucket Alternative to above: avoids rehashing about 50% of buckets, and helps searching on collisions.
log2(groupsize) Shared Bucket Indicate which subtable we belong to in a shared subtable. With one of the two above, avoids rehashing when splitting shared subtable. Helps searching to a lesser extent (only on shared subtable).
1 Deleted Bit Avoids reshuffle on delete, but may slow lookups a little.
1 Collision Bit Indicates we should keep searching. Speeds lookups.

Feedback welcome. If you want to play with my simulator, you can find it here.

[1] It also means that two identical hash values don't create an infinite set of subtables.  In practice, we need to use a linked list if we run out of hash bits anyway.  My test code has an abort() in this case.

[2] With a hash size of 64, our minimum density rises to 5.8%, maximum 42% with 8-entry groups (with 32-entry groups the density range is 6% to 66%).

We can use some bits in each entry to indicate what its "home bucket" would be, but that has a cost too.