## Back To Cacheline Tree Hash Tables

After my previous conclusion that doubling linear hash tables were good enough, it turns out that they don’t work for TDB: enlarging or moving the hash table in the presence of tdb_chainlock is not possible.Â  So, I ended up implementing expanding hash tables, and ran into a familiar set of problems.

The design is simple; we start with a simple hash table.Â  When it gets too full, we expand a bucket to point to a sub-hashtable, which we similarly expand to sub-tables when it fills.Â  Seems like a nice logarithmic expanding data structure that means if we use 8 bit hash tables (ie. 256Â  entries) we’ll fit well over a billion entries in a 4 level deep set of tables.

Yet as we discovered previously, the hash density is horrible.Â  To see why, imagine our entries hit the hash buckets in order: the first entry belongs in hash bucket 0, the next in hash bucket 1… it’s all good for the first 256 entries, and our single top-level hash table ends up with 100% occupancy.Â  However, the 257th entry collides in bucket 0, causing it to expand into a 256-entry subtable which only has two entries in it.Â  By the time we have 512 entries, we’ve got a top-level hash table full of pointers to 256 sub-hash tables; that’s a total of 68352 hash buckets holding a mere 512 entries.Â  That’s 0.75% density, or an overhead just over 1k per entry!

In practice our arrangement is not perfectly even, so we don’t get as low as 0.75%, but we certainly don’t get up to 100% either.Â  Our density wavers between a minimum of 1.3% and a maximum of 4.5% (I simulated this; if someone has pointers to a good “Hash Statistics For Beginners” course please tell me!).Â  Our average density sits roughly halfway between those two, so I’ll be quoting min and max values from here in.

So, how do we improve this?Â  The first thing I tried is to organize our hash into groups: we only expand when we fill an entire group.[1].Â  Say we use 8-entry groups; if your bucket is full, you try the next one in the group.Â  If all 8 are full, then you expand all the buckets into 8 new sub-tables.Â  At expense of a slightly longer search time (we now have to search up to 8 buckets at the bottom level, rather than just 1), this means we tend to be fuller before we expand.Â  This raises the maximum density to a respectable 25.3%, though the minimum drops to 0.6%.Â  Still, our average density has jumped from 3% to 13%.

So, what if we don’t expand all 8 buckets at once, but expand the most popular one in the group?Â  Our minimum density rises to 1.5%, and our maximum to 33%; closer to a 17% average.Â  Unfortunately calculating the most popular bucket means re-calculating the hash of all entries in the group.Â  With the expand-all approach, we needed to recalculate hashes for 8 entries to decide into which subtable each entry should be inserted.Â  With the expand-one approach, we calculate hashes for 8 entries, then 7 next time, then 6… 36 calculations (ie. O(N^2)).

We do have the hash value of the current entry: what if we simply expand the bucket that belongs in?Â  It may not be the fullest bucket, but we knowÂ  it’s not going to be empty.Â Â  This gives a close-to-optimal result, with densities ranging from 1.2% to 31%.

Now note that with all these approaches, expanding a bucket to point to a subtable means messing with a full group: some entries may go down into the subtable, others may move within group.Â  So we need to re-calculate the hashes anyway unless we store some extra information in the hash bucket somewhere.Â  For TDB2, we don’t really need 64 bits for entries: 64 exabytes should be enough for anyone, so the top 8 bits are free, and all records are 8-byte aligned so the bottom 3 bits are free.

There are two approaches which seem useful: store the “home bucket” in unused bits in the offset (3 bits) which means we never have to rehash when we expand a bucket (unless moving a hash entry down a level),Â or store a single “happy where I am” bit (around half of entries in full groups are in exactly the right location, so knowing this saves us half our rehash calculations).Â  It depends on how many bits we want to spend.

But 15% average density still sucks.Â  If we increase the group size it slows down searches a little and requires more rehashing on expansion or more bits to indicate the home bucket.Â  For example, 32-entry groups using “create subtable where we want to insert the new entry” gets us 1.2% to 60% density.Â  Yet that means storing the “home bucket” would take 5 bits.

That minimum density is still a worry.Â  The obvious change is to reduce our subtable size: say 64, not 256 entries for sub-levels[2].Â  Of course, our tree becomes deeper, making us less efficient, so we don’t really want to do that.

Is there anything else we can do to increase that minimum density?Â  The answer is yes, but it’s not free: we can share subtables.Â  For example, when a group fills, we expand all group entries to point at the same (new) subtable.Â  When a group in that subtable fills, we split the subtable.Â  In effect, we expand a table sideways before we expand down a level.

In the simplest case, we split the shared subtable back into one-subtable-per-bucket immediately when any group fills.Â  Going back to our 256-entry hash tables with groups of 32, this gives a density range of 2.6% to 61%. This has raised the minimum density by a factor of 2.Â  But that expansion from a single shared subtable to 32 subtables is costing us, so I tried an alternative where insteadÂ  of unsharing immediately, we split into two.Â  So we start with all 32 buckets in the group pointing at the same subtable, then split so 16 buckets are pointing at each of two subtables, until eventually each one has a unique subtable, which then expands down.Â  The result is a density range from 13% to 61%.Â  This last result is ten times the minimum on our unshared 32-entry groups.

Of course, unsharing has costs as well: again, we need to rehash to figure out which of the subtables each entry belong in.Â  And again, doing it all at once is cheaper than doing it five times.Â  And once more, we can use 5 additional bits to say which entry of the group above this entry belongs to.Â  Once we’re in an unshared subtable however, these bits are useless (and on average we will be in an unshared subtable half the time).Â  That’s a fair cost for something which has marginal effect on our average density; it just helps our worst-case.

This entire decision comes down to expansion cost. This is not the same as insertion cost; not all insertions cause expansion, and if the system is in steady state (we don’t ever shrink the hash) expansion is rare.Â  We need a model for our usage to know how often we expand; we also need to know how often we delete, insert and search (both successful and failing searches).

To wrap up, here is how we can use any extra bits in our hash entries:

Num Bits Name Benefits
1 Subtable Bit Bucket points to subtable, not an actual entry.
Any Extra Hash Bits Avoids lookups on hash collisions: each bit halves number of lookups. May avoid hash recalculation when pushing entries down into subtable.
1 Hash Truncated Bit Indicates fewer/no Extra Hash Bits are valid. Otherwise we always need to rehash when we push entries down a sublevel, to fill in the Extra Hash bits. This way it can be done as required.
log2(groupsize) Home Bucket Avoid rehashing on expanding a bucket to a subtable,
for entries not going down to that subtable.
Helps searching on collisions.
1 In Home Bucket Alternative to above: avoids rehashing about 50% of buckets, and helps searching on collisions.
log2(groupsize) Shared Bucket Indicate which subtable we belong to in a shared subtable. With one of the two above, avoids rehashing when splitting shared subtable. Helps searching to a lesser extent (only on shared subtable).
1 Deleted Bit Avoids reshuffle on delete, but may slow lookups a little.
1 Collision Bit Indicates we should keep searching. Speeds lookups.

Feedback welcome. If you want to play with my simulator, you can find it here.

[1] It also means that two identical hash values don’t create an infinite set of subtables.Â  In practice, we need to use a linked list if we run out of hash bits anyway.Â  My test code has an abort() in this case.

[2] With a hash size of 64, our minimum density rises to 5.8%, maximum 42% with 8-entry groups (with 32-entry groups the density range is 6% to 66%).

We can use some bits in each entry to indicate what its “home bucket” would be, but that has a cost too.

## fcntl lock starvation and TDB

The Trivial DataBase (ccan variant here) uses fcntl locks for consistency: records are chained off a fixed-size hash table (or the free list), and a 1-byte fcntl lock at the offset of the chain head protects all records in that chain.

There’s also a tdb_lockall() function which grabs a lock across all the hash chains at once to let you do arbitrary atomic munging.Â  This works because fcntl() locks have an offset and length: you can lock arbitrary byte ranges.

Unfortunately, tdb_lockall() is subject to starvation, at least under Linux.Â  This is because the kernel merely checks whether a lock is available and gets it if it can, rather than queuing behind someone else who wants a superset of the lock.Â  So single byte lockers come in and out while the whole-file locker waits for them all to go away.

I’m not sure this is wrong, and as it’s simpler than the alternative, I’m not prepared to change it just yet.Â  Fortunately, there’s a way of avoiding this starvation in userspace, which occurred independently to both Volker Lendecke and me.Â  I called this variant tdb_lockall_gradual(), in which we try to lock each chain one at a time so we compete with the single-chain lockers on fair terms.Â  My first naive thought was to try to lock all the chains one at a time in order, nonblocking, then go back and retry (blocking) any we failed to get.Â  This is slow, and can deadlock against another process doing the same thing.Â  Volker’s suggestion was much nicer: we do a non-blocking lock, and if that fails we divide and conquer.Â  If we get down to a single record, we do a blocking lock.

I wroteÂ  a test program which fires off N children, each of which grabs a random chain lock for 50-150 milliseconds before sleeping for 1 second, then repeating. The parent waits for a second, then tries to do a tdb_lockall() or tdb_lockall_gradual() depending on the commandline.Â  Running it five times and showing the average wait time for the lockall gives this:

Now, regarding performance.Â  While there can be 10000 hash chains, this isn’t as bad as it sounds.Â  The fast case is the uncontended one, and that’s as fast as before, and the contended case is already slow.Â  I annotated the source to print out how many blocking/nonblocking locks it’s actually doing.Â  Inevitably, if there’s contention, it will end up dividing down to a blocking lock, so log(numchains) locks before doing a blocking lock.

Processes Blocking locks Nonblocking locks Seconds
5 0-2 1-27 0.03
50 8-12 93-111 0.20
500 13-21 130-170 0.29
5000 309-347 1660-1832 9.1

Sure, that’s a lot of locking when we’re competing with 5000 processes, but it’s less the naive one per chain, and it’s clear that it’s not the cause of the slowdown (we’re doing fewer locks per second than the 5 processes case).

And anyway, locking the entire database cannot be a speed-critical operation.Â  Indeed, after the evidence here, I followed Volker’s suggestion to simply replace tdb_lockall() with the new implementation.

## Bob Jenkins and lookup8.c

I pulled Bob Jenkins’ superlative lookup3.c into ccan some time ago, and with the work on TDB2 (aka 64-bit TDB) I wondered if there was a 64-bit variant available.Â  I didn’t look hard enough: I should have checked his top-level page and seen lookup8.c before wasting his time with an email :(

I did note in my mail that since lookup3.c can return a pair of 32 bit numbers, I could combine those to get a 64 bit hash “but I wondered if it was possible to do better…”

His one-line reply was that he was working on a 128-bit hash now.Â  Respectful and informative.

The gold standard test of online manners is how you respond to a dumb random email.Â  A truly gracious gentleman/lady will respond as if you are far smarter far than you appear.Â  Thanks Bob.Â  And 128-bit; I’m a little stunned…

## On Barriers To Entry

My philosophy for Free/Open Source Software comes down to this: that others can take what I do and do something awesome with it.Â  Since I don’t know what you’ll need, I hand you every capability I have to get you started.Â  Others granted me that very power to get where I am, so I seek to increment that.

It’s not always simple to do: sometimes you build it, and nobody comes.Â  My experience is that clear code, convenient install, documentation and useful functionality all help.Â  An encouraging welcome helps even more, as does getting the software out in front of people.Â  It’s all about avoiding or lowering barriers.

We accept some barriers to modification which are reasonable: if you modify the software, I don’t have to support it.Â  If I make you sign a support deal which says I won’t support the software (even unmodified versions) if you modify or redistribute the software, that’s not reasonable.Â  If you get fired for publishing pre-release modified copyleft source code, that’s reasonable.Â  If you get fired for publishing post-release, that’s not.

The hardware and electricity costs are barriers, but they’re undiscriminating and reasonable, so we accept them. Even the GPL explicitly allows you to charge for costs to make a copy. The software cost of the system is also explicitly allowed as a barrier.Â  The software costs of the build environment are also accepted barriers (though historic for most of us): again even the GPL doesn’t require your software to be buildable with freely available tools.

As this shows, your choice of licensing is among your arsenal in keeping barriers low for co-hackers (who are not necessarily contributors!).Â  I believe copyright gives too much power to the copyright holder, but as copyright is so hard to unmake, I favor the GPL.Â  It tries to use legal power to meet the aims in the first paragraph: to force you to hand onwards all the pieces I handed to you.Â  It’s also well understood by people and that common understanding gels a community.

Yet, as I alluded at the top, there are a realm of barriers which licenses don’t even try to address: the code could be an unmodifiable tangle, the documentation awful, the installation awkward, or the trademarks invasive. A license can’t make coders welcome newcomers, be friendly to upstream, responsive to bugs, write useful code or speak english.

The spectrum of barriers goes roughly from “that’s cool” through “I’m not so comfortable with that” to “that’s not Free/Open Source”.Â  It’s entirely up to your definition of reasonableness;Â  only in the simplest cases will that be the same point at which the license is violated, even if that license is explicitly drafted to defend that freeness!

So, don’t start with an analysis of license clauses.Â  Start with “is that OK?”.Â  Is there a fundamental or only a cosmetic problem? Â  If it’s not OK, ask “does it matter?”.Â  Is it effecting many people, is it setting a bad example, is it harming the project’s future, is it causing consternation among existing developers?Â Â  If it is, then it’s time to look at the license to see if you can do anything.Â  Remember that the license is merely a piece of text.Â  It can’t stop anything, it can only give you leverage to do so.Â  It certainly can’t make using the law easy or convenient, or even worthwhile pursuing.

To close, I will leave the last word to Kim Weatherall, who once told me: “if you’re spending your time on legal issues, you’re doing it wrong”.

## Superfreakonomics; Superplug for Intellectual Ventures.

I enjoyed Levitt & Dubner’s “Freakonomics”, and picked up the followup “Superfreakonomis” recently at an airport.Â  The last chapter, however, was astonishing.Â  The entire chapter was devoted to a glowing advertisement for Intellectual Ventures, pointing out that they own 20,000 patents “more than all but a few dozen companies in the world”, but of course “there is little hard evidence” that they are patent trolls.

But this bunch of wacky genius billionaires have solved global warming (much of which they dispute anyway) and can control malaria and prevent hurricanes from forming.Â  Unlike the rest of the book which covers analysis of well-known facts and disputes them with insightful economic research, this chapter is so breathless and gushy that it makes me question the rest of the author’s work.

I first came across Intellectual Ventures when The Economist reversed their 100-year opposition to patents, and the only reason I could find was a similarly cheerleading piece about this company.Â  (I had naively expected new research revealing some net positive of patents, or some such revelation).

Side note: when a respected information source covers something where you have on-the-ground experience, the result is often to make you wonder how much fecal matter you’ve swallowed in areas outside your own expertise.

So, what is IV actually doing?Â  Buying up loads of patents and licensing them to companies who calculate it’s not worth the fight is patent trolling 101.Â  Yet the scale they’re operating on puts them on new ground, and opens new opportunities.Â  It seems obvious to get corporate investors on board by promising them immunity from patent claims.Â  With enough patents you stop trying to license them one-by-one and just tax each industry at some non-negotiable rate.Â  No doubt they have more tricks I haven’t even thought of, but these potential devices really do make them a new breed of Super Trolls.

Their efforts to actually attain their own patents could simply be more of the same, but it’s also a relatively cheap but clever PR exercise (as shown by their media treatment).Â  This will help them when (legislative?) efforts are made to shut down patent trolls.Â  I’m fairly confident that they’ll simply license rather than implement anything themselves; actually producing things requires much more work, and simply exposes you to others’ patents.

Without diving deeply into this, they seem to understand two things clearly:

1. They learnt from Microsoft that government-enforced monopolies are worth billions.Â  Microsoft had copyright on software, this is patents.
2. Development is getting much cheaper, while patents are getting more valuable.Â Â  Cheaper development is shown clearly by free software, open hardware and hackerspaces.Â  Patent value increases as more of the world becomes a more profitable and enforceable patent target.

Now, I don’t really care if one company leeches off the others.Â  But if they want to tax software, they have to attack free software otherwise people will switch to avoid their patent licensing costs.Â  And if you don’t believe some useful pieces of free software could be effectively banned due to patent violations, you don’t think on the same scale as these guys.

## LWN Quotes Of The Week

I have been petitioning Jon Corbet to introduce a “Supporter” level subscription to LWN; I think given his failure to keep up with inflation and my need to experience conferences vicariously I feel deeply indebted to them.

That lead to me looking through LWN Quote of The Week; there’s no neat index of these things.Â  And I’m not going to create one, but I did troll for my old quotes while procrastinating, and here’s my vanity list:

From the comments, it looks like I got the very first one.Â  Wow.Â  I’m sure there are missing ones (I was sure I got two weeks in a row once).

I’m behind quota for 2010 though; I was surprised my recent fight with Linus didn’t get in.Â  I’ll try harder…

## Typesafe callbacks in C (and gcc)

A classic pattern in C is to hand a generic callback function around which takes a “void *priv” pointer so the function can take arbitrary state (side note: a classic anti-pattern is not to do this, resulting in qsort being reimplemented in Samba so one can be provided!).

The problem with this pattern is that it breaks type safety completely, such as in the following example:

int register_callback(void (*callback)(void *priv), void *priv);
static void my_callback(void *_obj)
{
struct obj *obj = _obj;
...
}
...
register_callback(my_callback, &my_obj);

If I change the type of my_obj, there’s no compiler warning that I’m now handing my callback something it doesn’t expect.

Some time ago, after such a change bit me, I proposed a patch to lkml to rectify this, using a typesafe callback mechanism. It was a significant change, and the kernel tends to be lukewarm on safety issues so it went nowhere. But these thoughts did evolve into CCAN’s typesafe_cb module.

### The tricksiness…

More recently, I tried to use it in libctdb (the new async ctdb library I’ve been toying with), and discovered a fatal flaw. To understand the problem, you have to dive into how I implemented typesafe_cb. At its base is a conditional cast macro: cast_if_type(desttype, expr, oktype). If expr is of type “oktype”, cast it to “desttype”. On compilers which don’t support the fancy gcc builtins needed to do this, this just becomes an unconditional cast “(desttype)(expr)”. This allows us to do the following:

#define register_callback(func, priv) \
_register_callback(cast_if_type(void (*)(void *), (func), void (*)(typeof(priv)))

This says that we cast the func to the generic function type only if it exactly matches the private argument. The real typesafe_cb macro is more complex than this because it needs to ensure that priv is a pointer, but you get the idea.

Now, one great trick is that the callback function can take a “const” (or volatile) pointer of the priv type, and we let that work as well: we have a “cast_if_any” which extends “cast_if_type” to any of three types:

#define typesafe_cb(rtype, fn, arg) \
cast_if_any(rtype (*)(void *), (fn), \
rtype (*)(typeof(*arg)*), \
rtype (*)(const typeof(*arg)*), \
rtype (*)(volatile typeof(*arg)*))

### The flaw…

If your private arg is an undefined type, typeof (*arg) won’t work, and you need this to declare a const pointer to the same type. I have just filed a bug report, but meanwhile, I need a solution.

### The workarounds…

Rather than use cast_if_any, you can insert an explicit call to the callback to evoke a warning if the private arg doesn’t match, then just cast the callback function. This is in fact what I now do, with an additional test that the return type of the function exactly matches the expected return type. cast_if_type() now takes an extra argument, which is the type to test:

#define typesafe_cb(rtype, fn, arg) \
cast_if_type(rtype (*)(void *), (fn), (fn)(arg), rtype)

cast_if_type does a typeof() on (fn)(arg), which will cause a warning if the arg doesn’t match the function, and the cast_if_type will only cast (fn) if the return type matches rtype. You can’t test the return type using a normal test (eg. “rtype _test; sizeof(test = fn(arg));”) because implicit integer promotion makes this compile without a warning even if fn() returns a different integer type.

Unfortunately, the more general typesafe_cb_preargs() and typesafe_cb_postargs() macros lose out. These are like typesafe_cb but for callbacks which take extra arguments (the more common case).

/* This doesn't work: arg might be ptr to undefined struct. */
#define typesafe_cb_preargs(rtype, fn, arg, ...) \
cast_if_any(rtype (*)(__VA_ARGS__, void *), (fn), \
rtype (*)(__VA_ARGS__, typeof(arg)), \
rtype (*)(__VA_ARGS__, const typeof(*arg) *), \
rtype (*)(__VA_ARGS__, volatile typeof(*arg) *))

We can’t rely on testing an indirect call: we’d need example parameters to pass, and because they’d be promoted. The direct call might work fine but an indirect call via a different function signature fail spectacularly. We’re supposed to increase type safety, not reduce it!

We could force the caller to specify the type of the priv arg, eg. “register_callback(func, struct foo *, priv)”. But this strikes me as placing the burden in the wrong place, for an issue I hope will be resolved soonish. So for the moment, you can’t use const or volatile on callback functions:

/* This doesn't work: arg might be ptr to undefined struct. */
#define typesafe_cb_preargs(rtype, fn, arg, ...) \
cast_if_type(rtype (*)(__VA_ARGS__, void *), (fn), (fn), \
rtype (*)(__VA_ARGS__, typeof(arg)))

## More Realistic Hashing: Cache Sensitivity Part II

So my previous post showed a trick with stuffing extra bits into the pointers of a simple hash table.Â  In most real situations, we want the hash to still be useful as the number of elements increases: the simple hash I used will slow down dramatically as we pass 75% utilization.Â  In addition, we want to be able to delete entries, which can be slightly painful in a linear hash.

Since then, I’ve been trying a few different approaches: in fact, I spent quite a while down this rabbit hole before digging myself out.Â  Here are the approaches I benchmarked (on a 64-bit system):

1. Doubling: Linear hash with extra hash bits in the pointers, doubling and rehashing everything when it hits > 5/8 full.
2. Cacheline Tree: Divide the hash into 128-byte groups.Â  Do linear collision resolution within each group, and when a group fills explode it into 16 pointers to separate hash tables of N 128-byte groups.Â  When that fills, explode again, etc.Â  In practice, N was 2.
3. Cacheline Tree Then Linked:Â  Explode the top levelÂ like the above, but allocate a single cacheline (N=1) for the second level and just fill the second level from first bucket onwards.Â  If that fills, simply chain on another cacheline.Â  This actually freed up on delete as chains were emptied.
4. Cacheline Linked: Even simpler: when the toplevel cacheline fills, just chain another cacheline off it.Â  Each cacheline is simply filled in order, but we only put 15 pointers in each cacheline and use the extra 4 bits per entry for additional hash bits (to reduce comparisons).Â  This also freed them up on delete.
5. Chaining: Double linked list through each object: when a bucket fills we simply append the new entry to the existing ones.

Then I benchmarked a 2M-bucket (actually 2097152) hashtable of each type, increasing the number of entries by a factor of 2 at a time from 1M.Â  The good thing about visiting the Canberra office is I could use one of theÂ  older Power machines here with gobs of memory.

The first three tests were very similar to the last post, except I used an array of scrambled pointers so I wasn’t accessing the objects in memory order.Â  The last two are new:

1. Insert all the objects into the hash.
2. Find all the objects in the hash.
3. Do the same number of missed searches in the hash.
4. Delete half the objects, then time reinserting them (steady-state test which avoids any allocation penalties for expansion).
5. Delete every object from the hash.

## Insert

Our simple link of cachelines (CLT Linked) gets so slow I took it out of the benchmark after 8x overload. Chained is the winner: we expect it to modify the inserted entry, the hash bucket and the current top entry for a total of three cache misses (and the object being inserted would generally be hot). It also never needs to (re)alloc or rehash.

We can see the CLT-then-linked case is getting worse as we fall into the chain and have to linear search for the entry to place into; there are 16 times as many of these chains as in the dumb “link of cache lines” case, because we have a link off each pointer off each toplevel hash entry, rather than one off each cacheline.

If we look at reinserts, we can remove the realloc and rehash overhead:

In steady-state we can see the others are more competitive, though even our flat linear hash loses a little as it scans for the next NULL bucket.

## Memory Usage

The optimal reasonable case would be one pointer per object; 8 bytes. The two linked cases quickly approach this (at which point they suck). The chaining cases uses two pointers; doubling uses 2 pointers for these values too (it can do slightly better just before it doubles, and worse just afterwards).

The real story is the cacheline tree. It hits a tipping point as buckets fill: that 512-byte second level off each top level pointer means we can spend 131 bytes per object just after we fill the top level. At 8 million objects in our 1 million buckets we are in fact spending 131 bytes per object (vs 17 for the chained case).

This is why I reduced the second level hash size to two cachelines; and we still get that factor of 16 explosion on each expansion. In fact, the chaining hashes were an attempt to find a “good enough” variant which didn’t explode like this. The chaining variant (CLT then linked) explodes once, to 65 bytes per entry, but then expands linearly.

## Lookups

This is where the cache damage of chaining becomes apparent. Doubling performs best, and the tree is close behind. Even a linked list of cachelines beats chaining: with 15 entries per cacheline on average 13 of those will not need to be examined (since the lower pointer bits where we’ve stashed another 3 hash bits will only accidentally match for two of them). This means 1/5 of the cache misses of the chained case for really large overload values.

By 64 entries per bucket (128M objects) chaining is 16 times slower than doubling for a successful lookup, and 20 times slower for an unsuccessful one. Even at 8 per bucket, it matches our terrible chained case at being 3 times slower than the linear case.

Since I assume most hashes are lookup heavy, this result should be weighted more than add or delete.

## Delete

Chained delete is cheap: you might get three cache misses (one for the object you’re deleting, and one for the next and prev objects in the linked list) but you don’t need to do a hash lookup at all. The linear hash has a cache miss to cache the object, then a cache miss to do the delete, but then it has to rehash any objects in immediately following hash buckets. It’s 2.2 times slower at 64x overload. We could use a special value to avoid this.

Cacheline tree does OK, but still needs to hash and similarly reseat any “runs” within the bottom cacheline. At three levels, add cache misses for the second level and third levels. 3.1 times slower than chained at 64x overload.

## Conclusion

I’d love to spend more time refining these results. In particular, there are ways to steal more bits to store extra hashing, ways to avoid reseating entries on linear deletes, ways to avoid rehashing to double the linear case. Compression of pointers is also possible, but it gets messy since there are a no longer a fixed number of entries per cacheline. I might get time for some of this, but at some point you have to get on with less abstract work!

But the takeaway is this: it’s hard to beat the simplicity of a big hash array which doubles and stashes a few extra hash bits. If your workload is add/delete heavy, use chaining. If you are worried about unknown work sizes, use doubling. Or do what Perl does, and do both (though you can’t use the extra hash bit stowing with chaining).

The cache tree idea is not best at anything, unless you can’t double for some reason (the Linux kernel probably has this issue, and TDB may or may not). Continue reading “More Realistic Hashing: Cache Sensitivity Part II”

## Hash Tables: In A Cache-Sensitive World? (FIXED)

(FIXED: my “NoMatched” bench was bogus, and actually the same as “Matched” case. Graphs changed, but conclusions the same.)
Everyone loves hash tables; certainly the networking code in the kernel uses them heavily and I do too.Â  But with more work on TDB (which is basically a hash table in a file), I’ve been re-measuring some of the fundamentals.

We all know that hash tables are cache-unfriendly; a good hash table means you’re hitting a “random” bucket every time. Â  Back in my university days this was never mentioned (caches were new and fancy things); the optimal hash table was seen as a prime-numbered sequence of buckets using a primary hash then a secondary hash to jump to the next bucket if this was full.Â  This gives fewer lookups than a naive linear search “if this bucket is full, go to the next one”, but does this override the cache unfriendliness of the approach?

These days it’s normal to use a fast universal hash and a power-of-two for the hash table size (Perl 5 and the Linux kernel do this, though Perl uses another weaker hash also by Jenkins).Â  Also, we deal with hash collisions by chaining off entries rather than using other hash buckets.Â  This has two main advantages: (1) it doesn’t barf when the hash table fills up, and (2) you can easily delete an entry from the hash table.Â  Unfortunately, this is pretty cache-unfriendly too, compared to a linear bucket fallback, but can we measure it?

To this end, I wrote a simple example program which filled each kind of hash table (linear hash bucket, secondary hash bucket and chained hash buckets) to 50%, then ran successful lookups, then ran failed lookups.Â  64 bit machine, and I chose 2^27 as my hash size giving a 1G hash table.Â  While most hashes are not this large, it’s useful to approximate a situation where other things are also using the cache.

This result surprised me, so I fiddled a bit and the basic shape remained; the differences vanished if the table was (say) only 10% full, but even at 25% we see chaining win. Here it’s about 15% faster on matched lookups (5% faster on misses).

What does this mean? It looks like linear edges over secondary for insertion, but for lookup it’s a wash.Â  But both are overshadowed by chaining, which doesn’t penalize other buckets when there’s a hash clash.Â  Despite the fact that objects are one pointer larger in the chaining case, it still wins (the test code uses two 64-bit values as the basic object).

## But Now For The Twist…

The reason I was looking at old-school non-chained hashing schemes was because I have a new trick for them: stashing some hash bits in the bottom bits of the pointer.Â  This allows you to skip the dereference 7/8 times (assuming 64-bit-aligned pointers).

So, we add that trick to the linear case:

Now we can see the potential of not chaining through hash entries! It’s a full 25% faster than the normal linear case, and 10% faster than chaining (almost 20% faster in the missed case). Of course, for real life we need to handle delete and hash table filling. And that’s the subject of my next post, once I’ve finished coding it!

## Linux Users of Victoria: Tuesday 6th April

Several months ago Donna Benjamin convinced me to speak at LUV, and now the time is approaching.Â  My intent is to give a talk about bits and pieces, and rely on questions to see what the audience is interested in.Â  Pretty casual and should be heaps of fun if you’re nearby.

I will also be taking the chance to drop by Lonely Planet; one of the places we stayed in Lithuania had seen an influx of foreign tourists since being featured in their guide, so they forced a note and a bottle of their vodka on us to take to LP.Â  Naturally, we Australians all know each other, so I couldn’t turn them down :)