Bob Jenkins and lookup8.c

I pulled Bob Jenkins’ superlative lookup3.c into ccan some time ago, and with the work on TDB2 (aka 64-bit TDB) I wondered if there was a 64-bit variant available.  I didn’t look hard enough: I should have checked his top-level page and seen lookup8.c before wasting his time with an email :(

I did note in my mail that since lookup3.c can return a pair of 32 bit numbers, I could combine those to get a 64 bit hash “but I wondered if it was possible to do better…”

His one-line reply was that he was working on a 128-bit hash now.  Respectful and informative.

The gold standard test of online manners is how you respond to a dumb random email.  A truly gracious gentleman/lady will respond as if you are far smarter far than you appear.  Thanks Bob.  And 128-bit; I’m a little stunned…

On Barriers To Entry

My philosophy for Free/Open Source Software comes down to this: that others can take what I do and do something awesome with it.  Since I don’t know what you’ll need, I hand you every capability I have to get you started.  Others granted me that very power to get where I am, so I seek to increment that.

It’s not always simple to do: sometimes you build it, and nobody comes.  My experience is that clear code, convenient install, documentation and useful functionality all help.  An encouraging welcome helps even more, as does getting the software out in front of people.  It’s all about avoiding or lowering barriers.

We accept some barriers to modification which are reasonable: if you modify the software, I don’t have to support it.  If I make you sign a support deal which says I won’t support the software (even unmodified versions) if you modify or redistribute the software, that’s not reasonable.  If you get fired for publishing pre-release modified copyleft source code, that’s reasonable.  If you get fired for publishing post-release, that’s not.

The hardware and electricity costs are barriers, but they’re undiscriminating and reasonable, so we accept them. Even the GPL explicitly allows you to charge for costs to make a copy. The software cost of the system is also explicitly allowed as a barrier.  The software costs of the build environment are also accepted barriers (though historic for most of us): again even the GPL doesn’t require your software to be buildable with freely available tools.

As this shows, your choice of licensing is among your arsenal in keeping barriers low for co-hackers (who are not necessarily contributors!).  I believe copyright gives too much power to the copyright holder, but as copyright is so hard to unmake, I favor the GPL.  It tries to use legal power to meet the aims in the first paragraph: to force you to hand onwards all the pieces I handed to you.  It’s also well understood by people and that common understanding gels a community.

Yet, as I alluded at the top, there are a realm of barriers which licenses don’t even try to address: the code could be an unmodifiable tangle, the documentation awful, the installation awkward, or the trademarks invasive. A license can’t make coders welcome newcomers, be friendly to upstream, responsive to bugs, write useful code or speak english.

The spectrum of barriers goes roughly from “that’s cool” through “I’m not so comfortable with that” to “that’s not Free/Open Source”.  It’s entirely up to your definition of reasonableness;  only in the simplest cases will that be the same point at which the license is violated, even if that license is explicitly drafted to defend that freeness!

So, don’t start with an analysis of license clauses.  Start with “is that OK?”.  Is there a fundamental or only a cosmetic problem?   If it’s not OK, ask “does it matter?”.  Is it effecting many people, is it setting a bad example, is it harming the project’s future, is it causing consternation among existing developers?   If it is, then it’s time to look at the license to see if you can do anything.  Remember that the license is merely a piece of text.  It can’t stop anything, it can only give you leverage to do so.  It certainly can’t make using the law easy or convenient, or even worthwhile pursuing.

To close, I will leave the last word to Kim Weatherall, who once told me: “if you’re spending your time on legal issues, you’re doing it wrong”.

Superfreakonomics; Superplug for Intellectual Ventures.

I enjoyed Levitt & Dubner’s “Freakonomics”, and picked up the followup “Superfreakonomis” recently at an airport.  The last chapter, however, was astonishing.  The entire chapter was devoted to a glowing advertisement for Intellectual Ventures, pointing out that they own 20,000 patents “more than all but a few dozen companies in the world”, but of course “there is little hard evidence” that they are patent trolls.

But this bunch of wacky genius billionaires have solved global warming (much of which they dispute anyway) and can control malaria and prevent hurricanes from forming.  Unlike the rest of the book which covers analysis of well-known facts and disputes them with insightful economic research, this chapter is so breathless and gushy that it makes me question the rest of the author’s work.

I first came across Intellectual Ventures when The Economist reversed their 100-year opposition to patents, and the only reason I could find was a similarly cheerleading piece about this company.  (I had naively expected new research revealing some net positive of patents, or some such revelation).

Side note: when a respected information source covers something where you have on-the-ground experience, the result is often to make you wonder how much fecal matter you’ve swallowed in areas outside your own expertise.

So, what is IV actually doing?  Buying up loads of patents and licensing them to companies who calculate it’s not worth the fight is patent trolling 101.  Yet the scale they’re operating on puts them on new ground, and opens new opportunities.  It seems obvious to get corporate investors on board by promising them immunity from patent claims.  With enough patents you stop trying to license them one-by-one and just tax each industry at some non-negotiable rate.  No doubt they have more tricks I haven’t even thought of, but these potential devices really do make them a new breed of Super Trolls.

Their efforts to actually attain their own patents could simply be more of the same, but it’s also a relatively cheap but clever PR exercise (as shown by their media treatment).  This will help them when (legislative?) efforts are made to shut down patent trolls.  I’m fairly confident that they’ll simply license rather than implement anything themselves; actually producing things requires much more work, and simply exposes you to others’ patents.

Without diving deeply into this, they seem to understand two things clearly:

  1. They learnt from Microsoft that government-enforced monopolies are worth billions.  Microsoft had copyright on software, this is patents.
  2. Development is getting much cheaper, while patents are getting more valuable.   Cheaper development is shown clearly by free software, open hardware and hackerspaces.  Patent value increases as more of the world becomes a more profitable and enforceable patent target.

Now, I don’t really care if one company leeches off the others.  But if they want to tax software, they have to attack free software otherwise people will switch to avoid their patent licensing costs.  And if you don’t believe some useful pieces of free software could be effectively banned due to patent violations, you don’t think on the same scale as these guys.

LWN Quotes Of The Week

I have been petitioning Jon Corbet to introduce a “Supporter” level subscription to LWN; I think given his failure to keep up with inflation and my need to experience conferences vicariously I feel deeply indebted to them.

That lead to me looking through LWN Quote of The Week; there’s no neat index of these things.  And I’m not going to create one, but I did troll for my old quotes while procrastinating, and here’s my vanity list:

  1. January 21, 2004
  2. March 14, 2007
  3. March 28, 2007
  4. March 5, 2008
  5. May 7, 2008
  6. December 10, 2008
  7. January 28, 2009
  8. February 11, 2009
  9. April 22, 2009
  10. May 26, 2010

From the comments, it looks like I got the very first one.  Wow.  I’m sure there are missing ones (I was sure I got two weeks in a row once).

I’m behind quota for 2010 though; I was surprised my recent fight with Linus didn’t get in.  I’ll try harder…

Typesafe callbacks in C (and gcc)

A classic pattern in C is to hand a generic callback function around which takes a “void *priv” pointer so the function can take arbitrary state (side note: a classic anti-pattern is not to do this, resulting in qsort being reimplemented in Samba so one can be provided!).

The problem with this pattern is that it breaks type safety completely, such as in the following example:

int register_callback(void (*callback)(void *priv), void *priv);
static void my_callback(void *_obj)
struct obj *obj = _obj;
register_callback(my_callback, &my_obj);

If I change the type of my_obj, there’s no compiler warning that I’m now handing my callback something it doesn’t expect.

Some time ago, after such a change bit me, I proposed a patch to lkml to rectify this, using a typesafe callback mechanism. It was a significant change, and the kernel tends to be lukewarm on safety issues so it went nowhere. But these thoughts did evolve into CCAN’s typesafe_cb module.

The tricksiness…

More recently, I tried to use it in libctdb (the new async ctdb library I’ve been toying with), and discovered a fatal flaw. To understand the problem, you have to dive into how I implemented typesafe_cb. At its base is a conditional cast macro: cast_if_type(desttype, expr, oktype). If expr is of type “oktype”, cast it to “desttype”. On compilers which don’t support the fancy gcc builtins needed to do this, this just becomes an unconditional cast “(desttype)(expr)”. This allows us to do the following:

#define register_callback(func, priv) \
_register_callback(cast_if_type(void (*)(void *), (func), void (*)(typeof(priv)))

This says that we cast the func to the generic function type only if it exactly matches the private argument. The real typesafe_cb macro is more complex than this because it needs to ensure that priv is a pointer, but you get the idea.

Now, one great trick is that the callback function can take a “const” (or volatile) pointer of the priv type, and we let that work as well: we have a “cast_if_any” which extends “cast_if_type” to any of three types:

#define typesafe_cb(rtype, fn, arg) \
cast_if_any(rtype (*)(void *), (fn), \
rtype (*)(typeof(*arg)*), \
rtype (*)(const typeof(*arg)*), \
rtype (*)(volatile typeof(*arg)*))

The flaw…

If your private arg is an undefined type, typeof (*arg) won’t work, and you need this to declare a const pointer to the same type. I have just filed a bug report, but meanwhile, I need a solution.

The workarounds…

Rather than use cast_if_any, you can insert an explicit call to the callback to evoke a warning if the private arg doesn’t match, then just cast the callback function. This is in fact what I now do, with an additional test that the return type of the function exactly matches the expected return type. cast_if_type() now takes an extra argument, which is the type to test:

#define typesafe_cb(rtype, fn, arg) \
cast_if_type(rtype (*)(void *), (fn), (fn)(arg), rtype)

cast_if_type does a typeof() on (fn)(arg), which will cause a warning if the arg doesn’t match the function, and the cast_if_type will only cast (fn) if the return type matches rtype. You can’t test the return type using a normal test (eg. “rtype _test; sizeof(test = fn(arg));”) because implicit integer promotion makes this compile without a warning even if fn() returns a different integer type.

Unfortunately, the more general typesafe_cb_preargs() and typesafe_cb_postargs() macros lose out. These are like typesafe_cb but for callbacks which take extra arguments (the more common case).

/* This doesn't work: arg might be ptr to undefined struct. */
#define typesafe_cb_preargs(rtype, fn, arg, ...) \
cast_if_any(rtype (*)(__VA_ARGS__, void *), (fn), \
rtype (*)(__VA_ARGS__, typeof(arg)), \
rtype (*)(__VA_ARGS__, const typeof(*arg) *), \
rtype (*)(__VA_ARGS__, volatile typeof(*arg) *))

We can’t rely on testing an indirect call: we’d need example parameters to pass, and because they’d be promoted. The direct call might work fine but an indirect call via a different function signature fail spectacularly. We’re supposed to increase type safety, not reduce it!

We could force the caller to specify the type of the priv arg, eg. “register_callback(func, struct foo *, priv)”. But this strikes me as placing the burden in the wrong place, for an issue I hope will be resolved soonish. So for the moment, you can’t use const or volatile on callback functions:

/* This doesn't work: arg might be ptr to undefined struct. */
#define typesafe_cb_preargs(rtype, fn, arg, ...) \
cast_if_type(rtype (*)(__VA_ARGS__, void *), (fn), (fn), \
rtype (*)(__VA_ARGS__, typeof(arg)))

More Realistic Hashing: Cache Sensitivity Part II

So my previous post showed a trick with stuffing extra bits into the pointers of a simple hash table.  In most real situations, we want the hash to still be useful as the number of elements increases: the simple hash I used will slow down dramatically as we pass 75% utilization.  In addition, we want to be able to delete entries, which can be slightly painful in a linear hash.

Since then, I’ve been trying a few different approaches: in fact, I spent quite a while down this rabbit hole before digging myself out.  Here are the approaches I benchmarked (on a 64-bit system):

  1. Doubling: Linear hash with extra hash bits in the pointers, doubling and rehashing everything when it hits > 5/8 full.
  2. Cacheline Tree: Divide the hash into 128-byte groups.  Do linear collision resolution within each group, and when a group fills explode it into 16 pointers to separate hash tables of N 128-byte groups.  When that fills, explode again, etc.  In practice, N was 2.
  3. Cacheline Tree Then Linked:  Explode the top level like the above, but allocate a single cacheline (N=1) for the second level and just fill the second level from first bucket onwards.  If that fills, simply chain on another cacheline.  This actually freed up on delete as chains were emptied.
  4. Cacheline Linked: Even simpler: when the toplevel cacheline fills, just chain another cacheline off it.  Each cacheline is simply filled in order, but we only put 15 pointers in each cacheline and use the extra 4 bits per entry for additional hash bits (to reduce comparisons).  This also freed them up on delete.
  5. Chaining: Double linked list through each object: when a bucket fills we simply append the new entry to the existing ones.

Then I benchmarked a 2M-bucket (actually 2097152) hashtable of each type, increasing the number of entries by a factor of 2 at a time from 1M.  The good thing about visiting the Canberra office is I could use one of the  older Power machines here with gobs of memory.

The first three tests were very similar to the last post, except I used an array of scrambled pointers so I wasn’t accessing the objects in memory order.  The last two are new:

  1. Insert all the objects into the hash.
  2. Find all the objects in the hash.
  3. Do the same number of missed searches in the hash.
  4. Delete half the objects, then time reinserting them (steady-state test which avoids any allocation penalties for expansion).
  5. Delete every object from the hash.


Our simple link of cachelines (CLT Linked) gets so slow I took it out of the benchmark after 8x overload. Chained is the winner: we expect it to modify the inserted entry, the hash bucket and the current top entry for a total of three cache misses (and the object being inserted would generally be hot). It also never needs to (re)alloc or rehash.

We can see the CLT-then-linked case is getting worse as we fall into the chain and have to linear search for the entry to place into; there are 16 times as many of these chains as in the dumb “link of cache lines” case, because we have a link off each pointer off each toplevel hash entry, rather than one off each cacheline.

If we look at reinserts, we can remove the realloc and rehash overhead:

In steady-state we can see the others are more competitive, though even our flat linear hash loses a little as it scans for the next NULL bucket.

Memory Usage

The optimal reasonable case would be one pointer per object; 8 bytes. The two linked cases quickly approach this (at which point they suck). The chaining cases uses two pointers; doubling uses 2 pointers for these values too (it can do slightly better just before it doubles, and worse just afterwards).

The real story is the cacheline tree. It hits a tipping point as buckets fill: that 512-byte second level off each top level pointer means we can spend 131 bytes per object just after we fill the top level. At 8 million objects in our 1 million buckets we are in fact spending 131 bytes per object (vs 17 for the chained case).

This is why I reduced the second level hash size to two cachelines; and we still get that factor of 16 explosion on each expansion. In fact, the chaining hashes were an attempt to find a “good enough” variant which didn’t explode like this. The chaining variant (CLT then linked) explodes once, to 65 bytes per entry, but then expands linearly.


This is where the cache damage of chaining becomes apparent. Doubling performs best, and the tree is close behind. Even a linked list of cachelines beats chaining: with 15 entries per cacheline on average 13 of those will not need to be examined (since the lower pointer bits where we’ve stashed another 3 hash bits will only accidentally match for two of them). This means 1/5 of the cache misses of the chained case for really large overload values.

By 64 entries per bucket (128M objects) chaining is 16 times slower than doubling for a successful lookup, and 20 times slower for an unsuccessful one. Even at 8 per bucket, it matches our terrible chained case at being 3 times slower than the linear case.

Since I assume most hashes are lookup heavy, this result should be weighted more than add or delete.


Chained delete is cheap: you might get three cache misses (one for the object you’re deleting, and one for the next and prev objects in the linked list) but you don’t need to do a hash lookup at all. The linear hash has a cache miss to cache the object, then a cache miss to do the delete, but then it has to rehash any objects in immediately following hash buckets. It’s 2.2 times slower at 64x overload. We could use a special value to avoid this.

Cacheline tree does OK, but still needs to hash and similarly reseat any “runs” within the bottom cacheline. At three levels, add cache misses for the second level and third levels. 3.1 times slower than chained at 64x overload.


I’d love to spend more time refining these results. In particular, there are ways to steal more bits to store extra hashing, ways to avoid reseating entries on linear deletes, ways to avoid rehashing to double the linear case. Compression of pointers is also possible, but it gets messy since there are a no longer a fixed number of entries per cacheline. I might get time for some of this, but at some point you have to get on with less abstract work!

But the takeaway is this: it’s hard to beat the simplicity of a big hash array which doubles and stashes a few extra hash bits. If your workload is add/delete heavy, use chaining. If you are worried about unknown work sizes, use doubling. Or do what Perl does, and do both (though you can’t use the extra hash bit stowing with chaining).

The cache tree idea is not best at anything, unless you can’t double for some reason (the Linux kernel probably has this issue, and TDB may or may not). Continue reading “More Realistic Hashing: Cache Sensitivity Part II”

Hash Tables: In A Cache-Sensitive World? (FIXED)

(FIXED: my “NoMatched” bench was bogus, and actually the same as “Matched” case. Graphs changed, but conclusions the same.)
Everyone loves hash tables; certainly the networking code in the kernel uses them heavily and I do too.  But with more work on TDB (which is basically a hash table in a file), I’ve been re-measuring some of the fundamentals.

We all know that hash tables are cache-unfriendly; a good hash table means you’re hitting a “random” bucket every time.   Back in my university days this was never mentioned (caches were new and fancy things); the optimal hash table was seen as a prime-numbered sequence of buckets using a primary hash then a secondary hash to jump to the next bucket if this was full.  This gives fewer lookups than a naive linear search “if this bucket is full, go to the next one”, but does this override the cache unfriendliness of the approach?

These days it’s normal to use a fast universal hash and a power-of-two for the hash table size (Perl 5 and the Linux kernel do this, though Perl uses another weaker hash also by Jenkins).  Also, we deal with hash collisions by chaining off entries rather than using other hash buckets.  This has two main advantages: (1) it doesn’t barf when the hash table fills up, and (2) you can easily delete an entry from the hash table.  Unfortunately, this is pretty cache-unfriendly too, compared to a linear bucket fallback, but can we measure it?

To this end, I wrote a simple example program which filled each kind of hash table (linear hash bucket, secondary hash bucket and chained hash buckets) to 50%, then ran successful lookups, then ran failed lookups.  64 bit machine, and I chose 2^27 as my hash size giving a 1G hash table.  While most hashes are not this large, it’s useful to approximate a situation where other things are also using the cache.

This result surprised me, so I fiddled a bit and the basic shape remained; the differences vanished if the table was (say) only 10% full, but even at 25% we see chaining win. Here it’s about 15% faster on matched lookups (5% faster on misses).

What does this mean? It looks like linear edges over secondary for insertion, but for lookup it’s a wash.  But both are overshadowed by chaining, which doesn’t penalize other buckets when there’s a hash clash.  Despite the fact that objects are one pointer larger in the chaining case, it still wins (the test code uses two 64-bit values as the basic object).

But Now For The Twist…

The reason I was looking at old-school non-chained hashing schemes was because I have a new trick for them: stashing some hash bits in the bottom bits of the pointer.  This allows you to skip the dereference 7/8 times (assuming 64-bit-aligned pointers).

So, we add that trick to the linear case:

Now we can see the potential of not chaining through hash entries! It’s a full 25% faster than the normal linear case, and 10% faster than chaining (almost 20% faster in the missed case). Of course, for real life we need to handle delete and hash table filling. And that’s the subject of my next post, once I’ve finished coding it!

Linux Users of Victoria: Tuesday 6th April

Several months ago Donna Benjamin convinced me to speak at LUV, and now the time is approaching.  My intent is to give a talk about bits and pieces, and rely on questions to see what the audience is interested in.  Pretty casual and should be heaps of fun if you’re nearby.

I will also be taking the chance to drop by Lonely Planet; one of the places we stayed in Lithuania had seen an influx of foreign tourists since being featured in their guide, so they forced a note and a bottle of their vodka on us to take to LP.  Naturally, we Australians all know each other, so I couldn’t turn them down :)

Followup: lrzip

Mikael noted in my previous post that Con Kolivas’s lrzip is another interesting compressor.  In fact, Con has already done a simple 64-bit enhance of rzip for lrzip, and on our example file it gets 56M vs 55M for xz (lrzip in raw mode, followed by xz, gives 100k worse than just using lrzip: lrzip already uses lzma).

Assuming no bugs in rzip, the takeaway here is simple: rzip should not attempt to find matches within the range that the backend compressor (900k for bzip2 in rzip, 32k for gzip, megabytes for LZMA as used by lrzip).  The backend compressor will do a better job (as shown by similar results with lrzip when I increase the hash array size so it finds more matches: the resulting file is larger).

The rzip algorithm is good at finding matches over huge distances, and that is what it should stick to.  Huge here == size of file (rzip does not stream, for this reason).  And this implies only worrying about large matches over huge distances (the current 32 byte minimum is probably too small).  The current version of rzip uses an mmap window so it never has to seek, but this window is artificially limited to 900MB (or 60% of mem in lrzip).   If we carefully limit the number of comparisons with previous parts of the file, we may be able to reduce them to the point where we don’t confuse the readahead algorithms and thus get nice performance (fadvise may help here too) whether we are mmaped or seeking.

I like the idea that rzip should scale with the size of the file being compressed, not make assumptions about today’s memory sizes.  Though some kind of thrash detection using mincore might be necessary to avoid killing our dumb mm systems :(