I spent some of Friday and yesterday working on my CCAN hashtable module (renamed to htable); I wrote a set of benchmarks and ran them until my laptop steamed and I had burnished the implementation into a sleek shining bundle of awesome.

Then it was time to run the same benchmark on Judy Arrays, code I had only become aware of via Yenga's comment on my blog.  A few weeks back I wrote some friendly wrappers for libJudy, since I found the raw interface made my eyes bleed.  Fortunately, that proved quite simple, but it's worth making a note for those who (like me three weeks ago) aren't familiar with Judy Arrays.  The theory is that different data structures should be used for different key layouts: dense or sparse, many or few, always concentrating on reducing cache impact.  The result is a serious amount of code for all the different cases: a risky, full-frontal assault on the problem, rather than the normal cautious approach of using a single data structure and ignoring or special-casing a few corner cases.
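For those who haven't seen it, the raw JudyL interface is all macros that hand you a pointer to the value slot; roughly like this (a sketch based on the Judy(3) manpages, not my wrapper code):

    #include <Judy.h>
    #include <assert.h>

    int main(void)
    {
            Pvoid_t judy = NULL;    /* an empty JudyL array */
            Word_t *pvalue, bytes;
            int rc;

            /* Insert key 42: JLI hands back a pointer to the value slot. */
            JLI(pvalue, judy, 42);
            *pvalue = (Word_t)0xcafe;

            /* Look up key 42: pvalue is NULL if the key is absent. */
            JLG(pvalue, judy, 42);
            assert(pvalue && *pvalue == (Word_t)0xcafe);

            /* Delete key 42: rc is 1 if the key was present. */
            JLD(rc, judy, 42);
            assert(rc == 1);

            /* Free the whole array; bytes gets the amount freed. */
            JLFA(bytes, judy);
            return 0;
    }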

Anyway, my initial benchmark was designed to push hashtables pretty hard, particularly as I used delete markers in my original implementation.  Delete markers tend to make performance suck after a great deal of churn, so I churn the hashtable in the middle of the benchmark.  The benchmark originally used consecutive keys, but I switched to spacing them out more after consecutive keys proved to be incredibly fast with Judy (which is ordered, unlike a hashtable).

I compared the JudyL implementation with my htable implementation; in fact I ran the latter at 75% full (dense), 37% full (sparse), and 37% full with bitstealing disabled (dumb).  75% is high for a typical hashtable, but my implementation seems quite happy operating at that point.  I was interested in mapping an int to an object: for the hash table, the object contained the int key as well as a payload pointer (pointing to itself, for consistency checking).  For Judy, the object contained only a payload pointer; the Judy array itself holds the key.  In both cases I used 50,000,000 objects, and divided the total time by the number of objects to give nanoseconds per operation.
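Concretely, the two per-object layouts look something like this (a sketch; the struct and field names are mine, not the benchmark's):

    #include <stdint.h>

    /* Hashtable case: the key lives inside the object, along with a
     * payload pointer aimed back at the object itself so lookups can be
     * consistency-checked. */
    struct hash_object {
            uint64_t key;
            void *payload;
    };

    /* Judy case: the JudyL array stores the key, so the object only
     * needs the payload pointer. */
    struct judy_object {
            void *payload;
    };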

The Good: Ordered, Dense Keys

First, the things Judy is good at: inserting keys 0 to 49,999,999 in order, looking them up in order, then looking up keys 50,000,000 to 99,999,999 (all misses) in order.  Note that to be fair, you have to add about 150 and 300M to the hashtable memory figures to get peak usage, since growing the table means allocating one twice the size, copying everything across, then freeing the old one.
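A minimal sketch of that grow step (illustrative only, not the actual ccan/htable code) shows why both tables are briefly live at once:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    struct table {
            size_t num_buckets;
            uintptr_t *buckets;     /* 0 means an empty bucket */
    };

    /* Allocate a table twice the size, rehash everything across, then
     * free the old one: peak memory is old table + new table. */
    static bool table_grow(struct table *t, size_t (*rehash)(uintptr_t))
    {
            size_t i, new_n = t->num_buckets * 2;
            uintptr_t *new_buckets = calloc(new_n, sizeof(*new_buckets));

            if (!new_buckets)
                    return false;
            for (i = 0; i < t->num_buckets; i++) {
                    if (!t->buckets[i])
                            continue;
                    /* naive reinsert: linear probe into the new array */
                    size_t j = rehash(t->buckets[i]) % new_n;
                    while (new_buckets[j])
                            j = (j + 1) % new_n;
                    new_buckets[j] = t->buckets[i];
            }
            free(t->buckets);       /* the old table is only freed here */
            t->buckets = new_buckets;
            t->num_buckets = new_n;
            return true;
    }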

Measure              Judy  Dense hash  Sparse hash  Dumb hash
Memory               420M  640M        895M         895M
Initial insert (ns)  137   332         471          482
Linear hit (ns)      19    174         176          214
Linear miss (ns)     14    234         184          286

The main story here is that Judy's in-order access is unbeatably cache-friendly.  It's nice to see that the bitstealing optimization pays off for the hashtables, since fewer buckets need to be dereferenced, and similarly for the lower density, but Judy wins hands down here.
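For anyone wondering what bitstealing means here: object pointers are aligned, so the low bits of each bucket are spare, and stashing a few hash bits in them lets a lookup skip buckets that can't possibly match without touching the object's cache line.  A minimal sketch of the idea (not the actual ccan/htable code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Objects are at least 4-byte aligned, so the bottom two bits of a
     * bucket pointer are always zero and can carry two bits of the hash. */
    #define STOLEN_BITS  2
    #define STOLEN_MASK  (((uintptr_t)1 << STOLEN_BITS) - 1)

    static inline uintptr_t make_bucket(const void *obj, size_t hash)
    {
            return (uintptr_t)obj | (hash & STOLEN_MASK);
    }

    /* Cheap pre-filter: if the stolen bits don't match the hash we want,
     * this bucket can't hold our object, so we never dereference it. */
    static inline bool bucket_might_match(uintptr_t bucket, size_t hash)
    {
            return (bucket & STOLEN_MASK) == (hash & STOLEN_MASK);
    }

    static inline void *bucket_ptr(uintptr_t bucket)
    {
            return (void *)(bucket & ~STOLEN_MASK);
    }

The "dumb" variant in these tables is simply the same hash with that pre-filter switched off, so every probed bucket costs a dereference.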

OK, now we look up random elements and access them: the access matters, since it's more realistic, and the hash table has to touch the element to verify the key anyway.  Our elements are allocated as one huge array, so accessing them in random order (even via a hash table) costs more than the linear walk did.  Oh, and our "random" is actually "10007 apart", which is harder on the caches than true random would be, since genuinely random accesses might occasionally be cache hot.
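The key sequence for the random pass is generated with something like the following (my reconstruction of the "10007 apart" description, not the benchmark source); because 10007 is prime, striding by it modulo the number of keys visits every key exactly once, but successive lookups are always far apart:

    #include <stdint.h>

    #define NUM_KEYS 50000000ULL
    #define STRIDE   10007ULL

    /* Each probe is STRIDE away from the last, wrapping around, so no
     * lookup benefits from cache lines pulled in by the previous one. */
    static uint64_t next_key(uint64_t prev)
    {
            return (prev + STRIDE) % NUM_KEYS;
    }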

Measure          Judy  Dense hash  Sparse hash  Dumb hash
Random hit (ns)  361   330         357          375

The hashtable itself doesn't often cache miss, and the denser hash seems to win more (from its smaller table) than it loses from having to look at more buckets, but with this kind of key density Judy is very competitive anyway.

Now we delete all the objects and reinsert them (in order):

Measure        Judy  Dense hash  Sparse hash  Dumb hash
Delete (ns)    199   171         200          172
Reinsert (ns)  148   191         197          175

As my hashtable doesn't shrink, it gets an artificial benefit here, yet Judy again stays very competitive.  Our dumb hash does surprisingly well too; there's no reason for it to differ from the "smart" sparse hash, but the effect was repeatable.  Perhaps the CPU handles the no-op bit manipulations (ANDing with all 1s, etc.) faster?

The Bad: Unordered Sparse Keys

Now let's look at churn.  Fresh data structures are easy, but over time we want to see what happens as elements are added and deleted.  In my case, I iterate through the keys, deleting each object and re-adding it with its key increased by 50,000,000.  With a hashtable that uses delete markers, this can reveal weaknesses in the way they are garbage collected.  Then I iterate one last time and space the keys out 9 values apart.  The objects themselves are still dense, but the keys are far less so (ranging from 0 to 449,999,991: I chose 9 because I was originally playing with a few hundred million objects and didn't want to overflow).
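In JudyL terms the first churn pass looks roughly like this (a sketch using the raw macros rather than my wrappers; the hashtable side is driven the same way):

    #include <Judy.h>
    #include <assert.h>

    #define NUM 50000000UL

    /* Every object is deleted and re-added under key + NUM, so the
     * structure sees NUM deletes and NUM inserts while staying the same
     * size throughout.  The final pass that spreads keys 9 apart is
     * similar. */
    static void churn(Pvoid_t *judy)
    {
            Word_t i, obj, *pvalue;
            int rc;

            for (i = 0; i < NUM; i++) {
                    JLG(pvalue, *judy, i);        /* fetch object under old key */
                    assert(pvalue);
                    obj = *pvalue;
                    JLD(rc, *judy, i);            /* remove the old key */
                    JLI(pvalue, *judy, i + NUM);  /* insert under the new key */
                    *pvalue = obj;
            }
    }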

So with sparser keys and a worn-in data structure, we do our three kinds of lookups again:

Measure           Judy  Dense hash  Sparse hash  Dumb hash
Memory            607M  640M        895M         895M
Linear hit (ns)   61    215         202          265
Linear miss (ns)  60    301         223          278
Random hit (ns)   736   405         386          445

Note that Judy consumes a little more memory at this point than it did with dense keys, though it's still ahead of even the densest hash.

But finally we see a case where a good hashtable is clearly superior: random hits across the full range of keys.  The hashtable has suffered a little (there are quite a few deleted records left after all the churn), but a 10-20% penalty for churn is not as bad as Judy's 100% penalty for sparser keys.

Last of all, we delete every second object, then re-add each one with a modified key:

Measure           Judy  Dense hash  Sparse hash  Dumb hash
Delete half (ns)  266   120         103          103
Add half (ns)     205   200         103          105

Judy is still suffering from those sparse keys, though add is competitive with the dense hash (made even denser by those deleted elements).

Summary

Judy allows ordered access and lets you keep the key outside your structure rather than embedding it.  If either of those matters, the choice is clear.  Likewise if you don't want a latency spike as the hash table grows and rehashes, or you're concerned about malicious insertion patterns bombing a hash chain: both again favor Judy arrays.

If you access adjacent keys, Judy will win.  If you have dense keys, Judy will be competitive.  If you have random access and sparse keys, Judy lookups and deletes could be twice as slow as an optimized hash table, but Judy will use less memory: in these tests the hashtables needed anywhere from a few percent to more than twice Judy's memory, depending on their density.

And one final note on random access: if it's not quite random, such as when every second lookup is near the previous one, Judy's time drops from 736 to 410 nanoseconds, back to being competitive with the hashtables.  It's quite impressive.
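For what it's worth, that "not quite random" pattern is just the earlier stride generator with every second lookup landing next to the previous key (again my reconstruction of the description, not the benchmark source):

    #include <stdint.h>

    #define NUM_KEYS 50000000ULL
    #define STRIDE   10007ULL

    /* Alternate between jumping far away and probing right next to the
     * last key, so roughly half the lookups hit recently-fetched lines. */
    static uint64_t next_semirandom_key(uint64_t prev, uint64_t n)
    {
            if (n & 1)
                    return (prev + 1) % NUM_KEYS;   /* near the previous lookup */
            return (prev + STRIDE) % NUM_KEYS;      /* far away */
    }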

[Here are the URLs: the Judy arrays homepage (but your distro probably has a package, Ubuntu does), the ccan wrapper module, and the ccan htable module.]