Rusty Russell's Coding Blog | Stealing From Smart People

Oct/09

27

Not Always Lovely Blooms…

So, with my recent evangelizing of Bloom Filters, Tridge decided to try to apply them on a problem he was having.  An array of several thousand of unsorted strings, each maybe 100 bytes, which needed checking for duplicates.  In the normal case, we’d expect few or no duplicates.

A Bloom Filter for this is quite simple: Wikipedia tells you how to calculate the optimal number of hashes to use and the optimal number of bits given (say) a 1 in a million chance of a false positive.

I handed Tridge some example code and he put it in alongside a naive qsort implementation.  It’s in his junkcode dir.  The result?  qsort scales better, and is about 100x faster.  The reason?  Sorting usually only has to examine the first few characters, but creating N hashes means (in my implementation using the always-awesome Jenkins lookup3 hash) passing over the whole string N/2 times.  That’s always going to lose: even if I coded a single-pass multihash, it’s still having to look at the whole string.

Sometimes, simplicity and standard routines are not just clearer, but faster.

RSS Feed

3 Comments for Not Always Lovely Blooms…

Robert Collins | October 27, 2009 at 9:50 pm

We’ve found that bloom filters are really marginal much of the time – its a very clever idea, but you have to have rather precise circumstances for them to be really useful.

Anonymous | October 29, 2009 at 8:20 pm

Presumably if you had a *lot* more strings (enough that log(n) becomes bigger than the average string length), bloom filters might start to prove faster.

Author comment by rusty | October 29, 2009 at 9:17 pm

Tridge says we’re down a factor of 100x at 100,000 strings. That means we need log(n) to grow by a factor of 100, which is for all intents infinite.

Note that for unaligned strings and cache line size 128, qsort will almost always take a single cacheline per string, bloom filter 2 cachelines about 100/128 times. That’s probably the real killer here (why he saw it get worse as the number of strings gets bigger).

Also, the Bloom filter size grows linearly with the number of elements (for constant false-positive probability). If you can’t sort in place, the sort solution grows by one pointer per element too. If you can, it’s a lose for Bloom.

Hope that helps!

Leave a comment!

«

»

Find it!

Theme Design by devolux.org

Tag Cloud