Not Always Lovely Blooms…

So, with my recent evangelizing of Bloom Filters, Tridge decided to try applying them to a problem he was having: an array of several thousand unsorted strings, each maybe 100 bytes, which needed checking for duplicates.  In the normal case, we’d expect few or no duplicates.

A Bloom Filter for this is quite simple: Wikipedia tells you how to calculate the optimal number of hashes to use and the optimal number of bits given (say) a 1 in a million chance of a false positive.
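
For the record, those formulas are simple enough to sketch in a few lines (the numbers and variable names here are mine, not from the real code): for n elements and false-positive probability p, the optimal size is m = -n·ln(p)/(ln 2)² bits with k = (m/n)·ln 2 hashes, which for one-in-a-million works out to roughly 29 bits and 20 hashes per element.

```c
/* Minimal sketch of the standard Bloom filter sizing formulas; the
 * numbers are illustrative, not from Tridge's actual problem. */
#include <math.h>
#include <stdio.h>

int main(void)
{
        double n = 10000;       /* a few thousand strings */
        double p = 1e-6;        /* 1-in-a-million false positive chance */
        double m = -n * log(p) / (log(2) * log(2));     /* optimal bits */
        double k = m / n * log(2);                      /* optimal hashes */

        printf("%.0f bits (%.1f per element), %.1f hashes\n", m, m / n, k);
        return 0;
}
```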

I handed Tridge some example code and he put it in alongside a naive qsort implementation.  It’s in his junkcode dir.  The result?  qsort scales better, and is about 100x faster.  The reason?  Sorting usually only has to examine the first few characters of each string, but creating N hashes means (in my implementation, using the always-awesome Jenkins lookup3 hash, which produces two hash values per pass) passing over the whole string N/2 times.  That’s always going to lose: even if I coded a single-pass multihash, it would still have to look at the whole string.
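
For flavour, here’s a rough sketch of the sort-based idea (mine, not the actual junkcode): sort the pointer array, and any duplicates end up adjacent.  The comparisons usually bail out after the first differing byte, which is exactly why this beats hashing whole strings.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: qsort hands us pointers to our char * elements. */
static int cmp_str(const void *a, const void *b)
{
        return strcmp(*(const char * const *)a, *(const char * const *)b);
}

/* Sort, then scan neighbours: duplicates sort next to each other. */
static bool has_duplicates(char **strings, size_t n)
{
        size_t i;

        qsort(strings, n, sizeof(strings[0]), cmp_str);
        for (i = 1; i < n; i++)
                if (!strcmp(strings[i-1], strings[i]))
                        return true;
        return false;
}
```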

Sometimes, simplicity and standard routines are not just clearer, but faster.

4 replies on “Not Always Lovely Blooms…”

  1. Presumably if you had a *lot* more strings (enough that log(n) becomes bigger than the average string length), bloom filters might start to prove faster.

    1. Tridge says we’re down by a factor of 100 at 100,000 strings. That means we need log(n) to grow by a factor of 100, which is, for all intents and purposes, infinite.

      Note that for unaligned strings and a 128-byte cache line, qsort will almost always touch a single cacheline per string, while the Bloom filter touches 2 cachelines about 100/128 of the time (the arithmetic is sketched after the comments). That’s probably the real killer here, and why he saw it get worse as the number of strings grew.

      Also, the Bloom filter size grows linearly with the number of elements (for constant false-positive probability). If you can’t sort in place, the sort solution grows by one pointer per element too. If you can sort in place, it’s an outright lose for Bloom.

      Hope that helps!

  2. Bloom filters are mainly useful when you have far more items to check for than you can store in memory. Even with 16 hashes (the upper end of the production Bloom filters I’ve seen), that’s only 2 bytes in memory per element that you need to worry about. Compare that to a sort implementation, which requires at least 32 bits/4 bytes per pointer (really 64 bits/8 bytes for any problem where you should be considering Bloom filters), plus whatever your actual data content takes. It doesn’t take long before even high-end server hardware runs at its limit when counting (for example) monthly unique user/content visits at one billion hits a day.

    But with Bloom filters, you get a 2-20x reduction in in-memory space, and can answer your queries as they come in (as opposed to the batch operation you’re really comparing against). It’s a time/memory trade-off that many people with bigger problems are happy to make.

    When a problem fits in memory, there are usually better options than techniques designed for low-memory conditions. In this case, quicksort was it for you. Other people aren’t so lucky as to have small problems ;)
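
The cacheline figure from the second comment is easy to check; here’s the back-of-the-envelope arithmetic (my sketch, assuming the 100-byte strings and 128-byte lines discussed above):

```c
#include <stdio.h>

int main(void)
{
        int line = 128, len = 100;
        /* A string fits in one cacheline only if it starts at one of the
         * first line - len + 1 = 29 offsets within the line. */
        int fits = line - len + 1;

        printf("P(two cachelines) = %d/%d = %.2f\n",
               line - fits, line, (double)(line - fits) / line);
        return 0;
}
```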
