Not Always Lovely Blooms…

So, with my recent evangelizing of Bloom Filters, Tridge decided to try to apply them on a problem he was having. An array of several thousand of unsorted strings, each maybe 100 bytes, which needed checking for duplicates. In the normal case, we'd expect few or no duplicates.

A Bloom Filter for this is quite simple: Wikipedia tells you how to calculate the optimal number of hashes to use and the optimal number of bits given (say) a 1 in a million chance of a false positive.

I handed Tridge some example code and he put it in alongside a naive qsort implementation. It's in his junkcode dir. The result? qsort scales better, and is about 100x faster. The reason? Sorting usually only has to examine the first few characters, but creating N hashes means (in my implementation using the always-awesome Jenkins lookup3 hash) passing over the whole string N/2 times. That's always going to lose: even if I coded a single-pass multihash, it's still having to look at the whole string.

Sometimes, simplicity and standard routines are not just clearer, but faster.