ccan/mem's memeqzero iteration
On Thursday I was writing some code, and I wanted to test if an array was all zero. First I checked if ccan/mem had anything, in case I missed it, then jumped on IRC to ask the author (and overall CCAN co-maintainer) David Gibson about it.
We bikeshedded around names: memallzero? memiszero? memeqz? memeqzero() won by analogy with the already-extant memeq and memeqstr. Then I asked:
rusty: dwg: now, how much time do I waste optimizing?
dwg: rusty, in the first commit, none
Exactly five minutes later I had it implemented and tested.
The Naive Approach: Times: 1/7/310/37064 Bytes: 50
bool memeqzero(const void *data, size_t length) { const unsigned char *p = data; while (length) { if (*p) return false; p++; length--; } return true; }
As a summary, I've give the nanoseconds for searching through 1,8,512 and 65536 bytes only.
Another 20 minutes, and I had written that benchmark, and an optimized version.
128-byte Static Buffer: Times: 6/8/48/5872 Bytes: 108
Here's my first attempt at optimization; using a static array of 128 bytes of zeroes and assuming memcmp is well-optimized for fixed-length comparisons. Worse for small sizes, much better for big.
const unsigned char *p = data; static unsigned long zeroes[16]; while (length > sizeof(zeroes)) { if (memcmp(zeroes, p, sizeof(zeroes))) return false; p += sizeof(zeroes); length -= sizeof(zeroes); } return memcmp(zeroes, p, length) == 0;
Using a 64-bit Constant: Times: 12/12/84/6418 Bytes: 169
dwg: but blowing a cacheline (more or less) on zeroes for comparison, which isn't necessarily a win
Using a single zero uint64_t for comparison is pretty messy:
bool memeqzero(const void *data, size_t length) { const unsigned char *p = data; const unsigned long zero = 0; size_t pre; pre = (size_t)p % sizeof(unsigned long); if (pre) { size_t n = sizeof(unsigned long) - pre; if (n > length) n = length; if (memcmp(p, &zero, n) != 0) return false; p += n; length -= n; } while (length > sizeof(zero)) { if (*(unsigned long *)p != zero) return false; p += sizeof(zero); length -= sizeof(zero); } return memcmp(&zero, p, length) == 0; }
And, worse in every way!
Using a 64-bit Constant With Open-coded Ends: Times: 4/9/68/6444 Bytes: 165
dwg: rusty, what colour is the bikeshed if you have an explicit char * loop for the pre and post?
That's slightly better, but memcmp still wins over large distances, perhaps due to prefetching or other tricks.
Epiphany #1: We Already Have Zeroes: Times 3/5/92/5801 Bytes: 422
Then I realized that we don't need a static buffer: we know everything we've already tested is zero! So I open coded the first 16 byte compare, then memcmp()ed against the previous bytes, doubling each time. Then a final memcmp for the tail. Clever huh?
But it no faster than the static buffer case on the high end, and much bigger.
dwg: rusty, that is brilliant. but being brilliant isn't enough to make things work, necessarily :p
Epiphany #2: memcmp can overlap: Times 3/5/37/2823 Bytes: 307
My doubling logic above was because my brain wasn't completely in phase: unlike memcpy, memcmp arguments can happily overlap! It's still worth doing an open-coded loop to start (gcc unrolls it here with -O3), but after 16 it's worth memcmping with the previous 16 bytes. This is as fast as naive with as little as 2 bytes, and the fastest solution by far with larger numbers:
const unsigned char *p = data; size_t len; /* Check first 16 bytes manually */ for (len = 0; len < 16; len++) { if (!length) return true; if (*p) return false; p++; length--; } /* Now we know that's zero, memcmp with self. */ return memcmp(data, p, length) == 0;
You can find the final code in CCAN (or on Github) including the benchmark code.
Finally, after about 4 hours of random yak shaving, it turns out lightning doesn't even want to use memeqzero() any more! Hopefully someone else will benefit.