ccan/mem's memeqzero iteration
On Thursday I was writing some code, and I wanted to test if an array was all zero. First I checked if ccan/mem had anything, in case I missed it, then jumped on IRC to ask the author (and overall CCAN co-maintainer) David Gibson about it.
We bikeshedded around names: memallzero? memiszero? memeqz? memeqzero() won by analogy with the already-extant memeq and memeqstr. Then I asked:
rusty: dwg: now, how much time do I waste optimizing?
dwg: rusty, in the first commit, none
Exactly five minutes later I had it implemented and tested.
The Naive Approach: Times: 1/7/310/37064 Bytes: 50
bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;

    while (length) {
        if (*p)
            return false;
        p++;
        length--;
    }
    return true;
}
As a summary, I'll give only the nanoseconds for searching through 1, 8, 512 and 65536 bytes.
Another 20 minutes, and I had written that benchmark, and an optimized version.
128-byte Static Buffer: Times: 6/8/48/5872 Bytes: 108
Here's my first attempt at optimization: use a static array of 128 bytes of zeroes, and assume memcmp is well-optimized for fixed-length comparisons. Worse for small sizes, much better for big.
bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;
    static unsigned long zeroes[16];

    while (length > sizeof(zeroes)) {
        if (memcmp(zeroes, p, sizeof(zeroes)))
            return false;
        p += sizeof(zeroes);
        length -= sizeof(zeroes);
    }
    return memcmp(zeroes, p, length) == 0;
}
Using a 64-bit Constant: Times: 12/12/84/6418 Bytes: 169
dwg: but blowing a cacheline (more or less) on zeroes for comparison, which isn't necessarily a win
Using a single zero uint64_t for comparison is pretty messy:
bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;
    const unsigned long zero = 0;
    size_t pre;

    pre = (size_t)p % sizeof(unsigned long);
    if (pre) {
        size_t n = sizeof(unsigned long) - pre;
        if (n > length)
            n = length;
        if (memcmp(p, &zero, n) != 0)
            return false;
        p += n;
        length -= n;
    }
    while (length > sizeof(zero)) {
        if (*(unsigned long *)p != zero)
            return false;
        p += sizeof(zero);
        length -= sizeof(zero);
    }
    return memcmp(&zero, p, length) == 0;
}
And, worse in every way!
Using a 64-bit Constant With Open-coded Ends: Times: 4/9/68/6444 Bytes: 165
dwg: rusty, what colour is the bikeshed if you have an explicit char * loop for the pre and post?
That's slightly better, but memcmp still wins over large distances, perhaps due to prefetching or other tricks.
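The post doesn't show the code for this variant, but a plausible sketch (my reconstruction, not the original) uses plain byte loops for the unaligned head and the tail, with a word loop in between:

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the open-coded-ends variant (my reconstruction; the post
 * doesn't show this one): byte loops for the unaligned head and the
 * tail, a word loop for the aligned middle. */
bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;

    /* Head: bytes until p is word-aligned (or we run out). */
    while (length && (size_t)p % sizeof(unsigned long)) {
        if (*p)
            return false;
        p++;
        length--;
    }

    /* Middle: whole aligned words. */
    while (length >= sizeof(unsigned long)) {
        if (*(const unsigned long *)p)
            return false;
        p += sizeof(unsigned long);
        length -= sizeof(unsigned long);
    }

    /* Tail: leftover bytes. */
    while (length) {
        if (*p)
            return false;
        p++;
        length--;
    }
    return true;
}
```

This avoids blowing a cacheline on a static zero buffer, at the cost of three loops.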
Epiphany #1: We Already Have Zeroes: Times: 3/5/92/5801 Bytes: 422
Then I realized that we don't need a static buffer: we know everything we've already tested is zero! So I open-coded the first 16-byte compare, then memcmp()ed against the previous bytes, doubling each time. Then a final memcmp for the tail. Clever huh?
But it's no faster than the static buffer case on the high end, and much bigger.
dwg: rusty, that is brilliant. but being brilliant isn't enough to make things work, necessarily :p
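The doubling version isn't shown in the post either; a sketch of the idea (again my reconstruction) open-codes the first 16 bytes, then memcmps each subsequent chunk against the already-verified prefix, doubling the chunk size each time:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Reconstruction of the "doubling" idea (a sketch, not the original):
 * once the first 16 bytes are known zero, compare the next 16 against
 * them, then the next 32 against the first 32, and so on, with a final
 * memcmp for whatever tail remains. */
bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;
    size_t done;

    /* Open-code the first 16 bytes. */
    for (done = 0; done < 16; done++) {
        if (done == length)
            return true;
        if (p[done])
            return false;
    }

    /* Compare the next `done` bytes against the known-zero prefix,
     * doubling the verified region each time. */
    while (length - done > done) {
        if (memcmp(p, p + done, done) != 0)
            return false;
        done *= 2;
    }

    /* Tail: shorter than the verified prefix, so one compare covers it. */
    return memcmp(p, p + done, length - done) == 0;
}
```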
Epiphany #2: memcmp Can Overlap: Times: 3/5/37/2823 Bytes: 307
My doubling logic above was because my brain wasn't completely in phase: unlike memcpy, memcmp arguments can happily overlap! It's still worth doing an open-coded loop to start (gcc unrolls it here with -O3), but after 16 it's worth memcmping with the previous 16 bytes. This is as fast as naive with as little as 2 bytes, and the fastest solution by far with larger numbers:
bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;
    size_t len;

    /* Check first 16 bytes manually */
    for (len = 0; len < 16; len++) {
        if (!length)
            return true;
        if (*p)
            return false;
        p++;
        length--;
    }

    /* Now we know that's zero, memcmp with self. */
    return memcmp(data, p, length) == 0;
}
You can find the final code in CCAN (or on Github) including the benchmark code.
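One nice property of having the naive loop around is that it makes an oracle for the clever version. Here's a quick cross-check harness (mine, not from the post; the `_naive`/`_fast` names are mine too) that plants a nonzero byte at every offset of every length and demands the two versions agree:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* The naive byte loop: the obviously-correct oracle. */
static bool memeqzero_naive(const void *data, size_t length)
{
    const unsigned char *p = data;
    while (length) {
        if (*p)
            return false;
        p++;
        length--;
    }
    return true;
}

/* The final overlapping-memcmp version. */
static bool memeqzero_fast(const void *data, size_t length)
{
    const unsigned char *p = data;
    size_t len;

    for (len = 0; len < 16; len++) {
        if (!length)
            return true;
        if (*p)
            return false;
        p++;
        length--;
    }
    /* [0,16) is known zero; compare the rest against it, overlapping.
     * memcmp, unlike memcpy, allows overlapping arguments. */
    return memcmp(data, p, length) == 0;
}

/* Returns true if the two versions agree for every length up to `max`,
 * both on all-zero buffers and with a nonzero byte at each offset. */
static bool cross_check(size_t max)
{
    unsigned char *buf = calloc(max, 1);
    if (!buf)
        return false;
    for (size_t len = 0; len <= max; len++) {
        if (memeqzero_naive(buf, len) != memeqzero_fast(buf, len))
            goto fail;
        for (size_t i = 0; i < len; i++) {
            buf[i] = 0xff;
            if (memeqzero_naive(buf, len) != memeqzero_fast(buf, len))
                goto fail;
            buf[i] = 0;
        }
    }
    free(buf);
    return true;
fail:
    free(buf);
    return false;
}
```

In particular this exercises the lengths around 16, where the open-coded prefix hands off to the overlapping memcmp.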
Finally, after about 4 hours of random yak shaving, it turns out lightning doesn't even want to use memeqzero() any more! Hopefully someone else will benefit.