On Thursday I was writing some code, and I wanted to test if an array was all zero.  First I checked if ccan/mem had anything, in case I missed it, then jumped on IRC to ask the author (and overall CCAN co-maintainer) David Gibson about it.

We bikeshedded around names: memallzero? memiszero? memeqz? memeqzero() won by analogy with the already-extant memeq and memeqstr. Then I asked:

rusty: dwg: now, how much time do I waste optimizing?
dwg: rusty, in the first commit, none

Exactly five minutes later I had it implemented and tested.

### The Naive Approach: Times: 1/7/310/37064 Bytes: 50

bool memeqzero(const void *data, size_t length)
{
const unsigned char *p = data;

while (length) {
if (*p)
return false;
p++;
length--;
}
return true;
}

As a summary, I've give the nanoseconds for searching through 1,8,512 and 65536 bytes only.

Another 20 minutes, and I had written that benchmark, and an optimized version.

### 128-byte Static Buffer: Times: 6/8/48/5872 Bytes: 108

Here's my first attempt at optimization; using a static array of 128 bytes of zeroes and assuming memcmp is well-optimized for fixed-length comparisons.  Worse for small sizes, much better for big.

const unsigned char *p = data;
static unsigned long zeroes[16];

while (length > sizeof(zeroes)) {
if (memcmp(zeroes, p, sizeof(zeroes)))
return false;
p += sizeof(zeroes);
length -= sizeof(zeroes);
}
return memcmp(zeroes, p, length) == 0;

### Using a 64-bit Constant: Times: 12/12/84/6418 Bytes: 169

dwg: but blowing a cacheline (more or less) on zeroes for comparison, which isn't necessarily a win

Using a single zero uint64_t for comparison is pretty messy:

bool memeqzero(const void *data, size_t length)
{
const unsigned char *p = data;
const unsigned long zero = 0;
size_t pre;
pre = (size_t)p % sizeof(unsigned long);
if (pre) {
size_t n = sizeof(unsigned long) - pre;
if (n > length)
n = length;
if (memcmp(p, &zero, n) != 0)
return false;
p += n;
length -= n;
}
while (length > sizeof(zero)) {
if (*(unsigned long *)p != zero)
return false;
p += sizeof(zero);
length -= sizeof(zero);
}
return memcmp(&zero, p, length) == 0;
}

And, worse in every way!

### Using a 64-bit Constant With Open-coded Ends: Times: 4/9/68/6444 Bytes: 165

dwg: rusty, what colour is the bikeshed if you have an explicit char * loop for the pre and post?

That's slightly better, but memcmp still wins over large distances, perhaps due to prefetching or other tricks.

### Epiphany #1: We Already Have Zeroes: Times 3/5/92/5801 Bytes: 422

Then I realized that we don't need a static buffer: we know everything we've already tested is zero!  So I open coded the first 16 byte compare, then memcmp()ed against the previous bytes, doubling each time.  Then a final memcmp for the tail.  Clever huh?

But it no faster than the static buffer case on the high end, and much bigger.

dwg: rusty, that is brilliant. but being brilliant isn't enough to make things work, necessarily :p

### Epiphany #2: memcmp can overlap: Times 3/5/37/2823 Bytes: 307

My doubling logic above was because my brain wasn't completely in phase: unlike memcpy, memcmp arguments can happily overlap!  It's still worth doing an open-coded loop to start (gcc unrolls it here with -O3), but after 16 it's worth memcmping with the previous 16 bytes.  This is as fast as naive with as little as 2 bytes, and the fastest solution by far with larger numbers:

const unsigned char *p = data;
size_t len;

/* Check first 16 bytes manually */
for (len = 0; len < 16; len++) {
if (!length)
return true;
if (*p)
return false;
p++;
length--;
}

/* Now we know that's zero, memcmp with self. */
return memcmp(data, p, length) == 0;

You can find the final code in CCAN (or on Github) including the benchmark code.

Finally, after about 4 hours of random yak shaving, it turns out lightning doesn't even want to use memeqzero() any more!  Hopefully someone else will benefit.