# ccan/mem's memeqzero iteration

On Thursday I was writing some code, and I wanted to test if an array was all zero. First I checked if ccan/mem had anything, in case I missed it, then jumped on IRC to ask the author (and overall CCAN co-maintainer) David Gibson about it.

We bikeshedded around names: memallzero? memiszero? memeqz? memeqzero() won by analogy with the already-extant memeq and memeqstr. Then I asked:

*rusty: dwg: now, how much time do I waste optimizing?*

*dwg: rusty, in the first commit, none*

Exactly five minutes later I had it implemented and tested.

### The Naive Approach: Times: **1/7/310/37064** Bytes: **50**


```c
bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;

    while (length) {
        if (*p)
            return false;
        p++;
        length--;
    }
    return true;
}
```

As a summary, I give only the times in nanoseconds for searching through 1, 8, 512 and 65536 bytes.

Another 20 minutes, and I had written that benchmark, and an optimized version.

### 128-byte Static Buffer: Times: **6/8/48/5872** Bytes: **108**


Here's my first attempt at optimization: use a static array of 128 zero bytes, assuming memcmp is well-optimized for fixed-length comparisons. It's worse for small sizes, but much better for big ones.

```c
bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;
    static unsigned long zeroes[16]; /* 128 bytes on 64-bit */

    while (length > sizeof(zeroes)) {
        if (memcmp(zeroes, p, sizeof(zeroes)))
            return false;
        p += sizeof(zeroes);
        length -= sizeof(zeroes);
    }
    return memcmp(zeroes, p, length) == 0;
}
```

### Using a 64-bit Constant: Times: **12/12/84/6418** Bytes: **169**


*dwg: but blowing a cacheline (more or less) on zeroes for comparison, which isn't necessarily a win*

Using a single zero uint64_t for comparison is pretty messy:

```c
bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;
    const unsigned long zero = 0;
    size_t pre;

    /* Compare the unaligned head byte-by-byte. */
    pre = (size_t)p % sizeof(unsigned long);
    if (pre) {
        size_t n = sizeof(unsigned long) - pre;
        if (n > length)
            n = length;
        if (memcmp(p, &zero, n) != 0)
            return false;
        p += n;
        length -= n;
    }

    /* Word-at-a-time through the aligned middle. */
    while (length > sizeof(zero)) {
        if (*(const unsigned long *)p != zero)
            return false;
        p += sizeof(zero);
        length -= sizeof(zero);
    }

    /* Compare the tail. */
    return memcmp(&zero, p, length) == 0;
}
```

And, worse in every way!

### Using a 64-bit Constant With Open-coded Ends: Times: **4/9/68/6444** Bytes: **165**


*dwg: rusty, what colour is the bikeshed if you have an explicit `char *` loop for the pre and post?*

That's slightly better, but memcmp still wins over large distances, perhaps due to prefetching or other tricks.
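dwg's suggestion looks roughly like this (my reconstruction, not the exact committed code): explicit byte loops for the unaligned head and the tail, with the word loop in between:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;

    /* Open-coded byte loop until p is aligned for unsigned long. */
    while (length && (uintptr_t)p % sizeof(unsigned long)) {
        if (*p)
            return false;
        p++;
        length--;
    }

    /* Aligned word-at-a-time comparison. */
    while (length >= sizeof(unsigned long)) {
        if (*(const unsigned long *)p)
            return false;
        p += sizeof(unsigned long);
        length -= sizeof(unsigned long);
    }

    /* Open-coded byte loop for the tail. */
    while (length) {
        if (*p)
            return false;
        p++;
        length--;
    }
    return true;
}
```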

### Epiphany #1: We Already Have Zeroes: Times: **3/5/92/5801** Bytes: **422**

Then I realized that we don't need a static buffer: we know everything we've already tested is zero! So I open-coded the first 16-byte compare, then memcmp()ed each subsequent chunk against the previous bytes, doubling the size each time, with a final memcmp for the tail. Clever, huh?
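That doubling scheme might be sketched like this (my reconstruction, not the exact code I committed — the real version was bigger):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;
    size_t checked;

    /* Open-coded check of the first 16 bytes. */
    for (checked = 0; checked < 16; checked++) {
        if (!length)
            return true;
        if (*p)
            return false;
        p++;
        length--;
    }

    /* Everything before p is known zero: compare the next chunk
     * against that prefix, doubling the chunk size each time. */
    while (length > checked) {
        if (memcmp(data, p, checked))
            return false;
        p += checked;
        length -= checked;
        checked *= 2;
    }

    /* Final tail compare against the known-zero prefix. */
    return memcmp(data, p, length) == 0;
}
```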

But it's no faster than the static buffer case on the high end, and much bigger.

*dwg: rusty, that is brilliant. but being brilliant isn't enough to make things work, necessarily :p*

### Epiphany #2: memcmp Can Overlap: Times: **3/5/37/2823** Bytes: **307**

My doubling logic above was because my brain wasn't completely in phase: unlike memcpy, memcmp arguments can happily overlap! It's still worth doing an open-coded loop to start (gcc unrolls it here with -O3), but after 16 it's worth memcmping with the previous 16 bytes. This is as fast as naive with as little as 2 bytes, and the fastest solution by far with larger numbers:

```c
bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;
    size_t len;

    /* Check first 16 bytes manually. */
    for (len = 0; len < 16; len++) {
        if (!length)
            return true;
        if (*p)
            return false;
        p++;
        length--;
    }

    /* Now we know that's zero, memcmp with self. */
    return memcmp(data, p, length) == 0;
}
```

You can find the final code in CCAN (or on GitHub), including the benchmark code.

Finally, after about 4 hours of random yak shaving, it turns out lightning doesn't even want to use memeqzero() any more! Hopefully someone else will benefit.