Rusty Russell's Coding Blog | Stealing From Smart People

Oct/15

20

ccan/mem’s memeqzero iteration

On Thursday I was writing some code, and I wanted to test if an array was all zero.  First I checked if ccan/mem had anything, in case I missed it, then jumped on IRC to ask the author (and overall CCAN co-maintainer) David Gibson about it.

We bikeshedded around names: memallzero? memiszero? memeqz? memeqzero() won by analogy with the already-extant memeq and memeqstr. Then I asked:

rusty: dwg: now, how much time do I waste optimizing?
dwg: rusty, in the first commit, none

Exactly five minutes later I had it implemented and tested.

The Naive Approach: Times: 1/7/310/37064 Bytes: 50

bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;

    while (length) {
        if (*p)
            return false;
        p++;
        length--;
    }
    return true;
}

As a summary, I’ve give the nanoseconds for searching through 1,8,512 and 65536 bytes only.

Another 20 minutes, and I had written that benchmark, and an optimized version.

128-byte Static Buffer: Times: 6/8/48/5872 Bytes: 108

Here’s my first attempt at optimization; using a static array of 128 bytes of zeroes and assuming memcmp is well-optimized for fixed-length comparisons.  Worse for small sizes, much better for big.

 const unsigned char *p = data;
 static unsigned long zeroes[16];

 while (length > sizeof(zeroes)) {
     if (memcmp(zeroes, p, sizeof(zeroes)))
         return false;
     p += sizeof(zeroes);
     length -= sizeof(zeroes);
 }
 return memcmp(zeroes, p, length) == 0;

Using a 64-bit Constant: Times: 12/12/84/6418 Bytes: 169

dwg: but blowing a cacheline (more or less) on zeroes for comparison, which isn’t necessarily a win

Using a single zero uint64_t for comparison is pretty messy:

bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;
    const unsigned long zero = 0;
    size_t pre;
    pre = (size_t)p % sizeof(unsigned long);
    if (pre) {
        size_t n = sizeof(unsigned long) - pre;
        if (n > length)
            n = length;
        if (memcmp(p, &zero, n) != 0)
            return false;
        p += n;
        length -= n;
    }
    while (length > sizeof(zero)) {
        if (*(unsigned long *)p != zero)
            return false;
        p += sizeof(zero);
        length -= sizeof(zero);
    }
    return memcmp(&zero, p, length) == 0;
}

And, worse in every way!

Using a 64-bit Constant With Open-coded Ends: Times: 4/9/68/6444 Bytes: 165

dwg: rusty, what colour is the bikeshed if you have an explicit char * loop for the pre and post?

That’s slightly better, but memcmp still wins over large distances, perhaps due to prefetching or other tricks.

Epiphany #1: We Already Have Zeroes: Times 3/5/92/5801 Bytes: 422

Then I realized that we don’t need a static buffer: we know everything we’ve already tested is zero!  So I open coded the first 16 byte compare, then memcmp()ed against the previous bytes, doubling each time.  Then a final memcmp for the tail.  Clever huh?

But it no faster than the static buffer case on the high end, and much bigger.

dwg: rusty, that is brilliant. but being brilliant isn’t enough to make things work, necessarily :p

Epiphany #2: memcmp can overlap: Times 3/5/37/2823 Bytes: 307

My doubling logic above was because my brain wasn’t completely in phase: unlike memcpy, memcmp arguments can happily overlap!  It’s still worth doing an open-coded loop to start (gcc unrolls it here with -O3), but after 16 it’s worth memcmping with the previous 16 bytes.  This is as fast as naive with as little as 2 bytes, and the fastest solution by far with larger numbers:

 const unsigned char *p = data;
 size_t len;

 /* Check first 16 bytes manually */
 for (len = 0; len < 16; len++) {
     if (!length)
         return true;
     if (*p)
         return false;
     p++;
     length--;
 }

 /* Now we know that's zero, memcmp with self. */
 return memcmp(data, p, length) == 0;

You can find the final code in CCAN (or on Github) including the benchmark code.

Finally, after about 4 hours of random yak shaving, it turns out lightning doesn’t even want to use memeqzero() any more!  Hopefully someone else will benefit.

RSS Feed

15 Comments for ccan/mem’s memeqzero iteration

Amit Shah | October 20, 2015 at 4:19 pm

Aren’t you missing a

p = data;

before the memcmp? ‘p’ would’ve advanced in that while loop..

Amit Shah | October 20, 2015 at 5:28 pm

Argh, no. The code’s correct.

The len and length confused me, but this is a nice use of overlaps!

Avi Kivity | October 21, 2015 at 3:59 pm

Which of these is faster will change with architecture, microarchitecture, and load (a more complex approach may win on long regions but lose on short regions with random alignments and lengths, due to branch mispredictions). The measurements will change depending on what you measure. It can even change due to the contents (if you optimize for fast-fail or for expected success).

If this function really matters for your load, you have to drop into assembly and code something that optimizes for the expected arguments. If not, I wouldn’t bother. But I don’t think an implementation can be both generic and very efficient.

Author comment by rusty | October 21, 2015 at 5:11 pm

That’s why the fall to memcmp, where this work has been done for you. I do intend to test on a few other platforms, though. Will report back…

Avi Kivity | October 22, 2015 at 3:17 am

Probably the fastest approach is to load aligned 16/32/64 byte chunks (depending on which SIMD extension you have), mask off the header or trailer, and compare against zero. It should be particularly easy/fast with AVX-512 where you have mask operations; you can probably do 128 bytes/cycle this way.

The memcmp approach sacrifices one of the load ports, and forces unaligned loads if you have 32+ byte SIMD (AVX-2+).

Paolo Bonzini | October 23, 2015 at 1:51 am

Avi, memcmp is already tuned in theory to use unaligned only if it matters. However, because _my_ bikeshed has a nicer color, I would look into optimizing the early check. On a big-endian machine, for example, if the block is made of pointers (e.g. a page in a virtual machine that you’re migrating :)) it may be that the early 2-3 bytes are zero.

If your machine supports unaligned loads, then, it’s worth aligning the length instead of the pointer, and letting memcmp align the pointer in the rare case:

const unsigned char *p = data;
unsigned long word;

while (__builtin_expect(length & (sizeof(word) – 1), 0)) {
if (*p)
return false;
p++;
length–;
}
while (__builtin_expect(length & (16 – sizeof(word)), 0)) {
memcpy(&word, p, sizeof(word));
if (word)
return false;
p += sizeof(word);
length -= sizeof(word);
}

/* Now we know that’s zero, memcmp with self. */
return length == 0 || memcmp(data, p, length) == 0;

This clocks a bit faster than your final version (7/25/93/4990 for yours, 8/9/60/4982 for mine). On an Intel Ivy Bridge, the times are pretty much the same for aligned and unaligned buffers.

My full benchmark code is at https://gist.githubusercontent.com/bonzini/9a95b0e02d1ceb60af9e/raw/b067c939991e4201156202c0b9498b944cc6da1f/gistfile1.txt

Paolo Bonzini | October 23, 2015 at 2:46 am

Of course there’s a bug in there (thanks Eric Blake). Fixed code remains at https://gist.github.com/bonzini/9a95b0e02d1ceb60af9e. Ugly, but consistently 30-40 cycles faster for lengths >= 16.

Avi Kivity | October 23, 2015 at 4:39 pm

If you use AVX-2 or AVX-3 and memcmp with a 16-byte offset, you’re forcing it to use unaligned loads on one of of the loads (and of course, issuing twice the number of loads compared to the minimum needed).

Pádraig Brady | October 24, 2015 at 1:44 am

@Paolo since your version mainly improves the overhead for small buffers, it’s worth avoiding the memcmp() function call overhead for the length=16 case? I.E. the following, which benchmarks much the same for values other than 16:

const unsigned char *p = data;
unsigned long word;

if (!length)
return true;

/* Check len bytes not aligned on a word. */
while (__builtin_expect(length & (sizeof(word) – 1), 0)) {
if (*p)
return false;
p++;
length–;
if (!length)
return true;
}

/* Check up to 16 bytes a word at a time. */
for (;;) {
memcpy(&word, p, sizeof(word));
if (word)
return false;
p += sizeof(word);
length -= sizeof(word);
if (!length)
return true;
if (__builtin_expect(length & 15, 0) == 0)
break;
}

/* Now we know that’s zero, memcmp with self. */
return memcmp(data, p, length) == 0;

Paolo Bonzini | October 24, 2015 at 10:51 am

Yes, of course Padraig’s version is better! (As long as you can do unaligned loads. Otherwise, just declare word as uint8_t and it will degenerate to Rusty’s version. Is there autoconf magic to detect platforms that do not support unaligned loads?).

Pádraig Brady | October 24, 2015 at 11:56 am

Note glibc defines _STRING_ARCH_unaligned

Victor Kaplansky | October 30, 2015 at 1:44 am

Actually the “naive” approach can beat memcmp() if you do it “right” way. See the results at the end of the comment.

I’ve tried to measure performance of the following two functions performing scanning for non zero value
using memcmp() trick and a “straightforward” loop:

$ cat chz_memcmp.c
#include

char
check_zero(char *buf, int len)
{
long *p = (long *)buf;
if (p[0] != 0 || p[1] != 0 || p[2] != 0 || p[3] != 0)
return 1;
if (p[4] != 0 || p[5] != 0 || p[6] != 0 || p[7] != 0)
return 1;

return memcmp(buf, buf + 8 * sizeof(long), len – 8 * sizeof(long));
}

vs:

$ cat chz_vect.c
char
check_zero(char *p, int len)
{
char res = 0;
int i;

for (i = 0; i < len; i++) {
res = res | p[i];
}

return res;
}

Looks innocent, right?
Here is a simple program to measure the performance of both:

$ cat test.c
#include

#define BUFLEN (100*1024*1024)

static char arr[BUFLEN];

char check_zero(char *p, int len);

int
main()
{
int i;
for (i = 0; i < 1000; i++) {
if (check_zero(arr, BUFLEN)) {

printf("Non zero !!!\n");
}
}
return 0;
}

Now, for the results:
$ gcc -O3 test.c chz_memcmp.c && time ./a.out
real 2.763 user 2.745 sys 0.009 pcpu 99.67

$ gcc -O3 test.c chz_vect.c && time ./a.out
real 3.639 user 3.630 sys 0.005 pcpu 99.91

memcmp way beats the straightforward loop, but the trick is to use auto-vectorizer
and unroll loops:

$ gcc -O3 –tree-vectorize –unroll-loops test.c chz_vect.c && time ./a.out
real 2.082 user 2.073 sys 0.006 pcpu 99.86

So, the very "naive" loop beats highly optimized memcmp()!

Actually you have to compile specially only chz_vect.c:

$ gcc -O3 –unroll-loops -c chz_vect.c && gcc -O0 -g test.c chz_vect.o && time ./a.out
real 2.079 user 2.067 sys 0.009 pcpu 99.85

Can you do faster :-) ?
— Victor

PS. auto-vectorizer is part of -O3, so it needs only a little help with loop unrolling.

Paolo Bonzini | November 3, 2015 at 7:23 am

Victor, what if the common case is that the argument is not all-zeros (but you still want it to be fast in case it _is_ all-zeros, or has a relatively long all-zero prefix)?

Pádraig Brady | November 6, 2015 at 7:57 pm

Well theoretically memcmp() has a harder job since it must impart order, whereas the requirement here is just boolean.

Michael S. Tsirkin | November 11, 2015 at 8:11 pm

Well QEMU actually does this when migrating guests, scanning dirty memory bitmap and migrating dirty memory.
What QEMU actually wants is an offset of the first non-zero bit though.
This pretty much rules out the memcmp trick, does it not?

«

»

Find it!

Theme Design by devolux.org

Tag Cloud