GCC and C vs C++ Speed, Measured.

With the imminent release of gcc 4.8, GCC has finally switched to C++ as the implementation language.  As usual, LWN has excellent coverage.  Those with long memories will remember Linux trying to use g++ back in 1992 and retreating in horror at the larger, slower code.  The main benefit was stricter typechecking, particularly for enums (a great idea: I had -Wstrict-enum patches for gcc about 12 years ago, which were a superset of the -Wenum-compare we have now, but I never got them merged).

With this in mind, and Ian Taylor’s bold assertion that “The C subset of C++ is as efficient as C”, I wanted to test what had changed with some actual measurements.  So I grabbed gcc 4.7.2 (the last release which could do this), and built it with C and C++ compilers:

  1. ../gcc-4.7.2/configure --prefix=/usr/local/gcc-c --disable-bootstrap --enable-languages=c,c++ --disable-multiarch --disable-multilib
  2. ../gcc-4.7.2/configure --prefix=/usr/local/gcc-cxx --disable-bootstrap --enable-languages=c,c++ --disable-multiarch --disable-multilib --enable-build-with-cxx

The C++-compiled binaries are slightly larger, though that’s mostly debug info:

  1. -rwxr-xr-x 3 rusty rusty 1886551 Mar 18 17:13 /usr/local/gcc-c/bin/gcc
    text       data        bss        dec        hex    filename
    552530       3752       6888     563170      897e2    /usr/local/gcc-c/bin/gcc
  2. -rwxr-xr-x 3 rusty rusty 1956593 Mar 18 17:13 /usr/local/gcc-cxx/bin/gcc
    text       data        bss        dec        hex    filename
    552731       3760       7176     563667      899d3    /usr/local/gcc-cxx/bin/gcc

Then I used them both to compile a clean Linux kernel 10 times:

  1. for i in `seq 10`; do time make -s CC=/usr/local/gcc-c/bin/gcc 2>/dev/null; make -s clean; done
  2. for i in `seq 10`; do time make -s CC=/usr/local/gcc-cxx/bin/gcc 2>/dev/null; make -s clean; done

Using stats --trim-outliers, which throws away the best and worst results, here are the times for the remaining 8 runs:

  1. real    14m24.359000-35.107000(25.1521+/-0.62)s
    user    12m50.468000-52.576000(50.912+/-0.23)s
    sys    1m24.921000-27.465000(25.795+/-0.31)s
  2. real    14m27.148000-29.635000(27.8895+/-0.78)s
    user    12m50.428000-52.852000(51.956+/-0.7)s
    sys    1m26.597000-29.274000(27.863+/-0.66)s

So the C++-compiled binaries are measurably slower, though not noticeably: it’s about 865 seconds vs 868 seconds, or about 0.3%.  Even if a kernel compile spends half its time linking, statting, etc., that’s under a 1% slowdown.

And it’s perfectly explicable by the larger executable size.  If we strip all the gcc binaries, and do another 10 runs of each (… flash forward to the next day.. oops, powerfail, make that 2 days later):

  1. real    14m24.659000-33.435000(26.1196+/-0.65)s
    user    12m50.032000-57.701000(50.9755+/-0.36)s
    sys    1m26.057000-28.406000(26.863+/-0.36)s
  2. real    14m26.811000-29.284000(27.1308+/-0.17)s
    user    12m51.428000-52.696000(52.156+/-0.39)s
    sys    1m26.157000-27.973000(26.869+/-0.41)s

Now the difference is 0.1%, pretty much in the noise.

Summary: whether you like C++ or not, the performance argument is moot.

Looking forward to linux.conf.au 2013

This year’s organizers took specific pains to attract deep content, and the schedule reflects that: there are very few slots where I’m not torn between two topics.  This will be great fun!

After a little introspection, I did not submit a talk this year.  My work in 2012 was with Linaro helping with KVM on ARM: that topic is better addressed by Christoffer Dall, so I convinced him to submit (unfortunately, he withdrew as January became an untenable time for him to travel).  My other coding work was incremental, not revolutionary: neither module signatures, CCAN nor ntdb shook the ground this year.  There just wasn’t anything I was excited about: a reliable litmus test.

See you at LCA!

Fixed-length semi-lockless queues revisited

There were some great comments on my previous post, both in comments here and on the Google Plus post which points to it.  I’d like to address the points here, now that I’ve had a few moments to do follow-up work.

One anonymous commenter, as well as Stephen Hemminger via email, pointed to the existing lockless queue code in liburcu.  I had actually waded through this before (I say waded, because it’s wrapped in a few layers which I find annoying; there’s a reason I write little CCAN modules).  It’s clever and genuinely lockless; my hat off to the authors, but it only works in conjunction with RCU.  In particular, it’s an unlimited-length queue which uses a dummy element to avoid ever being empty, and it relies on the fact that it can safely traverse the ‘->next’ entry even as an element is being dequeued, because the RCU rules say you can’t alter that field or free the object until later.

Stephen also pointed me to Kip Macy’s buf_ring from FreeBSD; it uses two producer counters, prod_head and prod_tail.  The consumer looks at prod_tail as usual; the producers compare-and-swap to increment prod_head, then place their element, then wait for prod_tail to catch up with prod_head before incrementing prod_tail.  Reimplementing this in my code showed it to be slower than the lower-bit-to-lock case for my benchmarks, though not by much (the main difference is in the single-producer-using-multiple-producer-safe-routines cases, which are the first three benchmarks).  I ignored the buf_ring consumer, which uses a similar two-counter scheme for consumers that is only useful for debugging, and used the same consumer code as before.
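For concreteness, the two-counter producer scheme can be sketched like this (this is my reconstruction in C11 atomics, not the FreeBSD code; the struct and function names are mine):

```c
#include <stdatomic.h>

#define RING_SIZE 32

struct ring {
    _Atomic unsigned int prod_head;  /* next slot a producer will claim */
    _Atomic unsigned int prod_tail;  /* slots fully published so far */
    _Atomic unsigned int cons_tail;  /* consumer position */
    void *elems[RING_SIZE];
};

/* Enqueue one element; returns 0 on success, -1 if full.
 * Counters are free-running, so "used" is simply head - cons_tail. */
static int ring_enqueue(struct ring *r, void *elem)
{
    unsigned int head, next;

    do {
        head = atomic_load(&r->prod_head);
        if (head - atomic_load(&r->cons_tail) >= RING_SIZE)
            return -1;                  /* full (racy check: a sketch!) */
        next = head + 1;
        /* Claim our slot by advancing prod_head. */
    } while (!atomic_compare_exchange_weak(&r->prod_head, &head, next));

    /* Place the element in the claimed slot. */
    r->elems[head % RING_SIZE] = elem;

    /* Wait for earlier producers to publish, then publish ours. */
    while (atomic_load(&r->prod_tail) != head)
        ;                               /* spin */
    atomic_store(&r->prod_tail, next);
    return 0;
}
```

The claim-then-publish sequence is the interesting part: an element only becomes visible to the consumer once every earlier producer has published theirs.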

Arjan van de Ven makes several excellent points.  Firstly, that transaction-style features may allow efficient lock elision in upcoming Intel CPUs (and, of course, PowerPC has announced transactional memory support for Power8), so we’ll have to revisit this in a few years when that hardware reaches me.

His more immediate point is that uncontended locks are really cheap on recent CPUs; cheaper than cache-hot compare-and-swap operations.  All the benchmarks I did involve everyone banging on the queue all the time, so I’m only measuring the contended cases.  So I hacked my benchmarks to allow for “0 consumers” by having the producer discard all the queue contents every time it filled, and similarly by filling the queue with junk when it’s empty for a “0 producers” benchmark.

Here we can see that the dumb, single lock comes into its own, being twice as fast as my optimal-when-contended version.  If we just consider the common case of a single writer and a single reader, the lockless implementation takes 24ns in the contended case and 14ns in the uncontended case, whereas the naive locked implementation takes 107ns in the contended case and 7ns in the uncontended case.  In other words, you’d have to be uncontended over 90% of the time for the lock to win.  That can’t happen in a naive implementation which wakes the consumer as soon as the first item has been inserted into the queue (and if you implement a batch version of queue_insert, the atomic exchange gets amortized, so it gets even harder to beat).
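The 90% figure falls out of a little arithmetic on the four numbers above (the macro and function names here are mine, just for illustration):

```c
/* Benchmarked costs in nanoseconds, as quoted in the text. */
#define LOCKED_CONTENDED        107.0
#define LOCKED_UNCONTENDED        7.0
#define LOCKLESS_CONTENDED       24.0
#define LOCKLESS_UNCONTENDED     14.0

/* Expected cost is uncontended*p + contended*(1-p); solving
 * locked(p) == lockless(p) for the uncontended fraction p gives
 * p = 83/90, i.e. about 92%. */
static double break_even(void)
{
    double contended_gain = LOCKED_CONTENDED - LOCKLESS_CONTENDED;       /* 83 */
    double uncontended_loss = LOCKLESS_UNCONTENDED - LOCKED_UNCONTENDED; /*  7 */

    return contended_gain / (contended_gain + uncontended_loss);
}
```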

For the moment, I’m sticking with the previous winner; there’s still much to do to turn it into a usable API.

Fixed-length semi-lockless queues…

One of my vacation projects was to look at a good queue implementation for ccan/antithread.  I read a few papers, which mainly deal with generic linked-list-style queues (I got quite excited about one before I realized that it needed a 128-bit compare-and-swap on 64-bit machines).  I only really need a fixed-length queue of void *, so I set about implementing one.

You can find the cleaned-up version of my explorations on github.  For my implementation I use a tail counter, 32 void * entries, and a head counter, like so:

#define QUEUE_ELEMS 32
struct queue {
    unsigned int head;
    unsigned int prod_waiting;
    unsigned int lock;
    void *elems[QUEUE_ELEMS];
    unsigned int tail;
    unsigned int cons_waiting;
};

The head and tail counters are free running to avoid the empty-or-full problem, and the prod_waiting and cons_waiting are for a future implementation which actually does sleep and wakeup (I spin for my current tests).
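As a sketch of what “free running to avoid the empty-or-full problem” means (the helper names are mine):

```c
#define QUEUE_ELEMS 32

/* With free-running counters, empty and full are distinguishable even
 * though head % QUEUE_ELEMS == tail % QUEUE_ELEMS in both cases:
 * unsigned subtraction gives the number of queued elements, and
 * counter wraparound is handled for free. */
static unsigned int queue_used(unsigned int head, unsigned int tail)
{
    return head - tail;
}

static int queue_empty(unsigned int head, unsigned int tail)
{
    return queue_used(head, tail) == 0;
}

static int queue_full(unsigned int head, unsigned int tail)
{
    return queue_used(head, tail) == QUEUE_ELEMS;
}
```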

The simplest implementation is for both producers and consumers to grab the lock, do their work, then drop the lock.  On my 32-bit x86 laptop (dual core, 2 hyperthreads per core), with 1 producer on cpu0 and 1 consumer on cpu1 (ie. two hyperthreads of the same core), it takes about 179 ns to enqueue and dequeue each element (but hugely variable, from 73 to 439 ns).  You can see that (as expected) the 2 and 3 producer cases are quite expensive, though not so bad if there are 2 producers and 2 consumers.

Lockless dequeue is quite easy:

  1. Read the tail counter, then the head counter (order matters!).
  2. If the queue is empty, wait until the head changes.
  3. Grab entry[tail % 32].
  4. Try to compare and swap the tail to tail+1.  If that fails, we raced, so goto 1.
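Those four steps look something like this in C (my sketch using GCC’s __sync builtins, not the cleaned-up code on github; a real version would sleep rather than spin when empty):

```c
#define QUEUE_ELEMS 32
struct queue {
    unsigned int head;
    unsigned int prod_waiting;
    unsigned int lock;
    void *elems[QUEUE_ELEMS];
    unsigned int tail;
    unsigned int cons_waiting;
};

static void *queue_remove(struct queue *q)
{
    unsigned int tail, head;
    void *elem;

    do {
        /* 1. Read tail, then head: the order matters. */
        tail = q->tail;
        __sync_synchronize();
        head = q->head;

        /* 2. Empty?  Wait for head to move. */
        while (head == tail) {
            __sync_synchronize();
            head = q->head;
        }

        /* 3. Grab the entry. */
        elem = q->elems[tail % QUEUE_ELEMS];

        /* 4. Claim it; if another consumer beat us, retry. */
    } while (!__sync_bool_compare_and_swap(&q->tail, tail, tail + 1));

    return elem;
}
```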

But lockless insert is harder, so I asked Paul McKenney who detailed a fairly complex scheme involving two offsets and some subtlety on both production and consumption side, and ended with “Now, are you -sure- locking is all that bad?  ;-)”.  So I went for a big lock around insertion to begin with.  It’s generally a little better, particularly for the common case of a single consumer and a single producer.

It’s worth noting that if you know you’re the only producer, you can skip the locks, so I re-ran the benchmarks with a “queue_add_excl” implementation for the single-producer cases, as seen on the right.

You can similarly simplify the single consumer case, though it makes little difference in my tests.

However, you can do better than a straight naive lock: you can use the lower bit of the head counter to exclude other producers.  This means a production algorithm like so:

  1. Read head.  If it’s odd, wait.
  2. Read tail.
  3. If queue is full, wait for tail to change, then goto 1.
  4. Compare and swap head to head + 1; if it fails, go to 1.
  5. Store the element.
  6. Then increment the head.

For simplicity, I made the tail counter increment by 2 as well, and the consumer simply ignores the bottom bit of the head counter.  Avoiding a separate atomic operation on a “prod_lock” word seems to pay off quite well.
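Putting the six steps and the increment-by-2 convention together, the producer sketch looks like this (again my reconstruction with GCC builtins, spinning where real code would sleep):

```c
#define QUEUE_ELEMS 32
struct queue {
    unsigned int head;
    unsigned int prod_waiting;
    unsigned int lock;
    void *elems[QUEUE_ELEMS];
    unsigned int tail;
    unsigned int cons_waiting;
};

/* The low bit of head is the producer lock; both counters advance by 2
 * per element, so the slot index is head / 2. */
static void queue_insert(struct queue *q, void *elem)
{
    unsigned int head, tail;

    for (;;) {
        /* 1. If head is odd, another producer is mid-insert. */
        head = q->head;
        if (head & 1)
            continue;

        /* 2 & 3. If the queue is full, wait for tail to change. */
        tail = q->tail;
        if (head - tail >= QUEUE_ELEMS * 2)
            continue;

        /* 4. Take the lock by making head odd. */
        if (__sync_bool_compare_and_swap(&q->head, head, head + 1))
            break;
    }

    /* 5. Store the element. */
    q->elems[(head / 2) % QUEUE_ELEMS] = elem;

    /* 6. Publish and unlock by completing the increment. */
    __sync_synchronize();
    q->head = head + 2;
}
```

Note that only step 4 needs an atomic operation; step 6 is a plain store, since we hold the “lock” at that point.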

Finally, it’s worth noting that neither the exclusive producer nor exclusive consumer cases win much any more, so I can delete those altogether.

Before tying this into antithread, there are several things to do:

  1. Re-audit to make sure the barriers are correct.
  2. Test on PowerPC (always good for finding missing barriers).
  3. Add in a decent notification mechanism, ie. futexes or falling back to pipes.

And that’s next…

Kernel Compilation Times

David S. Miller complains that CONFIG_MODULE_SIG slows down builds, and he does hundreds of allmodconfig builds every day.

This complaint falls apart fairly quickly in the comments: he knows he can simply turn it off, but what about the others he simply tells to test with allmodconfig?  One presumes they are not doing hundreds of kernel builds a day.

linux-next had the same issue, and a similar complaint; I had less sympathy when I suggested they might also want to turn off CONFIG_DEBUG_INFO if they were worried about compile speed, and indeed found out Stephen already did.  Now they turn off CONFIG_MODULE_SIG, too.

Here are some compile times on my i386 laptop, using v3.7-rc1-1-g854e98a as I turn options off:

  • allmodconfig: 52 minutes
  • … without CONFIG_MODULE_SIG: 45 minutes
  • … without CONFIG_DEBUG_INFO: 40 minutes
  • … without CONFIG_KALLSYMS: 37 minutes
  • … using -O1 instead of -Os: 24 minutes (not a good idea, since we lose CONFIG_DEBUG_STRICT_USER_COPY_CHECKS).

In summary, the real problem is that people don’t really want ‘allmodconfig’.  They want something which would compile a kernel with as much coverage as possible with no intention of actually booting it; say ‘allfastconfig’?

Latinoware 2012

I’m keynoting at http://latinoware.org in Brazil in two weeks (assuming I get my visa in time!).  Looking forward to my first trip to South America, as well as delivering a remix of some of my favourite general hacking talks.  And of course, catching up with maddog!

What Can I Do To Help?

Enthusiasm is a shockingly rare resource, anywhere.  The reason enthusiasm is a rare resource is that it’s fragile; I’ve seen potentially-great ideas abandoned because the initial response was a litany of reasons why they won’t work.  It’s not the criticism which kills, it’s the scorn.

So when someone emails or approaches you with something they’re excited about, please reply thinking “What can I do to help?”  Often I just provide an encouraging and thoughtful response: a perfectly acceptable minimum commitment.  If you offer pointers or advice, take extra care to fan that delicate flutter of enthusiasm without extinguishing it. Other forces will usually take care of that soon enough, but let it not be you.

FAQ: CCAN as a shared library?

This was asked of me again, by Adam Conrad of Canonical: “Why isn’t CCAN a shared library?”.  Distributions prefer shared libraries, for simplicity of updating (especially for security fixes), so I thought I’d answer canonically, once.

  • Most CCAN modules are small; many are just headers.
  • You can’t librify something which doesn’t have a stable API or ABI.
  • CCAN’s alternative is not a library, it’s cut-n-paste.

To illustrate what I mean, consider ccan/hash: it’s a convenient wrapper around the Jenkins lookup3 code.  It could be a library, but in practice it’s not.  For something as simple as that, people cut & paste the lookup3 code into their own source.  It already exists in two places in Samba, for example.  It’s this level of code snippet which is served beautifully by CCAN: you drop it in a ccan/ dir in your project and you get nice, documented and tested code, with optional updates if you want them later.

You could still make ccan/hash into a shared library.  But if the upstream doesn’t do it for you, you have to check the ABI and update the version number every time it changes.  This, unfortunately, means you can no longer share it: if library A uses version 11 and library B uses version 12, you can’t link against both library A and library B.  Yet there’s nothing wrong with either: you have to change them because you librified it.

This kind of pain isn’t worth it for small snippets of code, so people cut & paste instead, and that makes our code worse, not better.  That’s what CCAN tries to fix.

Now, there may one day be modules which could be shared libraries: that’s a good thing, if the maintainer is prepared to support the ABI and API.  I’m not going to kick a module out of CCAN for being too successful.  But I’d like to explicitly label such a module, and make sure ccanlint does the appropriate checks for ABI compatibility and symbol hiding.

1 Week to Go, and Rusty Goes Offline

Just as the Linux kernel merge window closes, I’m going offline.  My wedding is exactly a week away, but I’ll be entertaining guests and doing final preparation.  I’ll be back from our honeymoon and wading through mail on 7 May.

Alex’s “A Bald Target” campaign to raise awareness for TimeForKids has been a huge success, even though we’re currently far short of the hair-shaving goal.  She’s been on one of the local radio stations, with newspaper coverage expected this weekend; two local TV stations want to cover the actual shave if it happens.  The charity is delighted with the amount of publicity they have received; given that they need local people to volunteer to mentor the disadvantaged children, that’s worth at least as much as the money.

Special thanks to a couple of people who donated direct to the charity, to avoid causing baldness!  And yes, if we were starting again, having competing “shave” vs “save” campaigns would have been awesome…

Sources of Randomness for Userspace

I’ve been thinking about a new CCAN module for getting a random seed.  Clearly, /dev/urandom is your friend here: on Ubuntu and other distributions it’s saved and restored across reboots, but traditionally server systems have lacked sources of entropy, so it’s worth thinking about other sources of randomness.  Assume for a moment that we mix them well, so any non-randomness is irrelevant.

There are three obvious classes of randomness: things about the particular machine we’re on, things about the particular boot of the machine we’re on, and things which will vary every time we ask.

The Machine We’re On

Of course, much of this is guessable if someone has physical access to the box or knows something about the vendor or the owner, but it might be worth seeding this into /dev/urandom at install time.

On Linux, we can look in /proc/cpuinfo for some sources of machine info: for the 13 x86 machines my friends on IRC had in easy reach, we get three distinct values for cpu cores, three for siblings, two for cpu family, eight for model, six for cache size, and twelve for cpu MHz.  These values are obviously somewhat correlated, but it’s a fair guess that we can get 8 bits here.

Ethernet addresses are unique, so I think it’s fair to say there’s at least another 8 bits of entropy there, though often devices have consecutive numbers if they’re from the same vendor, so this doesn’t just multiply by number of NICs.

The amount of RAM in the machine is worth another two bits, and the other kinds of devices (eg. trolling /sys/devices) can be expected to give another few bits, even in machines which have fairly standard hardware, like laptops.  Alternately, we could get this information indirectly by looking at /proc/modules.

Installed software gives a maximum of three bits, since we can assume a recent version of a mainstream distribution.  Package listings can also be fairly standard, but most people install some extra things, so we might assume a few more bits here.  Ubuntu systems ask for your name to base the system name on, so there might be a few bits there (though my laptop is predictably “rusty-x201”).

So, let’s have a guess at 8 + 7 + 2 + 3 + 3 + 2 + 2, ie. 27 bits from the machine configuration itself.

Information About This Boot

I created an upstart script to reboot (and had to hack grub.conf so it wouldn’t set the timeout to -1 for next boot), and let it loop for a day: just under 2000 times in all. I eyeballed the graphs of each stat I gathered against each other, and there didn’t seem to be any surprising correlations.   /proc/uptime gives a fairly uniform range of uptime values within a range of 1 second, at least 6 bits there (every few dozen boots we get an fsck, which gives a different range of values, but the same amount of noise).  /proc/loadavg is pretty constant, unfortunately.  bogomips on CPU1 was fairly constant, but for the boot CPU it looks like a standard distribution within 1 bogomip, in increments of 0.01: say another 7 bits there.

So for each boot we can extract 13 bits from uptime and /proc/cpuinfo.

Things Which Change Every Time We Run

The pid of our process will change every time we’re run, even when started at boot.  My pid was fairly evenly distributed over the values between 1220 and 1260, so there are about five bits there.  Unfortunately on both 64 and 32-bit Ubuntu, pids are restricted to 32768 by default.

We can get several more bits from simply timing the other randomness operations.  Modern machines have so much going on that you can probably count on four or five bits of unpredictability over the time you gather these stats.

So another 9 bits every time our process runs, even if it’s run from a boot script or cron.


Summary

We can get about 50 bits of randomness without really trying too hard, which is fine for a random server on the internet facing a remote attacker without any inside knowledge, but only about five of these bits (from the process’ own timing) would be unknown to an attacker who has access to the box itself.  So /dev/urandom is still very useful.

On a related note, Paul McKenney pointed me to a paper (abstract, presentation, paper) indicating that even disabling interrupts and running a few instructions gives an unpredictable value in the TSC, and inserting a usleep can make quite a good random number generator.  So if you have access to a high-speed, high-precision timing method, this may itself be sufficient.