More Realistic Hashing: Cache Sensitivity Part II

So my previous post showed a trick for stuffing extra bits into the pointers of a simple hash table.  In most real situations, we want the hash to still be useful as the number of elements increases: the simple hash I used will slow down dramatically as we pass 75% utilization.  In addition, we want to be able to delete entries, which can be slightly painful in a linear hash.

Since then, I’ve been trying a few different approaches: in fact, I spent quite a while down this rabbit hole before digging myself out.  Here are the approaches I benchmarked (on a 64-bit system):

  1. Doubling: Linear hash with extra hash bits in the pointers, doubling and rehashing everything when it hits > 5/8 full (a rough sketch of this follows the list).
  2. Cacheline Tree: Divide the hash into 128-byte groups.  Do linear collision resolution within each group, and when a group fills, explode it into 16 pointers to separate hash tables of N 128-byte groups.  When that fills, explode again, etc.  In practice, N was 2.
  3. Cacheline Tree Then Linked:  Explode the top level like the above, but allocate a single cacheline (N=1) for the second level and just fill the second level from the first bucket onwards.  If that fills, simply chain on another cacheline.  This actually freed up memory on delete as chains were emptied.
  4. Cacheline Linked: Even simpler: when the toplevel cacheline fills, just chain another cacheline off it.  Each cacheline is simply filled in order, but we only put 15 pointers in each cacheline and use the extra 4 bits per entry for additional hash bits (to reduce comparisons).  This also freed cachelines on delete.
  5. Chaining: Doubly-linked list through each object: when a bucket fills we simply append the new entry to the existing ones.
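
For concreteness, here's a rough sketch of the growth policy in approach 1.  The names and the choice of which hash bits to stash are mine, not from the benchmark code; it assumes 8-byte-aligned objects so the low 3 pointer bits are free:

    #include <stdint.h>
    #include <stdlib.h>

    struct ht {
        uintptr_t *table;   /* object pointer, low bits reused for extra hash bits */
        size_t size;        /* always a power of two */
        size_t count;
    };

    static void ht_add(struct ht *ht, const void *obj, uint64_t hash)
    {
        size_t i = hash & (ht->size - 1);

        while (ht->table[i])                    /* linear collision resolution */
            i = (i + 1) & (ht->size - 1);
        /* Stash 3 extra hash bits in the low bits of the aligned pointer
         * (the trick from the previous post). */
        ht->table[i] = (uintptr_t)obj | ((hash >> 61) & 7);
        ht->count++;
    }

    /* Approach 1: once the table is more than 5/8 full, double it and
     * rehash every entry into the new array. */
    static void ht_maybe_double(struct ht *ht, uint64_t (*hashfn)(const void *))
    {
        uintptr_t *old = ht->table;
        size_t i, oldsize = ht->size;

        if (ht->count * 8 <= ht->size * 5)
            return;

        ht->size *= 2;
        ht->count = 0;
        ht->table = calloc(ht->size, sizeof(*ht->table));
        for (i = 0; i < oldsize; i++) {
            if (!old[i])
                continue;
            void *obj = (void *)(old[i] & ~(uintptr_t)7);
            ht_add(ht, obj, hashfn(obj));
        }
        free(old);
    }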

Then I benchmarked a 2M-bucket (actually 2097152) hashtable of each type, increasing the number of entries by a factor of 2 at a time, from 1M.  The good thing about visiting the Canberra office is that I could use one of the older Power machines here with gobs of memory.

The first three tests were very similar to the last post, except I used an array of scrambled pointers so I wasn’t accessing the objects in memory order (a shuffle sketch follows the list below).  The last two are new:

  1. Insert all the objects into the hash.
  2. Find all the objects in the hash.
  3. Do the same number of missed searches in the hash.
  4. Delete half the objects, then time reinserting them (steady-state test which avoids any allocation penalties for expansion).
  5. Delete every object from the hash.
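
The scrambling itself is just a shuffle of the object-pointer array before each pass; something like this minimal sketch (not the actual benchmark code):

    #include <stdlib.h>

    /* Fisher-Yates shuffle of the object pointer array, so the find and
     * delete passes don't walk the objects in allocation (memory) order. */
    static void scramble(void **objs, size_t n)
    {
        size_t i;

        for (i = n - 1; i > 0; i--) {
            size_t j = random() % (i + 1);  /* good enough for a benchmark */
            void *tmp = objs[i];
            objs[i] = objs[j];
            objs[j] = tmp;
        }
    }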

Insert

Our simple link of cachelines (CLT Linked) gets so slow I took it out of the benchmark after 8x overload. Chained is the winner: we expect it to modify the inserted entry, the hash bucket and the current top entry for a total of three cache misses (and the object being inserted would generally be hot). It also never needs to (re)alloc or rehash.
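
To make the three-miss argument concrete, here's a sketch of a chained insert at the head of a per-bucket doubly-linked list (field and function names are mine):

    #include <stdint.h>
    #include <stddef.h>

    struct obj {
        struct obj *next, *prev;        /* chain links embedded in the object */
        /* ... key and payload ... */
    };

    struct chain_ht {
        struct obj **buckets;
        size_t size;                    /* power of two */
    };

    static void chain_insert(struct chain_ht *ht, struct obj *obj, uint64_t hash)
    {
        struct obj **head = &ht->buckets[hash & (ht->size - 1)];

        obj->next = *head;              /* the new entry itself (usually hot) */
        obj->prev = NULL;
        if (*head)
            (*head)->prev = obj;        /* touches the current top entry */
        *head = obj;                    /* touches the hash bucket */
    }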

We can see the CLT-then-linked case getting worse as we fall into the chain and have to linearly search for a slot to place the entry in; there are 16 times as many of these chains as in the dumb “link of cache lines” case, because there is a chain off each of the 16 pointers of each exploded toplevel entry, rather than one off each cacheline.

If we look at reinserts, we can remove the realloc and rehash overhead:

In steady-state we can see the others are more competitive, though even our flat linear hash loses a little as it scans for the next NULL bucket.

Memory Usage


The optimal reasonable case would be one pointer per object; 8 bytes. The two linked cases quickly approach this (at which point they suck). The chaining case uses two pointers per object; doubling uses 2 pointers for these values too (it can do slightly better just before it doubles, and worse just afterwards).

The real story is the cacheline tree. It hits a tipping point as buckets fill: that 512-byte second level off each top level pointer means we can spend 131 bytes per object just after we fill the top level. At 8 million objects in our 1 million buckets we are in fact spending 131 bytes per object (vs 17 for the chained case).

This is why I reduced the second level hash size to two cachelines; and we still get that factor of 16 explosion on each expansion. In fact, the chaining hashes were an attempt to find a “good enough” variant which didn’t explode like this. The chaining variant (CLT then linked) explodes once, to 65 bytes per entry, but then expands linearly.

Lookups

This is where the cache damage of chaining becomes apparent. Doubling performs best, and the tree is close behind. Even a linked list of cachelines beats chaining: with 15 entries per cacheline, on average 13 of those will not need to be examined (since the lower pointer bits, where we’ve stashed another 3 hash bits, will only accidentally match for two of them). This means 1/5 of the cache misses of the chained case for really large overload values.

By 64 entries per bucket (128M objects) chaining is 16 times slower than doubling for a successful lookup, and 20 times slower for an unsuccessful one. Even at 8 per bucket, it matches our terrible chained case at being 3 times slower than the linear case.

Since I assume most hashes are lookup heavy, this result should be weighted more than add or delete.

Delete

Chained delete is cheap: you might get three cache misses (one for the object you’re deleting, and one each for the next and prev objects in the linked list) but you don’t need to do a hash lookup at all. The linear hash has a cache miss to access the object, then a cache miss to do the delete, but then it has to rehash any objects in the immediately following hash buckets (roughly as sketched below). It’s 2.2 times slower at 64x overload. We could use a special “deleted” marker value to avoid this.
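
Here's roughly what that rehash-the-following-run step looks like for the linear case, reusing the hypothetical struct ht layout from the sketch near the top of this post (the “deleted” marker alternative would replace all of this with a single store):

    /* After emptying slot i, every entry in the contiguous run that follows
     * may need to move back into the hole; otherwise later lookups would
     * stop at the empty slot and miss them. */
    static void ht_delete_at(struct ht *ht, size_t i,
                             uint64_t (*hashfn)(const void *))
    {
        size_t mask = ht->size - 1;

        ht->table[i] = 0;
        ht->count--;

        for (size_t j = (i + 1) & mask; ht->table[j]; j = (j + 1) & mask) {
            void *obj = (void *)(ht->table[j] & ~(uintptr_t)7);
            size_t home = hashfn(obj) & mask;

            /* Move the entry back if the empty slot i lies between its home
             * bucket and its current position. */
            if (((j - home) & mask) >= ((j - i) & mask)) {
                ht->table[i] = ht->table[j];
                ht->table[j] = 0;
                i = j;
            }
        }
    }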

Cacheline tree does OK, but still needs to hash and similarly reseat any “runs” within the bottom cacheline. At three levels, add cache misses for the second and third levels. It’s 3.1 times slower than chained at 64x overload.

Conclusion

I’d love to spend more time refining these results. In particular, there are ways to steal more bits to store extra hashing, ways to avoid reseating entries on linear deletes, and ways to avoid rehashing when doubling the linear case. Compression of pointers is also possible, but it gets messy since there is no longer a fixed number of entries per cacheline. I might get time for some of this, but at some point you have to get on with less abstract work!

But the takeaway is this: it’s hard to beat the simplicity of a big hash array which doubles and stashes a few extra hash bits. If your workload is add/delete heavy, use chaining. If you are worried about unknown workload sizes, use doubling. Or do what Perl does, and do both (though you can’t use the extra hash-bit stashing with chaining).

The cache tree idea is not best at anything, unless you can’t double for some reason (the Linux kernel probably has this issue, and TDB may or may not).

Hash Tables: In A Cache-Sensitive World? (FIXED)

(FIXED: my “NoMatched” bench was bogus, and actually the same as “Matched” case. Graphs changed, but conclusions the same.)
Everyone loves hash tables; certainly the networking code in the kernel uses them heavily and I do too.  But with more work on TDB (which is basically a hash table in a file), I’ve been re-measuring some of the fundamentals.

We all know that hash tables are cache-unfriendly; a good hash table means you’re hitting a “random” bucket every time.  Back in my university days this was never mentioned (caches were new and fancy things); the optimal hash table was seen as a prime-numbered sequence of buckets, using a primary hash then a secondary hash to jump to the next bucket if this one was full.  This gives fewer probes than a naive linear fallback (“if this bucket is full, go to the next one”), but does that outweigh the cache unfriendliness of the approach?
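
That old-school scheme is essentially double hashing; a minimal sketch, assuming a prime table size (the helper name is mine):

    #include <stdint.h>
    #include <stddef.h>

    /* Double hashing: the primary hash picks the bucket; the secondary hash
     * gives the stride used to jump to another bucket on each collision.
     * With a prime table size, any stride in [1, size-1] visits every bucket. */
    static size_t probe(size_t size /* prime */, uint64_t h1, uint64_t h2,
                        unsigned int attempt)
    {
        size_t stride = 1 + (h2 % (size - 1));
        return (h1 + (uint64_t)attempt * stride) % size;
    }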

These days it’s normal to use a fast universal hash and a power-of-two for the hash table size (Perl 5 and the Linux kernel do this, though Perl uses another weaker hash also by Jenkins).  Also, we deal with hash collisions by chaining off entries rather than using other hash buckets.  This has two main advantages: (1) it doesn’t barf when the hash table fills up, and (2) you can easily delete an entry from the hash table.  Unfortunately, this is pretty cache-unfriendly too, compared to a linear bucket fallback, but can we measure it?

To this end, I wrote a simple example program which filled each kind of hash table (linear hash bucket, secondary hash bucket and chained hash buckets) to 50%, then ran successful lookups, then ran failed lookups.  This was on a 64-bit machine, and I chose 2^27 buckets, giving a 1GB hash table.  While most hashes are not this large, it’s useful to approximate a situation where other things are also using the cache.

This result surprised me, so I fiddled a bit and the basic shape remained; the differences vanished if the table was (say) only 10% full, but even at 25% we see chaining win. Here it’s about 15% faster on matched lookups (5% faster on misses).

What does this mean? It looks like linear edges out secondary for insertion, but for lookup it’s a wash.  But both are overshadowed by chaining, which doesn’t penalize other buckets when there’s a hash clash.  Despite the fact that objects are one pointer larger in the chaining case, it still wins (the test code uses two 64-bit values as the basic object).

But Now For The Twist…

The reason I was looking at old-school non-chained hashing schemes was because I have a new trick for them: stashing some hash bits in the bottom bits of the pointer.  This allows you to skip the dereference 7/8 times (assuming 64-bit-aligned pointers).
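
A minimal sketch of the trick, assuming the objects are at least 8-byte aligned so the bottom 3 bits of every stored pointer are free (which top-of-hash bits get stashed, and the helper names, are my choices):

    #include <stdint.h>
    #include <stdbool.h>

    #define PTR_BITS 3                  /* 64-bit-aligned pointers: 3 spare low bits */
    #define PTR_MASK ((uintptr_t)((1 << PTR_BITS) - 1))

    /* Store: fold a few otherwise-unused hash bits into the pointer. */
    static uintptr_t stash(const void *obj, uint64_t hash)
    {
        return (uintptr_t)obj | ((hash >> 61) & PTR_MASK);
    }

    /* Lookup: compare the stashed bits first; 7/8 of non-matching entries
     * can be rejected without dereferencing (and cache-missing on) the object. */
    static bool maybe_match(uintptr_t entry, uint64_t hash)
    {
        return (entry & PTR_MASK) == ((hash >> 61) & PTR_MASK);
    }

    static void *entry_ptr(uintptr_t entry)
    {
        return (void *)(entry & ~PTR_MASK);
    }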

So, we add that trick to the linear case:

Now we can see the potential of not chaining through hash entries! It’s a full 25% faster than the normal linear case, and 10% faster than chaining (almost 20% faster in the missed case). Of course, for real life we need to handle delete and hash table filling. And that’s the subject of my next post, once I’ve finished coding it!

Linux Users of Victoria: Tuesday 6th April

Several months ago Donna Benjamin convinced me to speak at LUV, and now the time is approaching.  My intent is to give a talk about bits and pieces, and rely on questions to see what the audience is interested in.  Pretty casual and should be heaps of fun if you’re nearby.

I will also be taking the chance to drop by Lonely Planet; one of the places we stayed in Lithuania had seen an influx of foreign tourists since being featured in their guide, so they forced a note and a bottle of their vodka on us to take to LP.  Naturally, we Australians all know each other, so I couldn’t turn them down :)

Followup: lrzip

Mikael noted in my previous post that Con Kolivas’s lrzip is another interesting compressor.  In fact, Con has already done a simple 64-bit enhancement of rzip for lrzip, and on our example file it gets 56M vs 55M for xz (lrzip in raw mode, followed by xz, gives a result 100k worse than just using lrzip: lrzip already uses lzma).

Assuming no bugs in rzip, the takeaway here is simple: rzip should not attempt to find matches within the range the backend compressor already covers (900k for bzip2 in rzip, 32k for gzip, megabytes for LZMA as used by lrzip).  The backend compressor will do a better job (as shown by similar results with lrzip when I increase the hash array size so it finds more matches: the resulting file is larger).

The rzip algorithm is good at finding matches over huge distances, and that is what it should stick to.  Huge here == size of file (rzip does not stream, for this reason).  And this implies only worrying about large matches over huge distances (the current 32 byte minimum is probably too small).  The current version of rzip uses an mmap window so it never has to seek, but this window is artificially limited to 900MB (or 60% of mem in lrzip).   If we carefully limit the number of comparisons with previous parts of the file, we may be able to reduce them to the point where we don’t confuse the readahead algorithms and thus get nice performance (fadvise may help here too) whether we are mmaped or seeking.

I like the idea that rzip should scale with the size of the file being compressed, not make assumptions about today’s memory sizes.  Though some kind of thrash detection using mincore might be necessary to avoid killing our dumb mm systems :(

xz vs rzip

As the kernel archive debates replacing .bz2 files with .xz, I took a brief glance at xz. My test was to take a tarball of the linux kernel source (made from a recent git tree, but excluding the .git directory):

     linux.2.6.tar 395M

For a comparison, bzip2 -9, rzip -9 (which uses bzip2 after finding distant matches), and xz:

     linux.2.6.tar.bz2 67M
     linux.2.6.tar.rz 65M
     linux.2.6.tar.xz 55M

So, I hacked rzip with a -R option to output non-bzip’d blocks:

     linux.2.6.tar.rawrz 269M

Xz on this file simulates what would happen if rzip used xz instead of libbz2:

     linux.2.6.tar.rawrz.xz 57M

Hmm, it makes xz worse!  OK, what if we rev up the conservative rzip to use 1G of memory rather than its 128M max?  And then xz that?

     linux.2.6.tar.rawrz 220M
     linux.2.6.tar.rawrz.xz 58M

It actually gets worse as rzip does more work, implying xz is finding quite long-distance matches (bzip2 won’t find matches over more than 900k).  So, rzip could only have benefit over xz on really huge files: but note that current rzip is limited on filesize to 4G so it’s a pretty small useful window.

Code review: libreplace

libreplace is the SAMBA library (also used in ctdb) to provide working implementations of various standard(ish) functions on platforms where they are missing or flawed.  It was initially created in 1996 by Andrew Tridgell based on various existing replacement hacks in utils.c (see commit 3ee9d454).

The basic format of replace.h is:

    #ifndef HAVE_STRDUP
    #define strdup rep_strdup
    char *rep_strdup(const char *s);
    #endif

If configure fails to identify the given function X, rep_X is used in its place.  replace.h has some such declarations, but most have migrated to the system/ include directory which has loosely grouped functions by categories such as dir.h, select.h, time.h, etc.  This works around the “which header(s) do I include” problem as well as guaranteeing specific functions.

Other than reading this code for a sense of Unix-like paleontology (and it’s so hard to tell when to remove any of these helpers that cleanups are rare) we can group replacements into three categories:

  1. Helper functions or definitions which are missing, eg. strdup or S_IRWXU.
  2. “Works for me” hacks for platform limitations, which make things compile but are not general, and
  3. Outright extensions, such as #define ZERO_STRUCT(x) memset((char *)&(x), 0, sizeof(x)) or Linux kernel inspired likely()

Since it’s autoconf-based, it uses the standard #ifdef instead of #if (a potential source of bugs, as I’ve mentioned before).  I’ll concentrate on the insufficiently-general issues which can bite users of the library, and a few random asides.

  • #ifndef HAVE_VOLATILE ?  I can’t believe Samba still compiles on a compiler that doesn’t support volatile (this just defines volatile away altogether).  If it did no optimizations whatsoever, volatile might not matter, but I’m suspicious…
  • typedef int bool; is a fairly common workaround for lack of bool, but since pointers implicitly cast to bool but can get truncated when passed as an int, it’s a theoretical trap. ie. (bool)0x1234567800000000 == true, (int)0x1234567800000000 == 0.
  • #if !defined(HAVE_VOLATILE) is the same test as above, repeated. It’s still as bad an idea as it was 186 lines before :)
  • ZERO_STRUCT, ZERO_ARRAY and ARRAY_SIZE are fairly sane, but could use gcc extensions to check their args where available. I implemented this for ARRAY_SIZE in the Linux kernel and in CCAN (a sketch follows this list). Making sure an arg is a struct is harder, but we could figure something out…
  • #define PATH_MAX 1024 assumes that systems which don’t define PATH_MAX probably have small path limits. If it’s too short though, it opens up buffer overruns. Similarly for NGROUPS_MAX and PASSWORD_LENGTH.
  • The dlopen replacement is cute: it uses shl_load where available (Google says HPUX), but dlerror simply looks like so:
        #ifndef HAVE_DLERROR
        char *rep_dlerror(void)
        {
         	return "dynamic loading of objects not supported on this platform";
        }
        #endif

    This cute message for runtime failure allows your code to compile, but isn’t helpful if dlopen was a requirement. Also, this should use strerror for shl_load.

  • havenone.h is (I assume) a useful header for testing all the replacements at once: it undefines all HAVE_ macros. Unfortunately it hasn’t been updated, and so it isn’t complete (unused code is buggy code).
  • inet_pton is credited to Paul Vixie 1996. It’s K&R-style non-prototype, returns an int instead of bool, and doesn’t use strspn ((pch = strchr(digits, ch)) != NULL) or (better) atoi. But it checks for exactly 4 octets, numbers > 255, and carefully doesn’t write to dst unless it succeeds. I would have used sscanf(), which wouldn’t have caught too-long input like “1.2.3.4.5”. OTOH, it would catch “127…1” which this would allow. But making input checks more strict is a bad way to be popular…
  • Tridge’s opendir/readdir/telldir/seekdir/closedir replacement in repdir_getdents.c is a replacement for broken telldir/seekdir in the presence of deletions, and a workaround for (older?) BSD’s performance issues. It is in fact never used, because the configure test has had #error _donot_use_getdents_replacement_anymore in it since at least 2006 when the Samba4 changes were merged back into a common library!
  • repdir_getdirents.c is the same thing, implemented in terms of getdirents rather than getdents; it’s still used if the telldir/delete/seekdir test fails.
  • replace.c shows some of the schizophrenia of approaches to replacement: rep_ftruncate #errors if there’s no chsize or F_FREESP ioctl which can be used instead, but rep_initgroups returns -1/ENOSYS in the similar case. Best would be not to implement replacements if none can be implemented, so compile will fail if they’re used.
  • rep_pread and rep_pwrite are classic cases of the limitations of replacement libraries like this: pread is not supposed to affect the file offset, yet file offsets are shared with children and dup’d fds, so there’s no sane general way to implement it. In fact tdb has to test for this in tdb_reopen_internal. I would implement a read_seek/write_seek which are documented not to have these guarantees. I remember Tridge ranting about glibc doing the same kind of unsafe implementation of pselect :)
  • snprintf only rivals qsort-with-damn-priv-pointer for pain of “if only they’d done the original function right, I wouldn’t have to reimplement the entire thing”. I’ll avoid the glibc-extracted strptime as well.
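
For reference, the gcc-extension check mentioned for ARRAY_SIZE looks roughly like this (close to, but not copied from, the kernel/CCAN versions):

    /* ARRAY_SIZE that refuses to compile when handed a pointer rather than an
     * actual array: &arr[0] decays to a pointer, which is a different type
     * from the array itself, so __builtin_types_compatible_p catches misuse. */
    #define _array_size_check(arr) \
            (sizeof(char [1 - 2 * __builtin_types_compatible_p(typeof(arr), \
                                                               typeof(&(arr)[0]))]) - 1)
    #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + _array_size_check(arr))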

I’m not sure Samba compiles on as many platforms as it used to; Perl is probably a better place for this kind of library to have maximum obscure-platform testing. But if I were to put this in CCAN, this would make an excellent start.

Code Reviewing: popt

I decided that every day I would review some code in ctdb.  I never spend enough time reading code, and yet it’s the only way to really get to know a project.  And I decided to start easy: with the popt library in ctdb.

popt is the command-line parsing library included with the RPM source, and used by SAMBA.  I vaguely recall Tridge telling me years ago how good it was.  I started with the man page, and it is an excellent and useful library: it does enough to make the simple cases less code than getopt_long, yet allows flexibility for complex cases.

The idea is simple: similar to getopt_long, you create an array describing the possible options.  Unlike getopt, however, simple integers and flags are handled automatically.  So you set up the context with the command line, then hand that context to poptGetNextOpt() to do the parsing.  That keeps parsing args until it hits an error, runs out of arguments, or reaches an option you’ve marked to return a value for special handling.  The manual page is excellent and made me feel really comfortable with using the code.
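
From memory of the man page, the basic pattern looks something like this (the options themselves are made up for illustration):

    #include <stdio.h>
    #include <popt.h>

    int main(int argc, const char *argv[])
    {
        int verbose = 0, count = 1;
        struct poptOption options[] = {
            /* Simple flags and integers are filled in directly by popt. */
            { "verbose", 'v', POPT_ARG_NONE, &verbose, 0, "print more output", NULL },
            { "count", 'c', POPT_ARG_INT, &count, 0, "number of iterations", "N" },
            POPT_AUTOHELP
            POPT_TABLEEND
        };
        poptContext pc = poptGetContext("example", argc, argv, options, 0);
        int rc;

        /* Only options marked with a non-zero val (none here) are returned;
         * we just spin until the end of the arguments or an error. */
        while ((rc = poptGetNextOpt(pc)) > 0)
            ;
        if (rc < -1) {
            fprintf(stderr, "%s: %s\n",
                    poptBadOption(pc, POPT_BADOPTION_NOALIAS), poptStrerror(rc));
            poptFreeContext(pc);
            return 1;
        }

        printf("verbose=%d count=%d\n", verbose, count);
        poptFreeContext(pc);
        return 0;
    }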

Now, the code itself.  Even before you notice the four-space tabs and bumpyCaps, you see the lint annotations and docbook-style comments cluttering the source.  It does make the code annoying to read, breaking CCAN’s Golden Rule.  Typedefs of structs, typedefs to pointers, and several #ifdef  __cplusplus complete my minor gripes.

The code is old and cross-platform: the headers test for features we’d simply assume in a modern Linux.  It has a simple set of bitmap macros (taken from pbm, judging from the names), and a helper routine to find an executable in $PATH (using alloca!).  These are the kinds of things which would be in separate modules, were this in CCAN.

The definition of _free() as a (redundantly-NULL-taking) free() which always returns NULL is used to good effect throughout the code:

    defs = _free(defs);
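
From that description, the definition is presumably something along these lines (my reconstruction, not copied from popt):

    #include <stdlib.h>

    /* free() already accepts NULL, hence "redundantly-NULL-taking"; returning
     * NULL lets callers free and clear a pointer in a single statement. */
    static inline void *_free(const void *p)
    {
        if (p != NULL)
            free((void *)p);
        return NULL;
    }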

Here is a trick I hadn’t seen before to zero an onstack var, and really don’t recommend:

    poptDone done = memset(alloca(sizeof(*done)), 0, sizeof(*done));

The help-printing code is in a separate file, popthelp.c.  It’s actually about a third of the code!  That’s mainly because it takes pains to pretty-print, and it’s done by manually tracking the column number through the routines (aka ‘cursor’).  These days we’d asprintf into a temp buffer and strlen() to figure out whether to start a new line and indent before printing.  Or even better, create a struct with the FILE * and the column number, and create an fprintf variant which updates the column number and figures out wrapping for us (sketched below). Routines like maxArgWidth(), which iterates the table(s) to figure out how much to indent, will still suck however: probably simplest to build all the strings in the help table in memory and then dump them at the end.
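
The struct-plus-wrapper idea would look something like this; entirely my sketch, not popt code (and it ignores embedded newlines for brevity):

    #define _GNU_SOURCE             /* for vasprintf() */
    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Track the output column alongside the stream, so wrapping and
     * indentation decisions don't need a hand-maintained 'cursor'. */
    struct colfile {
        FILE *f;
        int col;
        int width;              /* wrap point, e.g. 79 */
        int indent;             /* indent for continuation lines */
    };

    static int colprintf(struct colfile *cf, const char *fmt, ...)
    {
        char *buf;
        va_list ap;
        int len;

        va_start(ap, fmt);
        len = vasprintf(&buf, fmt, ap);
        va_end(ap);
        if (len < 0)
            return len;

        if (cf->col + len > cf->width) {
            fprintf(cf->f, "\n%*s", cf->indent, "");
            cf->col = cf->indent;
        }
        fputs(buf, cf->f);
        cf->col += len;
        free(buf);
        return len;
    }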

I thought I found a buffer overrun, but a test program disproved it: the singleOptionHelp() uses its maxLeftCol (plus some extra stuff) to size a buffer.  This works because maxLeftCol is previously calculated as the worst-case size of the argument help.  Now, the code is unclear (AFAICT maxLeftCol must always be sufficient, so the extra bytes are wasted), but not buggy.

In summary, this is excellent code.  Talloc would help it, as would modern printf usage (%.*s for example), but generally the code is in really good shape.  I know that the popt code in RPM is a little updated, but I can’t imagine that much has changed in such an obviously-stable codebase. The temptation to rewrite it is thus very low: who knows what the testsuite would miss?

linux.conf.au 2010

After slightly disastrous preparation  (my left knee in a brace as detailed for the perversely-curious here) my week went well.  I tried to get back to my hotel room early each evening to rest as per doctor’s orders (and managed it except Friday night), but polishing my Friday talk tended to take a few hours every day.

Sunday

The Newcomer’s Session was well attended, but Jacinta and I were slack with preparation so it was unbalanced for my tastes.  I raced to the post-session pub assuming my crutches would ensure I’d be the trailer, to find that I was wrong.  It would have been better to explicitly and immediately drag people towards the pub, because that’s (seriously) the most important part of the introduction to LCA.

Monday

Miniconf time, and I started in the Open Programming Languages miniconf.  There was some interestingly random language stuff there: it’s one of my few opportunities to get exposure to higher level languages.  The miniconf talks were enthusiastic and unpolished, as such things are supposed to be.  Haskell, and all the wonderful things it doesn’t let you do by Stephen Blackheath was interesting, but lacked solid examples. Introducing Gearman — Distributed server for all languages by Giuseppe Maxia was a great short intro into an unknown project. vcs-pkg.org by Martin F. Krafft was a classic work-in-progress talk, with insights into a mundane but critical infrastructure problem (standards and practices for coordinating upstream and across distributions using distributed revision control).

Die Flash Die – SVG has arrived by Andy Fitzsimon was a classic bling talk with a message about the animation potential for SVG.  Useful content, too, for those dealing with this, and I was blown away to hear of Gordon, a FOSS Flash™ runtime written in JavaScript.

How to Use FOSS Graphics Tools to Pay for College by Elizabeth Garbee was an insight into the US education system and a chance to find out what my friend Edale (I know she hates that meme!) was doing.  But her talk didn’t quite gel for this audience. Unfortunately using the words “did you spot the head-fake?” riles me.  You are suddenly telling me that you’ve been using your intellect to compete with me rather than to inform and enrich me.

Then came my own Talloc: Pick Up Your Own Garbage! talk, which was delayed by my miscalculation of transit time on crutches. A mediocre rehash of my previous talloc talks, but I wanted to put it in front of this group because it really offers a fresh view into a program’s data structures at runtime.

Writing Facebook Applications in Perl by Paul Fenwick was a nice little introduction to the FB API from a Perl point of view, but he kept his powder dry for his awesome lightning talk on Friday.

Tuesday

I peered in at the tail end of the keynote which was apparently excellent.  I woke a little early then did some more work on my presentation, and by the time I had breakfast I was incurably late. One person admitted to me that they watched the live stream from their hotel room, but I wasn’t that clever.

This day was all hallway track for me, catching up with many people I haven’t seen since last year. Then the Speaker’s Dinner at the Museum of New Zealand: Te Papa Tongarewa. This is also a fun time to chat with everyone, but I was disappointed that my crutches forced me to forgo learning a traditional Haka. It was also the first chance to greet the two chief organizers, who had been sick the first two days of the conference.  Edale and I also had fun playing with their very-past-bedtime hyper 2 yo Brooke (until we were told not to stir her up any more!)

Wednesday

The keynote by Benjamin Mako Hill was a little chaotic but made his point about antifeatures well: how such things are only viable when consumers don’t have freedom of choice (in particular, ability to examine, modify and share the software they’re using).  I then headed to Introduction to game programming by Richard Jones, where I struggled with pyglet before giving up and paying half-attention.  I did learn some things though, and everyone who was active seemed to get great satisfaction from the tutorial.

Open Sourcing the Accountants by Jethro Carr lacked density.  It takes a great deal of work to give a thorough comparison of different accounting packages, and his insights into how accountants think were insufficient to make that the backbone of his talk either.

subunit: Testing across boundaries for fun and profit by Robert Collins was slightly familiar ground for me, but as libtap maintainer he wanted me to attend.  It was a good bread-and-butter talk, which perhaps could have benefited from a few more snazzy real-life examples (making testing sexy is hard though!).  He semi-seriously suggested I should take over the C output implementation for subunit; still thinking…

I caught the questions at Teaching FOSS at universities by Andrew Tridgell and Robert (Bob) Edwards, which I will watch seriously once the videos are uploaded.

Then came the one compulsory-attendance presentation of the week: The World’s Worst Inventions by Paul Fenwick. I had made a comment to Paul earlier in the week that I was concerned that my talk lacked substance.  His reply was “I won’t comment on how much substance is in my talk”.  And any conclusions were left to the minds of the audience as a full-costumed Paul took us through a series of invention disasters.  I teased him about it later, but let’s be honest: if I could present like that I wouldn’t have worried about content either!

That evening was the HackOff.  I’ve never tried competitive programming, so when we came up with the plan of a SAMBA team, I heartily endorsed it :)  Intimidation is important at these events, and the tweet from Adam Harvey was promising: At the #lca2010 HackOff. There’s a table with Rusty, Tridge, Anthony Towns and Andrew Bartlett. We’re fucked. However, despite having the largest team (with 6 of us), we only just squeaked in by 2 minutes.  Subtract any one of the team and we wouldn’t have won, though with fewer we might not have tried to brute-force the final question.

Thursday

Glyn Moody’s keynote was excellent. Then I lost some more hallway time before emerging in The Elephant in the Room: Microsoft and Free Software by Jeremy Allison. I thought it was a worthwhile and balanced presentation; of course it had a few cheap laughs in it, but the examination of Microsoft’s actions wrt FOSS is a worthwhile exercise if we want to assess their potential impact.

I was a bit late to Building a Xapian index of Wikipedia in less time than this talk takes by Olly Betts, but it was too unprepared for my tastes and I went in not knowing what Xapian was (though I picked it up from context). Tux on the Moon: FOSS hardware and software in space by Jonathan Oxer was good, but another one I was late to (15 minutes between talks seems to give me enough time to start conversations, but not enough to finish them).

Simplicity Through Optimization by Paul McKenney was a good talk if you didn’t know your RCU.  For me, I would have liked to hear more about what the various lines of code were doing (before they were excised by the optimized implementation).  But being deeply familiar with the theory and somewhat familiar with the practice, I’m probably in a minority.

By this stage I was exhausted, and Using Functional Programming Techniques In Your Favourite Language by Malcolm Tredinnick was in the same room, so I stayed.  This talk was a disappointment to me (and, I think, Malcolm) because it didn’t quite contain the general insights he’d believed were there when he proposed the talk. Nice for me to get a refreshing exposure to functional programming though.

Dinner at an Indian restaurant with the SAMBA people, which meant I was right near the PDNS, so I dropped in briefly then returned to my hotel room for an early night.

Friday

Nat Torkington’s keynote contained the classic “heckle Rusty” factor and was delightfully punchy. He rolled over to a very very strong set of lightning talks; a format which works so well at these geek-rich events.  Paul Fenwick’s “Unfriendly” Facebook app was an awesome way to close.

Patent defence for free software by Andrew Tridgell (late again!) was familiar ground for me, but I wanted to see how he presented such a dry area.  Lots of text: I would have included some more diagrams (claim dependencies are well represented by a tree, for example). But the audience were rapt, so I’m clearly too picky!

Last minute prep for my talk: I had decided the previous night that I would use Notes View, only to find that no one could get it to work.  Both notes and presentation appeared on the projector screen; fortunately, just as I was about to give up and run without notes, someone suggested I simply drag the notes view back onto my laptop screen!  Sometimes the low-tech approaches evade our over-complicated minds.

FOSS Fun With A Wiimote by Rusty Russell was well-received. I didn’t go as far with the project as I had intended, due to personal time constraints and time lost wrangling with actual hardware, but sometimes that’s instructive too.

The presentation itself was flawed in three places.  Firstly, my intro slide appeared all at once rather than clicking in one point at a time, destroying my carefully practiced routine at that point.  Secondly, no one knew what LED throwies were (an open source graffiti technology developed at the Graffiti Research Lab), so that slide was completely lost.  Finally, I should have set up my replacement two-year-old on the stage where the audience and the cameras could see her clearly.

The closing announced Brisbane for lca2011, and I handed the virtual speakers’ gift to the organisers.  That done, I was ready to relax at the Penguin Dinner.  Most years I don’t even drink, knowing that I’ll have to do the auction.  But as there was no auction, I sat next to Nat Torkington to guarantee great conversation and was ready to chill. I did some singing, and didn’t try the Haka (again). I even got a cuddle with the organiser’s very well-behaved 5-month-old son Adam.

Unfortunately, events conspired against me and I was dragged into a pledging war for a prize I didn’t want to win (and at which my doctor would be aghast). I thought we could get more money from the Wenches For Winching, who were weasonably wealthy and weally wanted to win. Ted Ts’o had a similar idea. Unfortunately the prospect of crippled Rusty being “rescued” (after being dropped: that was the no way part) was too alluring for many, and I had to work hard to ensure I didn’t win.

A good time had by all, though exhausting after a long week.

Saturday

Briefly peered into the Open Day, which was buzzing with setup and opening, before heading home, spent. I did find out that wild weather had wuined the winching of wenches; but there is a standing offer when they find themselves in Wellington again.

Summary

Absolutely on par with previous awesome conferences; there were no organisational disappointments for me the entire week. I was particularly happy to see people digging in and fixing things when they were wrong instead of simply complaining.

A great achievement, everyone!

More Linux-next Graphing

Mikey blogs about linux-next workload with pretty graphs.  Ideally, we should all have our patches marshalled before the freeze, and there should be no pre-merge-window peak.  I’ve gotten close to this in recent times, by being lazy and being content to wait for the next cycle if I miss one. Rushing things in tends to imply we skimped on review and testing.

So on that basis 2.6.30 looks like a winner:  it has the smallest “peak of crap”.

If you want to actually measure how much work Stephen is having to do, you would have to figure out what changes he is seeing.  Approximately, that would be the changes in Linus’ tree that did not come from linux-next, plus any new work entering linux-next itself.  Anyone?