Rusty Russell's Coding Blog | Stealing From Smart People

I enjoyed Levitt & Dubner’s “Freakonomics”, and picked up the followup “Superfreakonomics” recently at an airport.  The last chapter, however, was astonishing.  The entire chapter was devoted to a glowing advertisement for Intellectual Ventures, pointing out that they own 20,000 patents “more than all but a few dozen companies in the world”, but of course “there is little hard evidence” that they are patent trolls.

But this bunch of wacky genius billionaires have solved global warming (much of which they dispute anyway) and can control malaria and prevent hurricanes from forming.  Unlike the rest of the book, which takes well-known facts and disputes them with insightful economic research, this chapter is so breathless and gushy that it makes me question the rest of the authors’ work.

I first came across Intellectual Ventures when The Economist reversed their 100-year opposition to patents, and the only reason I could find was a similarly cheerleading piece about this company.  (I had naively expected new research revealing some net positive of patents, or some such revelation).

Side note: when a respected information source covers something where you have on-the-ground experience, the result is often to make you wonder how much fecal matter you’ve swallowed in areas outside your own expertise.

So, what is IV actually doing?  Buying up loads of patents and licensing them to companies who calculate it’s not worth the fight is patent trolling 101.  Yet the scale they’re operating on puts them on new ground, and opens new opportunities.  It seems obvious to get corporate investors on board by promising them immunity from patent claims.  With enough patents you stop trying to license them one-by-one and just tax each industry at some non-negotiable rate.  No doubt they have more tricks I haven’t even thought of, but these potential devices really do make them a new breed of Super Trolls.

Their efforts to actually attain their own patents could simply be more of the same, but it’s also a relatively cheap but clever PR exercise (as shown by their media treatment).  This will help them when (legislative?) efforts are made to shut down patent trolls.  I’m fairly confident that they’ll simply license rather than implement anything themselves; actually producing things requires much more work, and simply exposes you to others’ patents.

Without diving deeply into this, they seem to understand two things clearly:

  1. They learnt from Microsoft that government-enforced monopolies are worth billions.  Microsoft’s monopoly was copyright on software; theirs is patents.
  2. Development is getting much cheaper, while patents are getting more valuable.   Cheaper development is shown clearly by free software, open hardware and hackerspaces.  Patent value increases as more of the world becomes a more profitable and enforceable patent target.

Now, I don’t really care if one company leeches off the others.  But if they want to tax software, they have to attack free software otherwise people will switch to avoid their patent licensing costs.  And if you don’t believe some useful pieces of free software could be effectively banned due to patent violations, you don’t think on the same scale as these guys.


Jun 8, 2010: LWN Quotes Of The Week

I have been petitioning Jon Corbet to introduce a “Supporter” level subscription to LWN; given his failure to keep up with inflation and my need to experience conferences vicariously, I feel deeply indebted to them.

That led to me looking through LWN’s Quote of the Week; there’s no neat index of these things.  And I’m not going to create one, but I did troll for my old quotes while procrastinating, and here’s my vanity list:

  1. January 21, 2004
  2. March 14, 2007
  3. March 28, 2007
  4. March 5, 2008
  5. May 7, 2008
  6. December 10, 2008
  7. January 28, 2009
  8. February 11, 2009
  9. April 22, 2009
  10. May 26, 2010

From the comments, it looks like I got the very first one.  Wow.  I’m sure there are missing ones (I was sure I got two weeks in a row once).

I’m behind quota for 2010 though; I was surprised my recent fight with Linus didn’t get in.  I’ll try harder…


A classic pattern in C is to hand around a generic callback function which takes a “void *priv” pointer, so the caller can supply arbitrary state (side note: a classic anti-pattern is not to do this, resulting in qsort being reimplemented in Samba so one can be provided!).

The problem with this pattern is that it breaks type safety completely, such as in the following example:

    int register_callback(void (*callback)(void *priv), void *priv);

    static void my_callback(void *_obj)
    {
        struct obj *obj = _obj;
        ...
    }
    ...
    register_callback(my_callback, &my_obj);

If I change the type of my_obj, there’s no compiler warning that I’m now handing my callback something it doesn’t expect.
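
For instance (a hypothetical, self-contained illustration; none of these names come from real code), nothing stops the private object changing type underneath the callback:

    #include <stdio.h>

    struct obj { int count; };
    struct other { double val; };

    static void my_callback(void *_obj)
    {
        struct obj *obj = _obj;
        printf("%d\n", obj->count);     /* garbage if _obj isn't really a struct obj */
    }

    /* This used to be "static struct obj my_obj;"... */
    static struct other my_obj;

    int main(void)
    {
        my_callback(&my_obj);           /* compiles without a peep */
        return 0;
    }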

Some time ago, after such a change bit me, I proposed a patch to lkml to rectify this, using a typesafe callback mechanism. It was a significant change, and the kernel tends to be lukewarm on safety issues so it went nowhere. But these thoughts did evolve into CCAN’s typesafe_cb module.

The tricksiness…

More recently, I tried to use it in libctdb (the new async ctdb library I’ve been toying with), and discovered a fatal flaw. To understand the problem, you have to dive into how I implemented typesafe_cb. At its base is a conditional cast macro: cast_if_type(desttype, expr, oktype). If expr is of type “oktype”, cast it to “desttype”. On compilers which don’t support the fancy gcc builtins needed to do this, this just becomes an unconditional cast “(desttype)(expr)”. This allows us to do the following:

    #define register_callback(func, priv)                                  \
        _register_callback(cast_if_type(void (*)(void *), (func),          \
                                        void (*)(typeof(priv))), (priv))

This says that we cast the func to the generic function type only if it exactly matches the private argument. The real typesafe_cb macro is more complex than this because it needs to ensure that priv is a pointer, but you get the idea.
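
For the curious, the conditional cast is built on gcc extensions; it looks something like this (a rough sketch with illustrative HAVE_ config macros, not the exact CCAN code):

    #if HAVE_TYPEOF && HAVE_BUILTIN_CHOOSE_EXPR && HAVE_BUILTIN_TYPES_COMPATIBLE_P
    #define cast_if_type(desttype, expr, oktype)                           \
        __builtin_choose_expr(                                             \
            __builtin_types_compatible_p(__typeof__(expr), oktype),        \
            (desttype)(expr), (expr))
    #else
    /* No gcc builtins: degrade to an unconditional (unchecked) cast. */
    #define cast_if_type(desttype, expr, oktype) ((desttype)(expr))
    #endif

If the types don’t match, the expression is left alone, and passing it to the underlying function with the “void *” prototype provokes the warning we want.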

Now, one great trick is that the callback function can take a “const” (or volatile) pointer of the priv type, and we let that work as well: we have a “cast_if_any” which extends “cast_if_type” to any of three types:

    #define typesafe_cb(rtype, fn, arg)                                    \
        cast_if_any(rtype (*)(void *), (fn),                               \
                    rtype (*)(typeof(*arg)*),                              \
                    rtype (*)(const typeof(*arg)*),                        \
                    rtype (*)(volatile typeof(*arg)*))

The flaw…

If your private arg is a pointer to an undefined (incomplete) type, typeof(*arg) won’t work, yet you need it to declare a const pointer to the same type.  I have just filed a bug report, but meanwhile, I need a solution.

The workarounds…

Rather than use cast_if_any, you can insert an explicit call to the callback to provoke a warning if the private arg doesn’t match, then just cast the callback function.  This is in fact what I now do, with an additional test that the return type of the function exactly matches the expected return type.  cast_if_type() now takes an extra argument, which is the type to test:


    #define typesafe_cb(rtype, fn, arg)                                    \
        cast_if_type(rtype (*)(void *), (fn), (fn)(arg), rtype)

cast_if_type does a typeof() on (fn)(arg), which will cause a warning if the arg doesn’t match the function, and cast_if_type will only cast (fn) if the return type matches rtype. You can’t test the return type using a normal test (eg. “rtype _test; sizeof(_test = fn(arg));”) because implicit integer promotion makes this compile without a warning even if fn() returns a different integer type.
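
Here’s a hypothetical, stand-alone illustration of that promotion problem (the names are invented for the example):

    #include <stdio.h>

    /* fn() returns char, not int, yet the sizeof test below compiles silently
     * because the char result promotes to int on assignment. */
    static char returns_char(void *priv)
    {
        (void)priv;
        return 'x';
    }

    int main(void)
    {
        int _test;
        printf("%zu\n", sizeof(_test = returns_char(NULL)));   /* no warning */
        (void)_test;
        return 0;
    }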

Unfortunately, the more general typesafe_cb_preargs() and typesafe_cb_postargs() macros lose out. These are like typesafe_cb but for callbacks which take extra arguments (the more common case).


    /* This doesn't work: arg might be ptr to undefined struct. */
    #define typesafe_cb_preargs(rtype, fn, arg, ...)                       \
        cast_if_any(rtype (*)(__VA_ARGS__, void *), (fn),                  \
                    rtype (*)(__VA_ARGS__, typeof(arg)),                   \
                    rtype (*)(__VA_ARGS__, const typeof(*arg) *),          \
                    rtype (*)(__VA_ARGS__, volatile typeof(*arg) *))

We can’t rely on testing an indirect call here: we’d need example parameters to pass, and because they’d be promoted, the direct call might work fine while an indirect call via a different function signature fails spectacularly.  We’re supposed to increase type safety, not reduce it!

We could force the caller to specify the type of the priv arg, eg. “register_callback(func, struct foo *, priv)”. But this strikes me as placing the burden in the wrong place, for an issue I hope will be resolved soonish. So for the moment, you can’t use const or volatile on callback functions:


    /* No const/volatile support: arg might be a ptr to an undefined struct. */
    #define typesafe_cb_preargs(rtype, fn, arg, ...)                       \
        cast_if_type(rtype (*)(__VA_ARGS__, void *), (fn), (fn),           \
                     rtype (*)(__VA_ARGS__, typeof(arg)))
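
To show how a caller might wrap this, here’s a hypothetical registration helper (invented names, not libctdb code): the macro only quietly casts when the callback’s final argument matches the priv pointer’s type, so a mismatch still produces the usual incompatible-pointer warning.

    struct connection { int fd; };

    void _register_fd_callback(void (*cb)(int fd, void *priv), void *priv);

    #define register_fd_callback(cb, priv)                                  \
        _register_fd_callback(typesafe_cb_preargs(void, (cb), (priv), int), \
                              (priv))

    static void on_readable(int fd, struct connection *conn);
    ...
    struct connection *conn = ...;
    int wrong = 0;
    register_fd_callback(on_readable, conn);    /* compiles cleanly */
    register_fd_callback(on_readable, &wrong);  /* provokes a warning */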


So my previous post showed a trick with stuffing extra bits into the pointers of a simple hash table.  In most real situations, we want the hash to still be useful as the number of elements increases: the simple hash I used will slow down dramatically as we pass 75% utilization.  In addition, we want to be able to delete entries, which can be slightly painful in a linear hash.

Since then, I’ve been trying a few different approaches: in fact, I spent quite a while down this rabbit hole before digging myself out.  Here are the approaches I benchmarked (on a 64-bit system):

  1. Doubling: Linear hash with extra hash bits in the pointers, doubling and rehashing everything when it hits > 5/8 full (a rough sketch of this scheme follows the list).
  2. Cacheline Tree: Divide the hash into 128-byte groups.  Do linear collision resolution within each group, and when a group fills explode it into 16 pointers to separate hash tables of N 128-byte groups.  When that fills, explode again, etc.  In practice, N was 2.
  3. Cacheline Tree Then Linked:  Explode the top level like the above, but allocate a single cacheline (N=1) for the second level and just fill the second level from first bucket onwards.  If that fills, simply chain on another cacheline.  This actually freed up on delete as chains were emptied.
  4. Cacheline Linked: Even simpler: when the toplevel cacheline fills, just chain another cacheline off it.  Each cacheline is simply filled in order, but we only put 15 pointers in each cacheline and use the extra 4 bits per entry for additional hash bits (to reduce comparisons).  This also freed them up on delete.
  5. Chaining: Doubly-linked list through each object: when a bucket is already occupied we simply append the new entry to the existing ones.
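
To make the first approach concrete, here is a minimal sketch (an illustration under my own naming, not the benchmark code) of a linear-probe insert that stashes three spare hash bits in each 8-byte-aligned pointer and reports when the table passes 5/8 full:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define EXTRA_BITS 3                    /* 8-byte alignment leaves 3 spare low bits */

    struct htable {
        unsigned int bits;                  /* table size is 1 << bits */
        size_t count;
        uintptr_t *slots;                   /* pointer | extra hash bits; 0 == empty */
    };

    /* Returns false once we pass 5/8 full: the caller doubles and rehashes. */
    static bool htable_add(struct htable *ht, const void *obj, uint64_t hash)
    {
        size_t mask = ((size_t)1 << ht->bits) - 1;
        size_t i = hash & mask;
        uintptr_t extra = (hash >> ht->bits) & ((1 << EXTRA_BITS) - 1);

        while (ht->slots[i])                /* linear probe to the next empty slot */
            i = (i + 1) & mask;
        ht->slots[i] = (uintptr_t)obj | extra;

        return ++ht->count <= (mask + 1) * 5 / 8;
    }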

Then I benchmarked a 2M-bucket (actually 2097152) hashtable of each type, increasing the number of entries by a factor of 2 at a time from 1M.  The good thing about visiting the Canberra office is I could use one of the  older Power machines here with gobs of memory.

The first three tests were very similar to the last post, except I used an array of scrambled pointers so I wasn’t accessing the objects in memory order.  The last two are new:

  1. Insert all the objects into the hash.
  2. Find all the objects in the hash.
  3. Do the same number of missed searches in the hash.
  4. Delete half the objects, then time reinserting them (steady-state test which avoids any allocation penalties for expansion).
  5. Delete every object from the hash.

Insert

Our simple link of cachelines (CLT Linked) gets so slow I took it out of the benchmark after 8x overload. Chained is the winner: we expect it to modify the inserted entry, the hash bucket and the current top entry for a total of three cache misses (and the object being inserted would generally be hot). It also never needs to (re)alloc or rehash.

We can see the CLT-then-linked case is getting worse as we fall into the chain and have to linear search for the entry to place into; there are 16 times as many of these chains as in the dumb “link of cache lines” case, because we have a link off each pointer off each toplevel hash entry, rather than one off each cacheline.

If we look at reinserts, we can remove the realloc and rehash overhead:

In steady-state we can see the others are more competitive, though even our flat linear hash loses a little as it scans for the next NULL bucket.

Memory Usage


The optimal reasonable case would be one pointer per object; 8 bytes. The two linked cases quickly approach this (at which point they suck). The chaining case uses two pointers per object; doubling uses 2 pointers for these values too (it can do slightly better just before it doubles, and worse just afterwards).

The real story is the cacheline tree. It hits a tipping point as buckets fill: that 512-byte second level off each top level pointer means we can spend 131 bytes per object just after we fill the top level. At 8 million objects in our 1 million buckets we are in fact spending 131 bytes per object (vs 17 for the chained case).

This is why I reduced the second level hash size to two cachelines; and we still get that factor of 16 explosion on each expansion. In fact, the chaining hashes were an attempt to find a “good enough” variant which didn’t explode like this. The chaining variant (CLT then linked) explodes once, to 65 bytes per entry, but then expands linearly.

Lookups

This is where the cache damage of chaining becomes apparent. Doubling performs best, and the tree is close behind. Even a linked list of cachelines beats chaining: with 15 entries per cacheline on average 13 of those will not need to be examined (since the lower pointer bits where we’ve stashed another 3 hash bits will only accidentally match for two of them). This means 1/5 of the cache misses of the chained case for really large overload values.

By 64 entries per bucket (128M objects) chaining is 16 times slower than doubling for a successful lookup, and 20 times slower for an unsuccessful one. Even at 8 per bucket, it matches our terrible chained case at being 3 times slower than the linear case.

Since I assume most hashes are lookup heavy, this result should be weighted more than add or delete.

Delete

Chained delete is cheap: you might get three cache misses (one for the object you’re deleting, and one each for the next and prev objects in the linked list) but you don’t need to do a hash lookup at all. The linear hash takes a cache miss to find the object, then a cache miss to do the delete, but then it has to rehash any objects in the immediately following hash buckets. It’s 2.2 times slower at 64x overload. We could use a special “deleted” marker value to avoid this.

Cacheline tree does OK, but still needs to hash and similarly reseat any “runs” within the bottom cacheline. At three levels, add cache misses for the second and third levels. It’s 3.1 times slower than chained at 64x overload.

Conclusion

I’d love to spend more time refining these results. In particular, there are ways to steal more bits to store extra hashing, ways to avoid reseating entries on linear deletes, and ways to avoid rehashing when doubling the linear case. Compression of pointers is also possible, but it gets messy since there is no longer a fixed number of entries per cacheline. I might get time for some of this, but at some point you have to get on with less abstract work!

But the takeaway is this: it’s hard to beat the simplicity of a big hash array which doubles and stashes a few extra hash bits. If your workload is add/delete heavy, use chaining. If you are worried about unknown work sizes, use doubling. Or do what Perl does, and do both (though you can’t use the extra hash bit stowing with chaining).

The cache tree idea is not best at anything, unless you can’t double for some reason (the Linux kernel probably has this issue, and TDB may or may not).


(FIXED: my “NoMatched” bench was bogus, and actually the same as “Matched” case. Graphs changed, but conclusions the same.)
Everyone loves hash tables; certainly the networking code in the kernel uses them heavily and I do too.  But with more work on TDB (which is basically a hash table in a file), I’ve been re-measuring some of the fundamentals.

We all know that hash tables are cache-unfriendly; a good hash function means you’re hitting a “random” bucket every time.  Back in my university days this was never mentioned (caches were new and fancy things); the optimal hash table was seen as a prime-numbered sequence of buckets using a primary hash, then a secondary hash to jump to the next bucket if that one was full.  This gives fewer lookups than a naive linear search (“if this bucket is full, go to the next one”), but does that outweigh the cache-unfriendliness of the approach?
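
To spell out the two probing schemes being compared (a minimal sketch under my own naming, not the test program):

    #include <stddef.h>
    #include <stdint.h>

    /* Secondary ("double") hashing: the primary hash picks the starting bucket,
     * the secondary hash picks the probe step, so colliding keys scatter. */
    static inline size_t probe_secondary(uint64_t hash1, uint64_t hash2,
                                         size_t nbuckets, unsigned int attempt)
    {
        uint64_t step = 1 + hash2 % (nbuckets - 1);     /* step must never be 0 */
        return (size_t)((hash1 + attempt * step) % nbuckets);
    }

    /* Naive linear probing: colliding keys just spill into the next bucket. */
    static inline size_t probe_linear(uint64_t hash1, size_t nbuckets,
                                      unsigned int attempt)
    {
        return (size_t)((hash1 + attempt) % nbuckets);
    }

The linear version keeps probes in the same (or the very next) cacheline, which is exactly the cache-friendliness question the benchmark is asking.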

These days it’s normal to use a fast universal hash and a power-of-two for the hash table size (Perl 5 and the Linux kernel do this, though Perl uses another weaker hash also by Jenkins).  Also, we deal with hash collisions by chaining off entries rather than using other hash buckets.  This has two main advantages: (1) it doesn’t barf when the hash table fills up, and (2) you can easily delete an entry from the hash table.  Unfortunately, this is pretty cache-unfriendly too, compared to a linear bucket fallback, but can we measure it?

To this end, I wrote a simple example program which filled each kind of hash table (linear hash bucket, secondary hash bucket and chained hash buckets) to 50%, then ran successful lookups, then ran failed lookups.  This was on a 64-bit machine, and I chose 2^27 as my hash size, giving a 1GB hash table.  While most hashes are not this large, it’s useful to approximate a situation where other things are also using the cache.

This result surprised me, so I fiddled a bit and the basic shape remained; the differences vanished if the table was (say) only 10% full, but even at 25% we see chaining win. Here it’s about 15% faster on matched lookups (5% faster on misses).

What does this mean? It looks like linear edges over secondary for insertion, but for lookup it’s a wash.  But both are overshadowed by chaining, which doesn’t penalize other buckets when there’s a hash clash.  Despite the fact that objects are one pointer larger in the chaining case, it still wins (the test code uses two 64-bit values as the basic object).

But Now For The Twist…

The reason I was looking at old-school non-chained hashing schemes was because I have a new trick for them: stashing some hash bits in the bottom bits of the pointer.  This allows you to skip the dereference 7/8 times (assuming 64-bit-aligned pointers).
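
A minimal sketch of the trick (my own names, not the test code): since 64-bit-aligned pointers always have three zero low bits, we can hide three more hash bits there and reject most non-matching entries without touching the object at all.

    #include <stdbool.h>
    #include <stdint.h>

    #define EXTRA_MASK ((uintptr_t)0x7)   /* 64-bit-aligned pointers: 3 spare low bits */

    /* Store a pointer with 3 extra hash bits stashed in its low bits. */
    static inline uintptr_t stash(const void *ptr, uint64_t hash)
    {
        return (uintptr_t)ptr | (hash & EXTRA_MASK);
    }

    /* 7/8 of non-matching entries are rejected here, skipping the dereference. */
    static inline bool hash_might_match(uintptr_t slot, uint64_t hash)
    {
        return (slot & EXTRA_MASK) == (hash & EXTRA_MASK);
    }

    static inline void *slot_ptr(uintptr_t slot)
    {
        return (void *)(slot & ~EXTRA_MASK);
    }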

So, we add that trick to the linear case:

Now we can see the potential of not chaining through hash entries! It’s a full 25% faster than the normal linear case, and 10% faster than chaining (almost 20% faster in the missed case). Of course, for real life we need to handle delete and hash table filling. And that’s the subject of my next post, once I’ve finished coding it!


Several months ago Donna Benjamin convinced me to speak at LUV, and now the time is approaching.  My intent is to give a talk about bits and pieces, and rely on questions to see what the audience is interested in.  Pretty casual and should be heaps of fun if you’re nearby.

I will also be taking the chance to drop by Lonely Planet; one of the places we stayed in Lithuania had seen an influx of foreign tourists since being featured in their guide, so they forced a note and a bottle of their vodka on us to take to LP.  Naturally, we Australians all know each other, so I couldn’t turn them down :)


Feb 20, 2010: Rusty’s Travels

Headed through Germany 26th through 3rd March or so, then Lithuania via Poland.  Back via Singapore on 24/25 March.

My email will be intermittent (I hope!) but if you’re around and want to grab a meal or a beer with us, ping me!


Feb/10

16

Followup: lrzip

Mikael noted in my previous post that Con Kolivas’s lrzip is another interesting compressor.  In fact, Con has already done a simple 64-bit enhance of rzip for lrzip, and on our example file it gets 56M vs 55M for xz (lrzip in raw mode, followed by xz, gives 100k worse than just using lrzip: lrzip already uses lzma).

Assuming no bugs in rzip, the takeaway here is simple: rzip should not attempt to find matches within the range the backend compressor already covers (900k for bzip2 in rzip, 32k for gzip, megabytes for LZMA as used by lrzip).  The backend compressor will do a better job (as shown by similar results with lrzip when I increase the hash array size so it finds more matches: the resulting file is larger).

The rzip algorithm is good at finding matches over huge distances, and that is what it should stick to.  Huge here == size of file (rzip does not stream, for this reason).  And this implies only worrying about large matches over huge distances (the current 32 byte minimum is probably too small).  The current version of rzip uses an mmap window so it never has to seek, but this window is artificially limited to 900MB (or 60% of mem in lrzip).   If we carefully limit the number of comparisons with previous parts of the file, we may be able to reduce them to the point where we don’t confuse the readahead algorithms and thus get nice performance (fadvise may help here too) whether we are mmaped or seeking.

I like the idea that rzip should scale with the size of the file being compressed, not make assumptions about today’s memory sizes.  Though some kind of thrash detection using mincore might be necessary to avoid killing our dumb mm systems :(


Feb 15, 2010: xz vs rzip

As the kernel archive debates replacing .bz2 files with .xz, I took a brief glance at xz. My test was to take a tarball of the linux kernel source (made from a recent git tree, but excluding the .git directory):

     linux.2.6.tar 395M

For a comparison, bzip2 -9, rzip -9 (which uses bzip2 after finding distant matches), and xz:

     linux.2.6.tar.bz2 67M
     linux.2.6.tar.rz 65M
     linux.2.6.tar.xz 55M

So, I hacked rzip with a -R option to output non-bzip’d blocks:

     linux.2.6.tar.rawrz 269M

Xz on this file simulates what would happen if rzip used xz instead of libbz2:

     linux.2.6.tar.rawrz.xz 57M

Hmm, it makes xz worse!  OK, what if we rev up the conservative rzip to use 1G of memory rather than its 128M max, and then xz that?

     linux.2.6.tar.rawrz 220M
     linux.2.6.tar.rawrz.xz 58M

It actually gets worse as rzip does more work, implying xz is finding quite long-distance matches itself (bzip2 won’t find matches over more than 900k).  So, rzip could only have a benefit over xz on really huge files: but note that current rzip is limited to 4G file sizes, so it’s a pretty small useful window.


Feb 12, 2010: Code review: libreplace

libreplace is the Samba library (also used in ctdb) that provides working implementations of various standard(ish) functions on platforms where they are missing or flawed.  It was initially created in 1996 by Andrew Tridgell, based on various existing replacement hacks in utils.c (see commit 3ee9d454).

The basic format of replace.h is:

    #ifndef HAVE_STRDUP
    #define strdup rep_strdup
    char *rep_strdup(const char *s);
    #endif

If configure fails to identify the given function X, rep_X is used in its place.  replace.h has some such declarations, but most have migrated to the system/ include directory which has loosely grouped functions by categories such as dir.h, select.h, time.h, etc.  This works around the “which header(s) do I include” problem as well as guaranteeing specific functions.

Other than reading this code for a sense of Unix-like paleontology (and it’s so hard to tell when to remove any of these helpers that cleanups are rare) we can group replacements into three categories:

  1. Helper functions or definitions which are missing, eg. strdup or S_IRWXU.
  2. “Works for me” hacks for platform limitations, which make things compile but are not general, and
  3. Outright extensions, such as #define ZERO_STRUCT(x) memset((char *)&(x), 0, sizeof(x)) or Linux kernel inspired likely()

Since it’s autoconf-based, it uses the standard #ifdef instead of #if (a potential source of bugs, as I’ve mentioned before).  I’ll concentrate on the insufficiently-general issues which can bite users of the library, and a few random asides.

  • #ifndef HAVE_VOLATILE?  I can’t believe Samba still compiles on a compiler that doesn’t support volatile (this just defines volatile away altogether).  If it did no optimizations whatsoever, volatile might not matter, but I’m suspicious…
  • typedef int bool; is a fairly common workaround for lack of bool, but since pointers implicitly cast to bool but can get truncated when passed as an int, it’s a theoretical trap. ie. (bool)0x1234567800000000 == true, (int)0x1234567800000000 == 0.
  • #if !defined(HAVE_VOLATILE) is the same test as above, repeated. It’s still as bad an idea as it was 186 lines before :)
  • ZERO_STRUCT, ZERO_ARRAY and ARRAY_SIZE are fairly sane, but could use gcc extensions to check their args where available. I implemented this for ARRAY_SIZE in the Linux kernel and in CCAN (see the sketch after this list). Making sure an arg is a struct is harder, but we could figure something out…
  • #define PATH_MAX 1024 assumes that systems which don’t define PATH_MAX probably have small path limits. If it’s too short though, it opens up buffer overruns. Similarly for NGROUPS_MAX and PASSWORD_LENGTH.
  • The dlopen replacement is cute: it uses shl_load where available (Google says HPUX), but dlerror simply looks like so:
        #ifndef HAVE_DLERROR
        char *rep_dlerror(void)
        {
                return "dynamic loading of objects not supported on this platform";
        }
        #endif

    This cute message for runtime failure allows your code to compile, but isn’t helpful if dlopen was a requirement. Also, this should use strerror for shl_load.

  • havenone.h is (I assume) a useful header for testing all the replacements at once: it undefines all HAVE_ macros. Unfortunately it hasn’t been updated, and so it isn’t complete (unused code is buggy code).
  • inet_pton is credited to Paul Vixie 1996. It’s K&R-style non-prototype, returns an int instead of bool, and doesn’t use strspn ((pch = strchr(digits, ch)) != NULL) or (better) atoi. But it checks for exactly 4 octets, numbers > 255, and carefully doesn’t write to dst unless it succeeds. I would have used sscanf(), which wouldn’t have caught too-long input like “1.2.3.4.5”. OTOH, it would catch “127...1” which this would allow. But making input checks more strict is a bad way to be popular…
  • Tridge’s opendir/readdir/telldir/seekdir/closedir replacement in repdir_getdents.c is a replacement for broken telldir/seekdir in the presence of deletions, and a workaround for (older?) BSD’s performance issues. It is in fact never used, because the configure test has had #error _donot_use_getdents_replacement_anymore in it since at least 2006 when the Samba4 changes were merged back into a common library!
  • repdir_getdirents.c is the same thing, implemented in terms of getdirents rather than getdents; it’s still used if the telldir/delete/seekdir test fails.
  • replace.c shows some of the schizophrenia of approaches to replacement: rep_ftruncate #errors if there’s no chsize or F_FREESP ioctl which can be used instead, but rep_initgroups returns -1/ENOSYS in the similar case. Best would be not to implement replacements if none can be implemented, so compile will fail if they’re used.
  • rep_pread and rep_pwrite are classic cases of the limitations of replacement libraries like this. pread is not supposed to affect the file offset, and file offsets are shared with children and dup’d fds, so there’s no sane general way to implement it; in fact tdb has to test for this in tdb_reopen_internal. I would implement a read_seek/write_seek which are documented not to have these guarantees. I remember Tridge ranting about glibc doing the same kind of unsafe implementation of pselect :)
  • snprintf only rivals qsort-with-damn-priv-pointer for pain of “if only they’d done the original function right, I wouldn’t have to reimplement the entire thing”. I’ll avoid the glibc-extracted strptime as well.
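
Back to the ARRAY_SIZE point above: the gcc-extension check looks something like this (a sketch along the lines of the Linux kernel’s __must_be_array, not the libreplace code):

    /* 0 if e is 0; a compile error (negative array size) otherwise. */
    #define BUILD_BUG_ON_ZERO(e) (sizeof(char [1 - 2 * !!(e)]) - 1)

    /* Errors out if a has the same type as a pointer to its first element,
     * i.e. if a is really a pointer rather than an array. */
    #define MUST_BE_ARRAY(a) \
        BUILD_BUG_ON_ZERO(__builtin_types_compatible_p(typeof(a), typeof(&(a)[0])))

    #define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]) + MUST_BE_ARRAY(a))

With this, ARRAY_SIZE(some_pointer) fails to compile instead of silently returning nonsense.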

I’m not sure Samba compiles on as many platforms as it used to; Perl is probably a better place for this kind of library to have maximum obscure-platform testing. But if I were to put this in CCAN, this would make an excellent start.

