fcntl lock starvation and TDB

The Trivial DataBase (ccan variant here) uses fcntl locks for consistency: records are chained off a fixed-size hash table (or the free list), and a 1-byte fcntl lock at the offset of the chain head protects all records in that chain.

There's also a tdb_lockall() function which grabs a lock across all the hash chains at once to let you do arbitrary atomic munging. This works because fcntl() locks have an offset and length: you can lock arbitrary byte ranges.

Unfortunately, tdb_lockall() is subject to starvation, at least under Linux. This is because the kernel merely checks whether a lock is available and gets it if it can, rather than queuing behind someone else who wants a superset of the lock. So single byte lockers come in and out while the whole-file locker waits for them all to go away.

I'm not sure this is wrong, and as it's simpler than the alternative, I'm not prepared to change it just yet. Fortunately, there's a way of avoiding this starvation in userspace, which occurred independently to both Volker Lendecke and me. I called this variant tdb_lockall_gradual(), in which we try to lock each chain one at a time so we compete with the single-chain lockers on fair terms. My first naive thought was to try to lock all the chains one at a time in order, nonblocking, then go back and retry (blocking) any we failed to get. This is slow, and can deadlock against another process doing the same thing. Volker's suggestion was much nicer: we do a non-blocking lock, and if that fails we divide and conquer. If we get down to a single record, we do a blocking lock.

I wrote a test program which fires off N children, each of which grabs a random chain lock for 50-150 milliseconds before sleeping for 1 second, then repeating. The parent waits for a second, then tries to do a tdb_lockall() or tdb_lockall_gradual() depending on the commandline. Running it five times and showing the average wait time for the lockall gives this:

Now, regarding performance. While there can be 10000 hash chains, this isn't as bad as it sounds. The fast case is the uncontended one, and that's as fast as before, and the contended case is already slow. I annotated the source to print out how many blocking/nonblocking locks it's actually doing. Inevitably, if there's contention, it will end up dividing down to a blocking lock, so log(numchains) locks before doing a blocking lock.

Processes	Blocking locks	Nonblocking locks	Seconds
5	0-2	1-27	0.03
50	8-12	93-111	0.20
500	13-21	130-170	0.29
5000	309-347	1660-1832	9.1

Sure, that's a lot of locking when we're competing with 5000 processes, but it's less the naive one per chain, and it's clear that it's not the cause of the slowdown (we're doing fewer locks per second than the 5 processes case).

And anyway, locking the entire database cannot be a speed-critical operation. Indeed, after the evidence here, I followed Volker's suggestion to simply replace tdb_lockall() with the new implementation.