Previously I discussed the use of IBLTs (on the pettycoin blog). Kalle and I got some interesting, but slightly different results; before I revisited them I wanted some real data to play with.
Finally, a few weeks ago I ran 4 nodes for a week, logging incoming transactions and the contents of the mempools when we saw a block. This gives us some data to chew on when tuning any fast block sync mechanism; here’s my first impressions looking a the data (which is available on github).
These graphs are my first look; in blue is the number of txs in the block, and in purple stacked on top is the number of txs which were left in the mempool after we took those away.
The good news is that all four sites are very similar; there’s small variance across these nodes (three are in Digital Ocean data centres and one is behind two NATs and a wireless network at my local coworking space).
The bad news is that there are spikes of very large mempools around block 352,800; a series of 731kb blocks which I’m guessing is some kind of soft limit for some mining software [EDIT: 750k is the default soft block limit; reported in 1024-byte quantities as blockchain.info does, this is 732k. Thanks sipa!]. Our ability to handle this case will depend very much on heuristics for guessing which transactions are likely candidates to be in the block at all (I’m hoping it’s as simple as first-seen transactions are most likely, but I haven’t tested yet).
> heuristics for guessing which transactions are likely candidates to be in the block at all
The default transaction selection behavior is defined here in CreateNewBlock: https://github.com/bitcoin/bitcoin/blob/v0.10.2/src/miner.cpp#L91
It does greedy selection, first based on coin age (up till the default block priority size of 50 kB), then by fee rate (fee per kB) for the rest of the block size.
The data I’ve studied so far seems to fit this model pretty well.
Excellent! There is certainly some variant in miners’ versions, though, and they’re not obliged to follow the default rules. Good to know a simple approximation is reasonably valid (at least for this data).
Check out this graph I just put together – it illustrates a bit the transaction selection policies of miners:
http://www.reddit.com/r/Bitcoin/comments/38272q/