Previously I discussed the use of IBLTs (on the pettycoin blog). Kalle and I got some interesting, but slightly different results; before I revisited them I wanted some real data to play with.
Finally, a few weeks ago I ran 4 nodes for a week, logging incoming transactions and the contents of the mempools when we saw a block. This gives us some data to chew on when tuning any fast block sync mechanism; here's my first impressions looking a the data (which is available on github).
These graphs are my first look; in blue is the number of txs in the block, and in purple stacked on top is the number of txs which were left in the mempool after we took those away.
The good news is that all four sites are very similar; there's small variance across these nodes (three are in Digital Ocean data centres and one is behind two NATs and a wireless network at my local coworking space).
The bad news is that there are spikes of very large mempools around block 352,800; a series of 731kb blocks which I'm guessing is some kind of soft limit for some mining software [EDIT: 750k is the default soft block limit; reported in 1024-byte quantities as blockchain.info does, this is 732k. Thanks sipa!]. Our ability to handle this case will depend very much on heuristics for guessing which transactions are likely candidates to be in the block at all (I'm hoping it's as simple as first-seen transactions are most likely, but I haven't tested yet).