Since I was creating large blocks (41662 transactions), I added a little code to time how long they take once received (on my laptop, which is only an i3).
The obvious place to look is CheckBlock: a simple 1MB block takes a consistent 10 milliseconds to validate, and an 8MB block took 79 to 80 milliseconds, which is nice and linear. (A 17MB block took 171 milliseconds).
Weirdly, that’s not the slow part: promoting the block to the best block (ActivateBestChain) takes 1.9-2.0 seconds for a 1MB block, and 15.3-15.7 seconds for an 8MB block. At least it’s scaling linearly, but it’s just slow.
So, 16 Seconds Per 8MB Block?
I did some digging. Just invalidating and revalidating the 8MB block only took 1 second, so something about receiving a fresh block makes it worse. I spent a day or so wrestling with benchmarking…
Indeed, ConnectTip does the actual script evaluation: CheckBlock() only does a cursory examination of each transaction. I’m guessing bitcoin core is not smart enough to parallelize a chain of transactions like mine, hence the 2 seconds per MB. On normal transaction patterns even my laptop should be about 4 times faster than that (but I haven’t actually tested it yet!).
So, 4 Seconds Per 8MB Block?
But things are going to get better: I hacked in the currently-disabled libsecp256k1, and the time for the 8MB ConnectTip dropped from 18.6 seconds to 6.5 seconds.
So, 1.6 Seconds Per 8MB Block?
I re-enabled optimization after my benchmarking, and the result was 4.4 seconds; that’s libsecp256k1, and an 8MB block.
Let’s Say 1.1 Seconds for an 8MB Block
This is with some assumptions about parallelism; and remember this is on my laptop which has a fairly low-end CPU. While you may not be able to run a competitive mining operation on a Raspberry Pi, you can pretty much ignore normal verification times in the blocksize debate.
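The chain of estimates in the headings above is just division, so here is a minimal sketch that reproduces it. The 4x parallelism factor is the untested guess from the text, not a measurement, and the function names are made up for illustration.

```python
# Figures quoted in the post, all measured on the i3 laptop.
PARALLEL_SPEEDUP = 4.0  # the post's untested guess for normal transaction patterns


def per_mb(seconds: float, block_mb: float) -> float:
    """Seconds per megabyte, to check that a measurement scales linearly."""
    return seconds / block_mb


def with_parallelism(seconds: float, speedup: float = PARALLEL_SPEEDUP) -> float:
    """Estimated time if script checks could be spread over the assumed speedup."""
    return seconds / speedup


if __name__ == "__main__":
    # CheckBlock: roughly 10 milliseconds per MB at each size.
    for ms, mb in ((10, 1), (79.5, 8), (171, 17)):
        print(f"CheckBlock {mb}MB: {per_mb(ms, mb):.2f} ms/MB")
    # ConnectTip for the 8MB block: libsecp256k1 unoptimized, then optimized.
    for label, secs in (("libsecp256k1, no optimization", 6.5),
                        ("libsecp256k1, optimized", 4.4)):
        print(f"{label}: {with_parallelism(secs):.2f} s with assumed parallelism")
```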
 I turned on -debug=bench, which produced impenetrable and seemingly useless results in the log.
So I added a print with a sleep, so I could run perf. Then I disabled optimization, so I’d get understandable backtraces with perf. Then I rebuilt perf (which is part of the kernel source package) because Ubuntu’s perf doesn’t demangle C++ symbols. (Are we having fun yet?) I even hacked up a small program to help run perf on just that part of bitcoind. Finally, after perf failed me (it doesn’t show 100% CPU, no idea why; I’d expect to see main in there somewhere…) I added stderr prints and ran strace on the thing to get timings.
Linux’s perf competes with early git for the title of least-friendly Linux tool. Because it’s tied to kernel versions, and the interfaces change fairly randomly, you can never figure out how to use the version you need to use (hint: always use -g).
But when it works, it’s very useful. Recently I wanted to figure out where bitcoind was spending its time processing a block; because I’m a cool kid, I didn’t use gprof, I used perf. The problem is that I only want information on that part of bitcoind. To start with, I put a sleep(30) and a big printf in the source, but that got old fast.
Thus, I wrote “perfme.c”. Compile it (requires some trivial CCAN headers) and link perfme-start and perfme-stop to the binary. By default it runs/stops perf record on its parent, but an optional pid arg can be used for other things (eg. if your program is calling it via system(), the shell will be the parent).
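The idea of attaching perf to an already-running process only around the interesting region can be sketched as below. This is not a port of perfme.c: the helper names are hypothetical, and it keeps the perf process handle in memory instead of using separate start and stop binaries. It relies only on the standard `perf record` options `-g`, `-p` and `-o`.

```python
import os
import signal
import subprocess


def perf_record_command(pid: int, output: str = "perf.data") -> list:
    """Build a `perf record` command line attached to an existing pid."""
    return ["perf", "record", "-g", "-p", str(pid), "-o", output]


def start_perf(pid=None, output: str = "perf.data") -> subprocess.Popen:
    """Start recording; defaults to our parent, as the post describes for perfme."""
    target = os.getppid() if pid is None else pid
    return subprocess.Popen(perf_record_command(target, output))


def stop_perf(proc: subprocess.Popen) -> int:
    """Ask perf to finish writing its data file (SIGINT, like Ctrl-C) and wait."""
    proc.send_signal(signal.SIGINT)
    return proc.wait()
```

Pass an explicit pid when the caller is not the process you want to profile, which is the same caveat the post gives for system().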
So I did some IBLT research (as posted to bitcoin-dev) and I lazily used SHA256 to create both the temporary 48-bit txids, and from them to create a 16-bit index offset. Each node has to produce these for every bitcoin transaction ID it knows about (ie. its entire mempool), which is normally less than 10,000 transactions, but we’d better plan for 1M given the coming blopockalypse.
For txid48, we hash an 8 byte seed with the 32-byte txid; I ignored the 8 byte seed for the moment, and measured various implementations of SHA256 hashing 32 bytes on my Intel Core i3-5010U CPU @ 2.10GHz laptop (though note we’d be hashing 8 extra bytes for IBLT): (implementation in CCAN)
Bitcoin’s SHA256: 527.7+/-0.9 nsec
Optimizing the block ending on bitcoin’s SHA256: 500.4+/-0.66 nsec
Intel’s asm rorx: 314.1+/-0.3 nsec
Intel’s asm SSE4: 337.5+/-0.5 nsec
Intel’s asm RORx-x8ms: 458.6+/-2.2 nsec
Intel’s asm AVX: 336.1+/-0.3 nsec
So, if you have 1M transactions in your mempool, expect it to take about 0.62 seconds of hashing to calculate the IBLT. This is too slow (though it’s fairly trivially parallelizable). However, we just need a universal hash, not a cryptographic one, so I benchmarked murmur3_x64_128:
Murmur3-128: 23 nsec
That’s more like 0.046 seconds of hashing, which seems like enough of a win to add a new hash to the mix.
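Here is a rough sketch of the two derived values and the timing arithmetic. The byte order, the truncation to the first bytes of the digest, and hashing the txid48 again to get the offset are all assumptions made for illustration; the post only says SHA256 was used for both. Two hashes per transaction is inferred from the quoted totals (1M x 2 x 314 nsec is about 0.63 seconds).

```python
import hashlib


def txid48(seed: bytes, txid: bytes) -> int:
    """48-bit temporary id from SHA256(seed || txid); the layout is an assumption."""
    if len(seed) != 8 or len(txid) != 32:
        raise ValueError("expected an 8-byte seed and a 32-byte txid")
    digest = hashlib.sha256(seed + txid).digest()
    return int.from_bytes(digest[:6], "little")


def index_offset16(short_id: int) -> int:
    """16-bit index offset derived from the 48-bit id (hypothetical derivation)."""
    digest = hashlib.sha256(short_id.to_bytes(6, "little")).digest()
    return int.from_bytes(digest[:2], "little")


def hashing_seconds(n_txs: int, nsec_per_hash: float, hashes_per_tx: int = 2) -> float:
    """Total hashing time for a mempool of n_txs transactions."""
    return n_txs * hashes_per_tx * nsec_per_hash * 1e-9


if __name__ == "__main__":
    print(f"SHA256 (rorx): {hashing_seconds(1_000_000, 314.1):.3f} s")
    print(f"Murmur3-128:   {hashing_seconds(1_000_000, 23):.3f} s")
```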
I like data. So when Patrick Strateman handed me a hacky patch for a new testnet with a 100MB block limit, I went to get some. I added 7 digital ocean nodes, another hacky patch to prevent sendrawtransaction from broadcasting, and a quick utility to create massive chains of transactions.
My home DSL connection is 11Mbit down, and 1Mbit up; that’s the fastest I can get here. I was CPU mining on my laptop for this test, while running tcpdump to capture network traffic for analysis. I didn’t measure the time taken to process the blocks on the receiving nodes, just the first propagation step.
1 Megabyte Block
Naively, it should take about 10 seconds to send a 1MB block up my DSL line from first packet to last. Here’s what actually happens, in seconds for each node:
The packet dump shows they’re all pretty much sprayed out simultaneously (bitcoind may do the writes in order, but the network stack interleaves them pretty well). That’s why it’s 67 seconds at best before the first node receives my block (a bit longer, since that’s when the packet left my laptop).
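To see why interleaving hurts, here is a back-of-envelope sketch comparing "spray to everyone at once" with "send to one peer at a time". It counts payload bytes only (assuming 1MB = 1,000,000 bytes and 1Mbit = 1,000,000 bits per second), so it comes out lower than the measured figures, which include protocol overhead. The function names and the peer count in the usage line are illustrative.

```python
def send_seconds(block_bytes: int, uplink_bps: float, peers: int = 1) -> float:
    """Time until the last byte leaves when all peers share the uplink equally.

    With interleaved sends every peer finishes at roughly this time.
    """
    return block_bytes * 8 * peers / uplink_bps


def serial_finish_times(block_bytes: int, uplink_bps: float, peers: int) -> list:
    """Finish time for each peer if the block is streamed to one peer at a time."""
    one = block_bytes * 8 / uplink_bps
    return [one * (i + 1) for i in range(peers)]


if __name__ == "__main__":
    print(send_seconds(1_000_000, 1_000_000))            # one peer
    print(send_seconds(1_000_000, 1_000_000, peers=7))   # everyone, interleaved
    print(serial_finish_times(1_000_000, 1_000_000, 7))  # one at a time
```

The total upload time is the same either way; what changes is how soon the first peer has a complete block it can relay.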
8 Megabyte Block
I increased my block size, and one node dropped out, so this isn’t quite the same, but the times to send to each node are about 8 times worse, as expected:
Using the rough formula of 1-exp(-t/600), I would expect orphan rates of 10.5% generating 1MB blocks, and 56.6% with 8MB blocks; that’s a huge cut in expected profits.
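The rough formula is easy to play with. This sketch takes the propagation delay in seconds and the 600-second average block interval; the inverse helper is an addition for convenience and is not in the original.

```python
import math


def orphan_rate(delay_seconds: float, block_interval: float = 600.0) -> float:
    """Chance someone else finds a block during the delay: 1 - exp(-t/600)."""
    return 1.0 - math.exp(-delay_seconds / block_interval)


def delay_for_rate(rate: float, block_interval: float = 600.0) -> float:
    """Inverse: the delay in seconds that would give this orphan rate."""
    return -block_interval * math.log(1.0 - rate)


if __name__ == "__main__":
    for t in (6.7, 67, 500):
        print(f"{t:>6} s delay -> {orphan_rate(t) * 100:.1f}% orphan rate")
```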
Get a faster DSL connection. Though even an uplink 10 times faster would mean 1.1% orphan rate with 1MB blocks, or 8% with 8MB blocks.
Only connect to a single well-connected peer (-maxconnections=1), and hope they propagate your block.
Refuse to mine any transactions, and just collect the block reward. Doesn’t help the bitcoin network at all though.
Join a large pool. This is what happens in practice, but raises a significant centralization problem.
We need bitcoind to be smarter about ratelimiting in these situations, and stream serially. Done correctly (which is hard), it could also help bufferbloat which makes running a full node at home so painful when it propagates blocks.
Some kind of block compression, along the lines of Gavin’s IBLT idea. I’ve done some preliminary work on this, and it’s promising, but far from trivial.
What happens if bitcoin blocks fill? Miners choose transactions with the highest fees, so low fee transactions get left behind. Let’s look at what makes up blocks today, to try to figure out which transactions will get “crowded out” at various thresholds.
Some assumptions need to be made here: we can’t automatically tell the difference between me taking a $1000 output and paying you 1c, and me paying you $999.99 and sending myself the 1c change. So my first attempt was very conservative: only look at transactions with two or more outputs which were under the given thresholds (I used a nice round $200 / BTC price throughout, for simplicity).
(Note: I used bitcoin-iterate to pull out transaction data, and rebuild blocks without certain transactions; you can reproduce the csv files in the blocksize-stats directory if you want).
Paying More Than 1 Person Under $1 (< 500000 Satoshi)
Here’s the result (against the current blocksize):
Let’s zoom in to the interesting part, first, since there’s very little difference before 220,000 (February 2013). You can see that only about 18% of transactions are sending less than $1 and getting less than $1 in change:
Paying Anyone Under 1c, 10c, $1
The above graph doesn’t capture the case where I have $100 and send you 1c. If we eliminate any transaction which has any output less than various thresholds, we’ll catch that. The downside is that we capture the “sending myself tiny change” case, but I’d expect that to be rarer:
This eliminates far more transactions. We can see only 2.5% of the block size is taken by transactions with 1c outputs (the dark red line following the block “current blocks” line), but the green line shows about 20% of the block used for 10c transactions. And about 45% of the block is transactions moving $1 or less.
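The two filters can be sketched as follows. This is one reading of the rules, not the actual bitcoin-iterate pipeline: the first filter requires at least two outputs and every output under the threshold, the second triggers on any output under the threshold. Thresholds use the $200/BTC price from the text.

```python
SATOSHI_PER_BTC = 100_000_000
USD_PER_BTC = 200  # the round price used throughout the post


def cents_to_satoshi(cents: int) -> int:
    """Convert a US cent amount to satoshi at the assumed price."""
    return cents * SATOSHI_PER_BTC // (USD_PER_BTC * 100)


def all_outputs_small(outputs: list, threshold_sat: int) -> bool:
    """Conservative filter: two or more outputs, every one under the threshold."""
    return len(outputs) >= 2 and all(v < threshold_sat for v in outputs)


def any_output_small(outputs: list, threshold_sat: int) -> bool:
    """Broader filter: at least one output under the threshold."""
    return any(v < threshold_sat for v in outputs)


if __name__ == "__main__":
    one_dollar = cents_to_satoshi(100)
    tx = [5_000, 49_995_000]  # 1c payment, about $99.99 change
    print(all_outputs_small(tx, one_dollar), any_output_small(tx, one_dollar))
```

The example transaction is the "I have $100 and send you 1c" case: the conservative filter misses it, the broader one catches it.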
Interpretation: Hard Landing Unlikely, But Microtransactions Lose
If the block size doesn’t increase (or doesn’t increase in time): we’ll see transactions get slower, and fees become the significant factor in whether your transaction gets processed quickly. People will change behaviour: I’m not going to spend 20c to send you 50c!
Because block finding is highly variable and many miners are capping blocks at 750k, we see backlogs at times already; these bursts will happen with increasing frequency from now on. This will put pressure on SatoshiDice and similar services, who will be highly incentivized to use StrawPay or roll their own channel mechanism for off-blockchain microtransactions.
I’d like to know what timescale this happens on, but the graph shows that we grow (and occasionally shrink) in bursts. A logarithmic graph prepared by Peter R of bitcointalk.org suggests that we hit 1M mid-2016 or so; expect fee pressure to bend that graph downwards soon.
The bad news is that even if fees hit (say) 25c and that prevents all the sub-$1 transactions, we only double our capacity, giving us perhaps another 18 months. (At that point miners are earning $1000 from transaction fees as well as $5000 (@ $200/BTC) from block reward, which is nice for them I guess.)
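For the miner income aside, the arithmetic is below. The 25 BTC block reward is the subsidy at the time of writing, which matches the $5000 figure; the implied 4000 transactions per block is just what $1000 of fees at 25c each works out to, not a number from the post.

```python
def block_reward_usd(reward_btc: float = 25, usd_per_btc: float = 200) -> float:
    """Block subsidy in dollars at the assumed price."""
    return reward_btc * usd_per_btc


def txs_for_fee_income(fee_income_usd: float, fee_usd: float) -> float:
    """Number of fee-paying transactions needed to reach a fee income."""
    return fee_income_usd / fee_usd


if __name__ == "__main__":
    print(block_reward_usd())              # block reward in dollars
    print(txs_for_fee_income(1000, 0.25))  # transactions per block at 25c each
```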
My Best Guess: Larger Blocks Desirable Within 2 Years, Needed by 3
Personally I think 5c is a reasonable transaction fee, but I’d prefer not to see it until we have decentralized off-chain alternatives. I’d be pretty uncomfortable with a 25c fee unless the Lightning Network was so ubiquitous that I only needed to pay it twice a year. Higher than that would have me reaching for my credit card to charge my Lightning Network account :)
Disclaimer: I Work For BlockStream, on Lightning Networks
Lightning Networks are a marathon, not a sprint. The development timeframes in my head are even vaguer than the guesses above. I hope it’s part of the eventual answer, but it’s not the bandaid we’re looking for. I wish it were different, but we’re going to need other things in the mean time.
I hope this provided useful facts, whatever your opinions.
I used bitcoin-iterate and gnumeric to render the current bitcoin blocksizes, and here are the results.
My First Graph: A Moment of Panic
This is block sizes up to yesterday; I’ve asked gnumeric to derive an exponential trend line from the data (in black; the red one is linear)
That trend line hits 1000000 at block 363845.5, which we’d expect in about 32 days time! This is what is freaking out so many denizens of the Bitcoin Subreddit. I also just saw a similar inaccurate [correction: misleading] graph reshared by Mike Hearn on G+ :(
But Wait A Minute
That trend line says we’re on 800k blocks today, and we’re clearly not. Let’s add a 6 hour moving average:
In fact, if we cluster into 36 blocks (ie. 6 hours worth), we can see how misleading the terrible exponential fit is:
Clearer Graphs: 1 week Moving Average
So, not time to panic just yet, though we’re clearly growing, and in unpredictable bursts.
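If you want to redo the smoothing without a spreadsheet, here is a minimal sketch of the moving average, the 36-block clustering, and a naive exponential fit. The fit is ordinary least squares on log(size), which may not be exactly what gnumeric does for its trend line. Inputs are plain lists of block heights and sizes in bytes.

```python
import math


def moving_average(values: list, window: int) -> list:
    """Trailing moving average; 36 blocks is about 6 hours, 1008 about a week."""
    out, total = [], 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]
        if i >= window - 1:
            out.append(total / window)
    return out


def cluster_means(values: list, size: int) -> list:
    """Mean of each consecutive, non-overlapping group of `size` blocks."""
    return [sum(values[i:i + size]) / size
            for i in range(0, len(values) - size + 1, size)]


def fit_exponential(heights: list, sizes: list) -> tuple:
    """Fit size = exp(a + b * height) by least squares on log(size)."""
    n = len(heights)
    ys = [math.log(s) for s in sizes]
    mx, my = sum(heights) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(heights, ys))
         / sum((x - mx) ** 2 for x in heights))
    return my - b * mx, b


def height_reaching(a: float, b: float, target_bytes: float) -> float:
    """Block height at which the fitted curve reaches target_bytes."""
    return (math.log(target_bytes) - a) / b
```

Comparing `height_reaching(a, b, 1_000_000)` against the moving average at the current height is a quick way to see how far the fitted curve sits above the real data.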
I’ve been trying not to follow the Great Blocksize Debate raging on reddit. However, the lack of any concrete numbers has kind of irked me, so let me add one for now.
If we assume bandwidth is the main problem with running nodes, let’s look at average connection growth rates since 2008. Google led me to NetMetrics (who seem to charge), and Akamai’s State Of The Internet (who don’t). So I used the latter, of course:
I tried to pick a range of countries, and here are the results:
% Growth Over 7 years
Countries which had best bandwidth grew about 17% a year, so I think that’s the best model for future growth patterns (China is now where the US was 7 years ago, for example).
If bandwidth is the main centralization concern, you’ll want block growth below 15%. That implies we could jump the cap to 3MB next year, and 15% thereafter. Or if you’re less conservative, 3.5MB next year, and 17% thereafter.
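The cap numbers are compound growth. In this sketch the 8-year span (2008 to "next year") is an inference: it is what makes 15% and 17% a year come out at roughly 3MB and 3.5MB from a 1MB start. The schedule helper is illustrative only.

```python
def total_growth(annual_rate: float, years: int) -> float:
    """Overall multiplier after compounding annual_rate for the given years."""
    return (1 + annual_rate) ** years


def cap_schedule(first_cap_mb: float, annual_rate: float, years: int) -> list:
    """Block size cap for each year, starting at first_cap_mb and compounding."""
    return [first_cap_mb * (1 + annual_rate) ** i for i in range(years)]


if __name__ == "__main__":
    print(f"15% for 8 years: {total_growth(0.15, 8):.2f}x")
    print(f"17% for 8 years: {total_growth(0.17, 8):.2f}x")
    print([round(c, 2) for c in cap_schedule(3.0, 0.15, 5)])
```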
Previously I discussed the use of IBLTs (on the pettycoin blog). Kalle and I got some interesting, but slightly different results; before I revisited them I wanted some real data to play with.
Finally, a few weeks ago I ran 4 nodes for a week, logging incoming transactions and the contents of the mempools when we saw a block. This gives us some data to chew on when tuning any fast block sync mechanism; here are my first impressions looking at the data (which is available on github).
These graphs are my first look; in blue is the number of txs in the block, and in purple stacked on top is the number of txs which were left in the mempool after we took those away.
The good news is that all four sites are very similar; there’s small variance across these nodes (three are in Digital Ocean data centres and one is behind two NATs and a wireless network at my local coworking space).
The bad news is that there are spikes of very large mempools around block 352,800; a series of 731kb blocks which I’m guessing is some kind of soft limit for some mining software [EDIT: 750k is the default soft block limit; reported in 1024-byte quantities as blockchain.info does, this is 732k. Thanks sipa!]. Our ability to handle this case will depend very much on heuristics for guessing which transactions are likely candidates to be in the block at all (I’m hoping it’s as simple as first-seen transactions are most likely, but I haven’t tested yet).
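The per-block quantities plotted in those graphs can be computed from the logs along these lines. The input format (plain collections of txids) and the field names are assumptions; the third count, block transactions a node never saw, is not in the graphs but is what any fast block sync scheme has to recover.

```python
def split_mempool(mempool_txids, block_txids) -> dict:
    """Compare one node's mempool with a newly seen block."""
    mempool, block = set(mempool_txids), set(block_txids)
    return {
        "in_block": len(block),                  # the blue part of the graph
        "left_in_mempool": len(mempool - block), # the purple part stacked on top
        "unknown_to_us": len(block - mempool),   # in the block, never seen by us
    }


def kib(n_bytes: int) -> int:
    """Size in 1024-byte units, as blockchain.info reports it."""
    return n_bytes // 1024


if __name__ == "__main__":
    print(split_mempool(["a", "b", "c"], ["b", "c", "d"]))
    print(kib(750_000))
```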
The key revelation of the paper is that we can have a network of arbitrarily complicated transactions, such that they aren’t on the blockchain (and thus are fast, cheap and extremely scalable), but at every point are ready to be dropped onto the blockchain for resolution if there’s a problem. This is genuinely revolutionary.
It also vindicates Satoshi’s insistence on the generality of the Bitcoin scripting system. And though it’s long been suggested that bitcoin would become a clearing system on which genuine microtransactions would be layered, it was unclear that we were so close to having such a system in bitcoin already.
Note that the scheme requires some solution to malleability to allow chains of transactions to be built (this is a common theme, so likely to be mitigated in a future soft fork), but Gregory Maxwell points out that it also wants selective malleability, so transactions can be replaced without invalidating the HTLCs which are spending their outputs. Thus it proposes new signature flags, which will require active debate, analysis and another soft fork.
There is much more to discover in the paper itself: recommendations for lightning network routing, the node charging model, a risk summary, the specifics of the softfork changes, and more.
I’ll leave you with a brief list of requirements to make Lightning Networks a reality:
A soft-fork is required, to protect against malleability and to allow new signature modes.
A new peer-to-peer protocol needs to be designed for the lightning network, including routing.
Blame and rating systems are needed for lightning network nodes. You don’t have to trust them, but it sucks if they go down as your money is probably stuck until the timeout.
More refinements (eg. relative OP_CHECKLOCKTIMEVERIFY) to simplify and tighten timeout times.
Wallets need to learn to use this, with UI handling of things like timeouts and fallbacks to the bitcoin network (sorry, your transaction failed, you’ll get your money back in N days).
You need to be online every 40 days to check that an old HTLC hasn’t leaked, which will require some alternate solution for occasional users (shut down channel, have some third party, etc).
A server implementation needs to be written.
That’s a lot of work! But it’s all simply engineering from here, just as bitcoin was once the paper was released. I look forward to seeing it happen (and I’m confident it will).
In Part I I described how a Poon-Dryja channel uses a single in-blockchain transaction to create off-blockchain transactions which can be safely updated by either party (as long as both agree), with fallback to publishing the latest versions to the blockchain if something goes wrong.
In Part II I described how Hashed Timelocked Contracts allow you to safely make one payment conditional upon another, so payments can be routed across untrusted parties using a series of transactions with decrementing timeout values.
Now we’ll join the two together: encapsulate Hashed Timelocked Contracts inside a channel, so they don’t have to be placed in the blockchain (unless something goes wrong).
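As a refresher on the Part II idea, here is a toy model of HTLCs along a route. It is a sketch only: real HTLCs are bitcoin scripts, SHA256 is assumed as the hash, and "time" is an abstract integer (think block height or days). The point it illustrates is that each hop closer to the sender gets a later timeout, so an intermediary who pays out downstream still has time to claim upstream.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class HTLC:
    payment_hash: bytes  # hash of the riddle's answer (the R value)
    amount: int          # in whatever unit the channel uses
    timeout: int         # after this time the payer can take the funds back

    def claim(self, preimage: bytes, now: int) -> bool:
        """Recipient gets paid only with the right preimage, before the timeout."""
        return now < self.timeout and hashlib.sha256(preimage).digest() == self.payment_hash

    def refundable(self, now: int) -> bool:
        """Payer can reclaim once the timeout has passed."""
        return now >= self.timeout


def route_htlcs(secret: bytes, amount: int, hops: int,
                final_timeout: int, step: int) -> list:
    """One HTLC per hop; index 0 is nearest the sender and has the longest timeout."""
    payment_hash = hashlib.sha256(secret).digest()
    return [HTLC(payment_hash, amount, final_timeout + step * (hops - 1 - i))
            for i in range(hops)]
```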
Revision: Why Poon-Dryja Channels Work
Here’s half of a channel setup between me and you where I’m paying you 1c: (there’s always a mirror setup between you and me, so it’s symmetrical)
The system works because after we agree on a new transaction (eg. to pay you another 1c), you revoke this by handing me your private keys to unlock that 1c output. Now if you ever released Transaction 1, I can spend both the outputs. If we want to add a new output to Transaction 1, we need to be able to make it similarly stealable.
Adding a 1c HTLC Output To Transaction 1 In The Channel
I’m going to send you 1c now via a HTLC (which means you’ll only get it if the riddle is answered; if it times out, I get the 1c back). So we replace transaction 1 with transaction 2, which has three outputs: $9.98 to me, 1c to you, and 1c to the HTLC: (once we agree on the new transactions, we invalidate transaction 1 as detailed in Part I)
Note that you supply another separate signature (sig3) for this output, so you can reveal that private key later without giving away any other output.
We modify our previous HTLC design so that revealing the sig3 private key would allow me to steal this output. We do this the same way we did for that 1c going to you: send the output via a timelocked mutually signed transaction. But there are two transaction paths in an HTLC: the got-the-riddle path and the timeout path, so we need to insert those timelocked mutually signed transactions in both of them. First let’s append a 1 day delay to the timeout path:
Similarly, we need to append a timelocked transaction on the “got the riddle solution” path, which now needs my signature as well (otherwise you could create a replacement transaction and bypass the timelocked transaction):
Remember The Other Side?
Poon-Dryja channels are symmetrical, so the full version has a matching HTLC on the other side (except with my temporary keys, so you can catch me out if I use a revoked transaction). Here’s the full diagram, just to be complete:
Closing The HTLC
When an HTLC is completed, we just update transaction 2, and don’t include the HTLC output. The funds either get added to your output (R value revealed before timeout) or my output (timeout).
Note that we can have an arbitrary number of independent HTLCs in progress at once, and open and/or close as many in each transaction update as both parties agree to.
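The bookkeeping side of all this (ignoring signatures, keys and timelocks entirely) is small. Below is a toy model of one side's view of the channel state, in cents, covering only HTLCs offered by me; the class and method names are invented for illustration. Each method corresponds to both parties agreeing on a replacement transaction.

```python
from dataclasses import dataclass, field


@dataclass
class ChannelState:
    mine: int                                  # my output, in cents
    yours: int                                 # your output, in cents
    htlcs: dict = field(default_factory=dict)  # htlc id -> amount I offered

    def total(self) -> int:
        """Funds in the channel never change, only how they are split."""
        return self.mine + self.yours + sum(self.htlcs.values())

    def add_htlc(self, htlc_id: str, amount: int) -> None:
        """New transaction with an extra HTLC output carved out of my output."""
        if amount <= 0 or amount > self.mine:
            raise ValueError("cannot fund HTLC")
        self.mine -= amount
        self.htlcs[htlc_id] = amount

    def settle(self, htlc_id: str) -> None:
        """R value revealed before the timeout: the funds move to your output."""
        self.yours += self.htlcs.pop(htlc_id)

    def timeout(self, htlc_id: str) -> None:
        """HTLC timed out: the funds return to my output."""
        self.mine += self.htlcs.pop(htlc_id)


if __name__ == "__main__":
    state = ChannelState(mine=999, yours=1)  # transaction 1: I have paid you 1c
    state.add_htlc("riddle-1", 1)            # transaction 2: $9.98, 1c, 1c HTLC
    print(state, state.total())
```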
Keys, Keys Everywhere!
Each output for a revocable transaction needs to use a separate address, so we can hand the private key to the other party. We use two disposable keys for each HTLC, and every new HTLC will change one of the other outputs (either mine, if I’m paying you, or yours if you’re paying me), so that needs a new key too. That’s 3 keys, doubled for the symmetry, to give 6 keys per HTLC.
Adam Back pointed out that we can actually implement this scheme without the private key handover, and instead sign a transaction for the other side which gives them the money immediately. This would permit more key reuse, but means we’d have to store these transactions somewhere on the off chance we needed them.
Storing just the keys is smaller, but more importantly, Section 6.2 of the paper describes using BIP 32 key hierarchies so the disposable keys are derived: after a while, you only need to store one key for all the keys the other side has given you. This is vastly more efficient than storing a transaction for every HTLC, and indicates the scale (thousands of HTLCs per second) that the authors are thinking of.
My next post will be a TL;DR summary, and some more references to the implementation details and possibilities provided by the paper.
 The new sighash types are fairly loose, and thus allow you to attach a transaction to a different parent if it uses the same output addresses. I think we could re-use the same keys in both paths if we ensure that the order of keys required is reversed for one, but we’d still need 4 keys, so it seems a bit too tricky.