Archive for August 2014
It became clear that there is much confusion over read and write; e.g. Linus thought read() was like write(), whereas I thought (prior to my last post) that write() was like read(). Both wrong…
Both Linux and v6 UNIX always returned from read() once data was available (v6 didn’t have sockets, but it had pipes). POSIX even suggests this:
The value returned may be less than nbyte if the number of bytes left in the file is less than nbyte, if the read() request was interrupted by a signal, or if the file is a pipe or FIFO or special file and has fewer than nbyte bytes immediately available for reading.
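Because read() may return early like this, a caller that needs an exact number of bytes has to loop. A minimal sketch of such a loop follows; the helper name read_full() is hypothetical and is not something the post or POSIX defines.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until len bytes have arrived,
 * end-of-file is reached, or an error occurs.
 * Returns the number of bytes read (less than len only at EOF), or -1. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    size_t done = 0;

    assert(buf != NULL);
    while (done < len) {
        ssize_t r = read(fd, (char *)buf + done, len - done);

        if (r == 0)
            break;              /* EOF: return what we have so far */
        if (r < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        done += (size_t)r;      /* short read: go round again */
    }
    return (ssize_t)done;
}
```

On a non-blocking descriptor this sketch would return -1 with errno set to EAGAIN when no data is ready, so it is only suitable for blocking descriptors as written.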
But write() is different. Presumably so simple UNIX filters didn’t have to check the return and loop (they’d just die with EPIPE anyway), write() tries hard to write all the data before returning. And that leads to a simple rule. Quoting Linus:
Sure, you can try to play games by knowing socket buffer sizes and look at pending buffers with SIOCOUTQ etc, and say “ok, I can probably do a write of size X without blocking” even on a blocking file descriptor, but it’s hacky, fragile and wrong.
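The non-hacky alternative is to put the descriptor into non-blocking mode before polling it for write. The following is a rough sketch of that pattern, not code from the thread; set_nonblock() and write_when_ready() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: add O_NONBLOCK to the descriptor's status flags. */
int set_nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: wait for POLLOUT, then attempt one write().
 * Assumes fd is already non-blocking, so write() cannot stall.
 * Returns the bytes written (possibly fewer than len), or -1 on error. */
ssize_t write_when_ready(int fd, const void *buf, size_t len)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

    assert(buf != NULL);
    for (;;) {
        ssize_t w;

        if (poll(&pfd, 1, -1) < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        w = write(fd, buf, len);
        if (w >= 0)
            return w;           /* caller must handle a short write */
        if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return -1;          /* real error, e.g. EPIPE */
        /* Not actually writable after all: poll again. */
    }
}
```

The caller still has to cope with a short count and call again for the remainder, which is exactly the bookkeeping a blocking write() would otherwise hide.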
I’m travelling, so I built an Ubuntu-compatible kernel with a printk() into select() and poll() to see who else was making this mistake on my laptop:
cups-browsed: (1262): fd 5 poll() for write without nonblock
cups-browsed: (1262): fd 6 poll() for write without nonblock
Xorg: (1377): fd 1 select() for write without nonblock
Xorg: (1377): fd 3 select() for write without nonblock
Xorg: (1377): fd 11 select() for write without nonblock
The first one is actually OK; fd 5 is an eventfd (which should never block). But the rest seem to be sockets, and thus probably bugs.
What’s worse are the Linux select() and poll() man pages:
A file descriptor is considered ready if it is possible to perform the corresponding I/O operation (e.g., read(2)) without blocking.
... those in writefds will be watched to see if a write will not block...
POLLOUT Writing now will not block.
Man page patches have been submitted…
There are numerous C async I/O libraries; tevent being the one I’m most familiar with. Yet, tevent has a very wide API, and programs using it inevitably descend into “callback hell”. So I wrote ccan/io.
The idea is that each I/O callback returns a “struct io_plan” which says what I/O to do next, and what callback to call. Examples are “io_read(buf, len, next, next_arg)” to read a fixed number of bytes, and “io_read_partial(buf, lenp, next, next_arg)” to perform a single read. You could also write your own, such as pettycoin’s “io_read_packet()”, which reads a length, then allocates and reads in the rest of the packet.
This should enable a convenient debug mode: you turn each io_read() etc. into synchronous operations and now you have a nice callchain showing what happened to a file descriptor. In practice, however, debug was painful to use and a frequent source of bugs inside ccan/io, so I never used it for debugging.
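To make the plan-returning style concrete, here is a toy model of the idea with a synchronous driver. Every name in it (toy_plan, toy_read, toy_run, read_packet and so on) is hypothetical; it is not the ccan/io API and it ignores everything that makes the real library hard, such as multiplexing many descriptors.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Toy model only: hypothetical names, NOT the ccan/io API. */
struct toy_plan;
typedef struct toy_plan (*toy_next_fn)(void *arg);

struct toy_plan {
    char *buf;          /* where to read into; NULL means "finished" */
    size_t len;         /* exact number of bytes wanted */
    toy_next_fn next;   /* callback to run once the read completes */
    void *next_arg;
};

static struct toy_plan toy_read(char *buf, size_t len,
                                toy_next_fn next, void *next_arg)
{
    struct toy_plan plan = { buf, len, next, next_arg };
    return plan;
}

static struct toy_plan toy_done(void)
{
    struct toy_plan plan = { NULL, 0, NULL, NULL };
    return plan;
}

/* Synchronous driver: carry out each plan, then ask its callback what to
 * do next. Returns 0 when a callback says it is finished, -1 on error/EOF. */
int toy_run(int fd, struct toy_plan plan)
{
    while (plan.buf != NULL) {
        size_t done = 0;

        while (done < plan.len) {
            ssize_t r = read(fd, plan.buf + done, plan.len - done);

            if (r < 0 && errno == EINTR)
                continue;
            if (r <= 0)
                return -1;
            done += (size_t)r;
        }
        assert(plan.next != NULL);
        plan = plan.next(plan.next_arg);
    }
    return 0;
}

/* Example protocol: one length byte followed by that many payload bytes. */
struct packet {
    unsigned char len;
    char body[255];
};

static struct toy_plan got_body(void *arg)
{
    (void)arg;
    return toy_done();
}

static struct toy_plan got_len(void *arg)
{
    struct packet *pkt = arg;
    return toy_read(pkt->body, pkt->len, got_body, pkt);
}

struct toy_plan read_packet(struct packet *pkt)
{
    return toy_read((char *)&pkt->len, 1, got_len, pkt);
}
```

Calling toy_run(fd, read_packet(&pkt)) reads the length, then the body, with each step chosen by the previous callback's return value. A real event loop would store the plan per connection and resume it when poll() reports the descriptor ready, instead of blocking in read().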
And I became less happy when I used it in anger for pettycoin, but at some point you’ve got to stop procrastinating and start producing code, so I left it alone.
Now I’ve revisited it. 820 insertions(+), 1042 deletions(-) and the code is significantly less hairy, and the API a little simpler. In particular, writing the normal “read-then-write” loops is still very nice, while doing full duplex I/O is possible, but more complex. Let’s see if I’m still happy once I’ve merged it into pettycoin…