On C Library Implementation
I recently came across the Judy library which seems like an impressive set of data structures to implement mappings (think variable length arrays or hash indexed by simple key). I also looked through the codec2 sources recently, and I saw some of the same API issues, so I thought I'd crystalize some of them here:
- Packaging code libraries is a pain. There's an obscene amount of clutter around the code itself. Some can be blamed on autoconf, but common practices like putting the source under src/ make it worse. In Judy there are 29 actual core code files, but 33 infrastructure files (automake et al) and AUTHORS, ChangeLog, INSTALL, COPYING and 13 README files.
- Typedefs which are always the same are silly. Don't typedef long or void * or const void *, you're confusing documentation with real typesafety. Undefined structs are your friend. If your functions take a context, make it a "struct mylib" (or union) and don't expose the definition in the public header. Your code will be clearer and more typesafe and so will the library users'.
- Context creation and destruction are common patterns, so stick with "mylib_new()" and "mylib_free()" and everyone should understand what to do. There's plenty of bikeshedding over the names, but these are the shortest ones with clear overtones to the user.
- If your library needs initialization use "mylib_init()" or "mylib_setup()".
- Try to make your API impossible to misuse (ie. no mylib_init() at all if you can manage it), but if that fails assert(), abort() or crash. You are not doing developers a favor by making misuse easy to ignore. If you worry about crashing on users, and can handle an API error gracefully, put that handling under #ifdef NDEBUG, like assert does. But this is not where you should be spending your development effort!
- If your only possible error is out-of-memory, consider not handling it at all. It's pretty certain none of your library users will handle it (remember Tridge's Law: untested code is broken code). If it's convenient to return NULL, do so. Otherwise offer a way of registering an error function and let the user deal with setjmp etc. Do not complicate your API to handle this case.
- Always return the error. If you return a pointer, NULL is standard. Magic pointers for different errors can work too, but at that point some other method might be preferable.
- Think hard before implementing multiple levels of interface. It's better to have everyone understand your interface and have 10% not use it (or hack it!) because it's incomplete, than have 50% reinvent it because yours is too complicated.
- There's a standard naming for "I know what I'm doing" low-level alternate functions: the single-underscore prefix (eg. _exit()). That's your one extra level of interface, and if you need more see the previous point. Perhaps another library is in order, on top of this one? This is a standard violation: see Glen's comment below.
- Don't worry about namespace issues too much. Avoid the standard namespace and settle on a short prefix (eg. opt_, judy_, talloc_), but there's little you can do if your name hits someone else's so don't make your code a pain to use just to try to uniquify the names (ccan_rusty_russell_opt_ prefixes, for example).
- Your header(s) should be readable, and in a readable order. I consider extracting comments a complete waste of time, but kerneldoc and doxgen et. al. provide a consistent model for documenting your API in the headers, and should be used as such.
- Don't sweat portability too much. Try not to paint yourself into a corner, but if you're not going to test on other platforms leave the serious effort until someone does.
- I do consider use of C++ keywords in headers overly unfriendly to our C++ cousins, but in the code it's fine; I avoid putting extern "C" in my headers, as C++ people can do that (and whatever else they need) before #including the header. Again, if you're not going to test it, don't work up a sweat about it.
- There's a standard naming scheme for C, and it's all lower case with underscores as separators. Don't BumpyCaps.
On some level, each of these influenced CCAN as we thought "how do we encourage people to reuse this code?" Many are specific applications of the CCAN Golden Rule (aka. "our code should not be ugly") which drives that attempt to make code reusable.
Packaging cruft in CCAN is sidestepped entirely; since people are expected to include ccan/ in their own code, we leave that to them, with some guidelines on how to use the code:
- You supply a "config.h" somewhere we can #include it, with various #define HAVE_FOO in it.
- The _info file is actually a C program: the header contains (trivial-to-parse kerneldoc style) metainformation such as author, description, and example program(s). These are aimed at human consumption: compiling and executing the code gives dependencies in a parsable form (still, only the ccan/ dependencies are the only ones really amenable to machine use).
- The (main?) header will be the same name as the module, with ".h" appended, eg "ccan/foo/foo.h". The kerneldoc-style comments are the module API documentation.
- The test/ subdir contains tests of standard forms (eg. run-* test are to be executed).
- DEBUG can be defined for "give me extra debugging on this module, performance cost be damned". This is separate from normal sanity checks which would be disabled by defining NDEBUG.
In other words, you see the code first and foremost when you look in a CCAN module.
We also have a "namespacize" tool (very primitive, needs love) which is designed to rewrite the code in case of namespace clashes (currently it prepends ccan_, and scans dependent modules and fixes them up too). This allows us to ignore namespace issues :)
The "ccanlint" tool is CCAN's always-too-primitive checker/scorer tool, which ideally would judge a module on all of the complaints above. It currently runs the test cases, checks their coverage and the like, but of course it'll never be sophisticated enough to judge interfaces. However, it does check for the existence of kerneldoc comments, and even extract and compile Example: code fragments (using some fun heuristics) to make sure they haven't bitrotted. And I keep adding new checks which make it more useful.
This probably explains why I stay up late writing CCAN modules: I can concentrate on writing the code (and tests!) without the other cruft. And I don't even need to stick to an API or ABI for my modules (api* tests indicate an API guarantee, and only three modules have those!). And because it's source-level, you don't have to stretch your interface to cover all the cases; users with weird needs can simply use the code as a basis for their own.