On C Library Implementation

I recently came across the Judy library which seems like an impressive set of data structures to implement mappings (think variable length arrays or hash indexed by simple key).  I also looked through the codec2 sources recently, and I saw some of the same API issues, so I thought I’d crystalize some of them here:

  1. Packaging code libraries is a pain. There’s an obscene amount of clutter around the code itself. Some can be blamed on autoconf, but common practices like putting the source under src/ make it worse. In Judy there are 29 actual core code files, but 33 infrastructure files (automake et al) and AUTHORS, ChangeLog, INSTALL, COPYING and 13 README files.
  2. Typedefs which are always the same are silly. Don’t typedef long or void * or const void *, you’re confusing documentation with real typesafety. Undefined structs are your friend. If your functions take a context, make it a “struct mylib” (or union) and don’t expose the definition in the public header. Your code will be clearer and more typesafe and so will the library users’.
  3. Context creation and destruction are common patterns, so stick with “mylib_new()” and “mylib_free()” and everyone should understand what to do.  There’s plenty of bikeshedding over the names, but these are the shortest ones with clear overtones to the user.
  4. If your library needs initialization use “mylib_init()” or “mylib_setup()”.
  5. Try to make your API impossible to misuse (ie. no mylib_init() at all if you can manage it), but if that fails assert(), abort() or crash. You are not doing developers a favor by making misuse easy to ignore. If you worry about crashing on users, and can handle an API error gracefully, put that handling under #ifdef NDEBUG, like assert does. But this is not where you should be spending your development effort!
  6. If your only possible error is out-of-memory, consider not handling it at all. It’s pretty certain none of your library users will handle it (remember Tridge’s Law: untested code is broken code). If it’s convenient to return NULL, do so. Otherwise offer a way of registering an error function and let the user deal with setjmp etc. Do not complicate your API to handle this case.
  7. Always return the error. If you return a pointer, NULL is standard. Magic pointers for different errors can work too, but at that point some other method might be preferable.
  8. Think hard before implementing multiple levels of interface. It’s better to have everyone understand your interface and have 10% not use it (or hack it!) because it’s incomplete, than have 50% reinvent it because yours is too complicated.
  9. There’s a standard naming for “I know what I’m doing” low-level alternate functions: the single-underscore prefix (eg. _exit()).  That’s your one extra level of interface, and if you need more see the previous point.  Perhaps another library is in order, on top of this one? This is a standard violation: see Glen’s comment below.
  10. Don’t worry about namespace issues too much.  Avoid the standard namespace and settle on a short prefix (eg. opt_, judy_, talloc_), but there’s little you can do if your name hits someone else’s so don’t make your code a pain to use just to try to uniquify the names (ccan_rusty_russell_opt_ prefixes, for example).
  11. Your header(s) should be readable, and in a readable order.  I consider extracting comments a complete waste of time, but kerneldoc and doxgen et. al. provide a consistent model for documenting your API in the headers, and should be used as such.
  12. Don’t sweat portability too much.  Try not to paint yourself into a corner, but if you’re not going to test on other platforms leave the serious effort until someone does.
  13. I do consider use of C++ keywords in headers overly unfriendly to our C++ cousins, but in the code it’s fine; I avoid putting extern “C” in my headers, as C++ people can do that (and whatever else they need) before #including the header.  Again, if you’re not going to test it, don’t work up a sweat about it.
  14. There’s a standard naming scheme for C, and it’s all lower case with underscores as separators.  Don’t BumpyCaps.

On some level, each of these influenced CCAN as we thought “how do we encourage people to reuse this code?”   Many are specific applications of the CCAN Golden Rule (aka. “our code should not be ugly”) which drives that attempt to make code reusable.

Packaging cruft in CCAN is sidestepped entirely; since people are expected to include ccan/ in their own code, we leave that to them, with some guidelines on how to use the code:

  • You supply a “config.h” somewhere we can #include it, with various #define HAVE_FOO in it.
  • The _info file is actually a C program: the header contains (trivial-to-parse kerneldoc style) metainformation such as author, description, and example program(s).  These are aimed at human consumption: compiling and executing the code gives dependencies in a parsable form (still, only the ccan/ dependencies are the only ones really amenable to machine use).
  • The (main?) header will be the same name as the module, with “.h” appended, eg “ccan/foo/foo.h”.  The kerneldoc-style comments are the module API documentation.
  • The test/ subdir contains tests of standard forms (eg. run-* test are to be executed).
  • DEBUG can be defined for “give me extra debugging on this module, performance cost be damned”.  This is separate from normal sanity checks which would be disabled by defining NDEBUG.

In other words, you see the code first and foremost when you look in a CCAN module.

We also have a “namespacize” tool (very primitive, needs love) which is designed to rewrite the code in case of namespace clashes (currently it prepends ccan_, and scans dependent modules and fixes them up too).  This allows us to ignore namespace issues :)

The “ccanlint” tool is CCAN’s always-too-primitive checker/scorer tool, which ideally would judge a module on all of the complaints above.  It currently runs the test cases, checks their coverage and the like, but of course it’ll never be sophisticated enough to judge interfaces.  However, it does check for the existence of kerneldoc comments, and even extract and compile Example: code fragments (using some fun heuristics) to make sure they haven’t bitrotted.  And I keep adding new checks which make it more useful.

This probably explains why I stay up late writing CCAN modules: I can concentrate on writing the code (and tests!) without the other cruft.  And I don’t even need to stick to an API or ABI for my modules (api* tests indicate an API guarantee, and only three modules have those!).  And because it’s source-level, you don’t have to stretch your interface to cover all the cases; users with weird needs can simply use the code as a basis for their own.

21 replies on “On C Library Implementation”

  1. Completely agreed. There is one other situation in which you might want multiple levels of interface: if you have an upper layer which maintains large amounts of state (common in user-facing libraries, the upper layers might read in some configuration file or pull stuff over the network, ancillary to the interface’s main goal, and set heaps of defaults or something), you should provide a lower-level interface which is stateless or nearly so and takes parameters instead (if there are a lot of them, or an open-ended set, consider taking a hash, even if it means a bit more validation).

    This is killingly useful for reproducible automated testing: it means your testcases don’t need to jump about setting up the same state, but can just call the lower-level functions with fixed parameters. The only things that need to set up state are the things that test the state-readers themselves: all the other details of the interface can be tested with testcases that deal with simple stateless functions.

    1. Then it’s almost two separate libraries, but since packaging libraries is such a PITA, most people would be tempted to package it as one.

      With testing, this is why CCAN tests (except api-* tests) have to actually #include the C files. It allows such unit testing in a straightforward way, with the downside that you can’t do major tests if you have file-scope name conflicts (which is unlikely to be a good idea for the size of CCAN modules we see!).

  2. Re (9). Identifiers starting with _ are reserved by the C standard for the language and library. If you are writing an application library rather than a system library then underscore prefixes don’t belong.

    _exit() is an internal implementation detail of exit() that leaked, not an example of how an API should be written.

    As for portability, you should decide what you want. Many years ago that meant being able to be parsed by over a dozen compilers, all with their own oddities. Today you could write to the GCC compiler and claim to be portable, such is the wide availability of GCC.

    I would add one other thing, and that is to avoid deprecated libraries.

    1. > Re (9). Identifiers starting with _ are reserved by the C standard for the language and library

      You’re right, I misremembered. The kernel uses double-underscore as the “I know what I’m doing”, which is what I was thinking about here. It’s annoying, because they already have “_[A-Z_]” to play with, and _ is a nice, convenient shorthand for “reduced-exposure” functions, and has a history which predates ANSI. Sigh. I’m certainly not going to wean talloc off it. Can you think of a nice alternative?

      As to deprecated libraries, would anyone use a deprecated library when they had a replacement available? If they didn’t know it was deprecated, or there’s no adequate replacement, the advice is useless isn’t it? When I last compiled with dietlibc, it told me errx was deprecated; that hasn’t stopped me using it, because the alternative is to roll my own (which I did before I discovered err.h).

  3. “There’s a standard naming scheme for C, and it’s all lower case with underscores as separators.”

    A link to that standard would be nice.

    1. I assume you’re being sarcastic? But just in case…

      “standard” meaning “normal practice”. And also “as used by the standard”, since anyone familiar with the C standard library will be aware how prevalent this practice is (and also how weird exceptions like “FILE *” look).

      I don’t think there’s a point in the standard saying “thou shalt not use BumpyCaps”; there are so many more effective ways to write ugly code which can’t be banned even if you wanted to. But if you use the standard libraries, your code won’t be consistently BumpyCaps anyway, so why start?

  4. Hi,

    I agree with most of your comments and already implement most of them in the libssh project. However, i’ve a question on this :
    “Packaging code libraries is a pain. There’s an obscene amount of clutter around the code itself. Some can be blamed on autoconf, but common practices like putting the source under src/ make it worse. In Judy there are 29 actual core code files, but 33 infrastructure files (automake et al) and AUTHORS, ChangeLog, INSTALL, COPYING and 13 README files.”
    I understand that packaging libraries is hard. There’s a lot to make with SOVER which are never right in the upstream release, and so on. But I don’t understand why the presence of a lot of standard files (Authors, Changelog, …) and especially the use of dedicated directories (src/, include/) are a problem.
    Could you elaborate ? I personally see it as a way to structure the code and the makefiles.



    1. Hi Aris,

      I was being a bit whiny, yes. I tend to grab libraries to read the source code, and that is sometimes hard to find amongst all the packaging and metadata. It feels like our priorities are wrong.

      Wouldn’t it be refreshing to grab a “foo” library which had its header in foo.h, the C files all in that same top dir, and everything else under a subdir (metadata/ or whatever)?

      I’m not saying we should actually do this, hence “packaging libraries is a pain” rather than something stronger. For minor-but-still-useful libraries I wonder if the packaging takes more time than the actual coding. If only someone started a project to gather these minor bits of code… oh wait…


      It really depends on the library, but the smaller the library the harder it gets to find the code amongst the packaging.

      For a library like Judy which really is all about the code, its

  5. rusty, you were actually right about _.

    The set of reserved identifiers is: __ followed by lowercase, _ followed by uppercase.

    Unfortunately, this makes it very hard to define any convention for “_” because it will never work for uppercased preprocessor symbols, and for other commonly uppercase things such as enums. However, there is another usage of initial _ (rather than using it for “I know what I’m doing” functions such as “_exit”) that doesn’t have this problem.

    You can use _ to prefix internal functions whose visibility must span library source files (so they cannot be static), for example “_judy_do_something_internal”. It’s useful (or it was useful in olden times) because these symbols are public and overridable by the executable; hence you want to make sure that you don’t pollute the namespace, and that your user doesn’t fall into unexpected traps.

    The nice thing is, usually-uppercase non-public symbols (enums, macros) are not public symbols in the library. They wouldn’t pollute the global namespace, so you can leave them unprefixed and not violate the implementation-reserved namespace.

    That said, modern ELF systems can use the hidden visibility to the same end, and the compiler can also perform more optimizations if you give it visibility hints. Secondarily, libtool (if you want to use it) provides a way to hide these functions on almost all systems it supports. Despite this it remains a pretty common convention, and if I really have to provide unsafe variants, I just suffix them for example “_unsafe” or “_unlocked”.

  6. 2b and 3 are a problem for embedded libraries&applications, where dynamic memory allocation is avoided if at all possible. And contexts are never so big that they NEED to be dynamically allocated (for any reason other than the encapsulation one you listed).

    14 depends on whether you’re spending more time dealing with standard C libraries or some other API. The Win32 API tends to use BumpyCaps for functions and variables (and Hungarian notation, as well), so it actually becomes *more* natural.

  7. Hi Rusty, I wasn’t being sarcastic, but I have my reasons for asking. Thanks to a bully in the 4th grade, I can’t reach the underscore without taking my right hand off home row. It doesn’t matter if the keyboard I’m using is ergonomic or not; my right wrist simply doesn’t have the flexibility for it. Having to do this repeatedly for single identifiers (as in the Linux kernel or GTK+) causes me typos almost every single time.

    I’ve worked in a few code foundries, and encountered Tannenbaum’s maxim first-hand: “The nice thing about standards is that there are so many to choose from.” So your point #14 was a surprise to me.

    Perhaps it would be better described as “the most common convention” rather than “standard”.

  8. I’d like to nominate another one, which is: outside of system-specific typedefs, the use of “long” is almost always wrong.

    People often use long when they want an integer the size of a pointer. It isn’t. The classic example here is 64-bit Windows, where long is still 32 bits. The correct type to use in that case is ptrdiff_t (or size_t for an unsigned version).

    People sometimes use long when they want the largest integer that can be handled with relatively simple basic operations. Again, the Win64 example makes it clear that ptrdiff_t or size_t are better choices here. Systems sometimes support values larger than pointers (see the 40-bit accumulators mentioned below), but that doesn’t mean operations on them are simple. Pointer arithmetic, however, is something that should be pretty well supported on any system. If it isn’t, you have bigger problems. If you are only using very simple operations (e.g., addition, subtraction, and bitwise operations), then perhaps long long would be better. It’s even supported by MSVC now.

    People also sometimes use long when they want at least 32 bits, on the vague premise that someone might have an int smaller than that. There are such systems, and they are sometimes still relevant porting targets (though less and less modern code will run on them without active porting), but in that case one of the size-specific types would be much better, e.g., int32_t. Those are C99 and thus aren’t available everywhere, but it’s much easier to write a replacement than to try to reverse-engineer what the actual precision required for all the variables in your code is.

    My favorite behavior for long actually comes from the TI C64x compilers, where sizeof(long)==8, but the actual value only holds 40 bits. This is because long is used to access the 40-bit hardware accumulators associated with some instructions, but the values must be loaded to/stored from a pair of 32-bit registers. Such 40-bit accumulators are sensible things for certain DSP operations, but a type that holds fewer bits than its size in RAM is not what a casual C programmer would expect.

    1. Agreed, though C99’s intptr_t and uintptr_t seem to be the modern choice. I’d be happy with size_t as a euphemism for “a large well-performing unsigned integer”.

      The Linux kernel assumes sizeof(void *) == sizeof(long), and I’m as guilty as anyone of abusing it.

      I’ll just go modify some ccan packages now…

  9. Timothy, part of the problem with x86-64 is that the 64-bit extensions are just that: extensions. With 64 bits enabled, but not using the 0x40-0x4f instruction prefixes, the word size is assumed to be 32 bits, and R8-R15 don’t get used. Addresses are still 40-48 bits, depending on implementation; segmentation is mostly disabled; and some instructions affecting CPU context (i.e. PUSHA/POPA) transfer 64 bits per register; but these are determined by machine mode, not instruction prefix.

    The C types you mention, are specifically crafted to enforce the platform’s word size choice, so the programmer doesn’t have to.

  10. The interesting thing about writing code C(PC)AN style is that you can end up with so many packages that it just becomes untenable to maintain all those extra files.

    In my case, I have 220 CPAN modules. In the repository each module has a Makefile.PL file (i.e. _info.c) and a Changes file (for hand-maintained change summary for humans).

    All the other files are automatically generated by my release automation at it is moved from the repository to the distribution tarball.

    So CPAN clients and manual installers still get everything they are interested in, but humans looking at the repository aren’t distracted by all that gumph.

  11. “Packaging is a pain”

    That’s why I like FreeBSD. It’s just a useless waste of time packaging all this stuff.

    If good Unix practices, code will compile on all Unix-like systems (BTW, what usually makes FreeBSD a pain is developers making too many assumptions that everyone is “Linux”, or a particular Linux).

    That’s why we have makefiles. Packages have niceties but my experience with Linux makes me not like them. Yes, it takes more time to compile stuff. But I’ve used Debian for years until I got tired of their “packaging” (recently, the MongoDB freaked with the state of Python in Linux distros). Ubuntu broke things in between upgrades, too, and it was badly documented IMHO.

    My experience with FreeBSD has been that it’s a install-once system. Yes, things will go wrong, but you have more tools, more ways to approach the problems, more flexibility – it’s just more Unixy.

  12. When it comes to your desire to go stick the src in the base dir of a lib and “other stuff in subdirs” reality is autoconf or pretty much any make system/standard gets in the way, so it’s probably much better to go structure your source tree nicely. as the autofoo world will live in the base no matter what, just leave this to the wild antics of autosplat and stick source in src/ – even better structure this with src/bin, src/lib src/modules src/tests etc. – so you know where all the source is… its off in src/… somewhere and then even this is nicely broken up into more subdirs based on what they produce (the produce a shared library or 2… produce binaries… modules…. etc.).

    now you can have a deoc/ dir for handling documentation infra/build, and even fun bits like debian/ dirs for holding debian packaging info and so on.

    managing and navigating libraries once you have structured them nicely with some consistent directory hierarchy naming scheme becomes a breeze. when you have a whole bunch of libraries you write and maintain and release and package… it becomes a must! hell even the autosplat setup files then begin to follow a standard pattern so it’s easy to increment version, ass sonames for testing/snapshots and more.

    btw – agree on most things here, though i like to keep tests outside the library tree, but for me that’s different as tests often involve an active gui and lots of fat data files that need to be installed, so forcing a hefty multi-megabyte test suite onto people who just want the library to get on with development is not a great thing to do.

  13. I have to disagree with your suggestion to omit extern “C” {…}. Including this (wrapped in #ifdef __cplusplus of course) saves *me* time because I don’t have to respond to endless complaints from mystified C++ programmers when linking fails. Six lines of boilerplate is a worthwhile tradeoff for that benefit.

  14. @paolo

    Any file scope identifier starting with an underscore is reserved. The identifiers your referring to apply at function scope.

    A function variable named _tmp is fine. But a function _tmp() is not allowed.

  15. Indeed, as noted. Though I’ve taken to use an _ *postfix* for “implementation” functions, but it’s just not as clear :(

    In practice, you can clash with others whatever you do, so I find the concept of “reserved” a bit pretentious. See also: “_exit()”.

Comments are closed.