Thoughts on configuration files and databases

DISCLAIMER: This site is a mirror of the original one that was once available at http://iki.fi/~tuomov/b/

Premises. If we look at the usual ways of storing program and other configuration data, the following general schemes can be found: 1) “proprietary” configuration files, both binary and text format; 2) scripting languages; 3) filesystems and databases; 4) text and binary files encoding a tree structure in some “standard” format. Schemes of the first type include all the traditional *nix configuration files that populate /etc. The second scheme overlaps the first scheme to some degree. Although there's less variety in syntax, the semantic level remains highly “proprietary”. Schemes of the third type include the Windows registry, as well as storing all settings as individual files on a normal file system, one that can efficiently store small files. Schemes of the fourth kind again bear some relationship to the third. Examples of this kind of configuration storage are unfortunately few besides XML, other formats being either not so widespread or not so standardised, or both.

I will argue for standard structural formats, but against XML in particular. Indeed, this posting could have been subtitled “XML sucks, Part 2”. My premises are these: 1) the data must be accessible by a great variety of widely available interactive tools in an easily human-understandable format. Presently these tools are the text editors: specialised editors for particular configuration file formats do not come even close to text editors in variety, and do not offer enough choice to fit everyone's tastes. 2) It should likewise be possible to automate operations on the data with little “understanding” of it, both with widely available programming libraries, and in the majority of cases, also by such “quick & dirty” shell tools as sed and grep. I do not, however, confine attention to these particular tools: although presently important, in the future some other tools may take their position. 3) The data should be stored as efficiently as possible without compromising the other premises.

The rationale for these premises is simple: 1) While an interactive program may itself provide a way to access (at least part of) its configuration data, it should not be necessary to be able to run the program to alter its configuration – which may even be what is preventing it from running. Instead of writing a new tool for every program's every configuration file, it is better to concentrate on providing a variety of tools that can suit everyone's tastes, and work with a range of programs, not all interactive. 2) When configuration files are understood in a wide sense, there are often cases when being able to automate operations speeds up the task a lot. Most of these tasks are “ad hoc” in nature, and a seddable and greppable format suffices. However, for “production” quality automation, it should be possible to obtain a more abstract presentation of the data without wasting effort on writing parsers or interpreters. 3) It should be self-evident that it is mindless to waste resources if nothing is gained by it. I will elaborate below on the effect of these premises on the different schemes mentioned above.

Proprietary formats. The premises above clearly exclude proprietary binary formats. But they also exclude traditional *nix configuration files. It is true, owing to their line-basedness, that these files are often easily processed with the aforementioned sed, grep, and various other shell tools. However, they do not possess a standard structure – and can not possess a rich one – that could be understood by higher-level general-purpose tools. These files also often leave a lot to be desired on the part of human-understandability: the most extreme example is perhaps given by sendmail, but there are other quite bad examples. Some of these files are, in fact, shell scripts in disguise, and thus included in the next scheme.

Scripting languages. Scripting language based configuration storage is very flexible in certain senses: repeated tasks in the configuration file can be automated, configuration can be fetched from other sources without the program specifically supporting this, and redundant work isn't needed in providing both scripting support and configuration files, and their documentation.

Precisely these features, however, make these files very inflexible in other senses: the halting problem is our adversary here. It is theoretically impossible for programs to understand the configuration data in every file with code in a Turing-complete language. We can, of course, always try to push the unexplored, endless frontiers of the halting problem further and further out. But such hacks are not without significant cost, and it would be easier to simply limit the expressive power of the configuration language. Especially as the syntax of most programming languages is complex and strict enough to be daunting to both quick&dirty tools and humans, and not only novices. Lisp/Scheme are somewhat of an exception: the syntax is very simple. Unfortunately it remains potentially baffling, being even too simple and demanding strict placement of Loads of Infuriating Silly Parentheses.

That said, there are very good reasons for providing scripting support in programs. This in turn creates the temptation to simply use the same scripting language for configuration, to avoid redundant work. I have done that myself, having switched from a proprietary configuration file format to Lua for the configuration of Ion. I have a wonderful excuse, already mentioned above: the effort of providing both scripting support and simpler configuration files, including documentation for both. Perhaps if there were more good tools for a good widespread configuration file format, there would be a higher willingness to do that extra work. Even better would be if there were good tools to take away the extra work. I will return to this topic.

File system storage. Configuration data stored on a file system, or a database mountable as a file system, of course offers structural access even at the shell level, so in that sense this scheme is ideal. However, there presently aren't very good tools for a human-understandable “full” view of a program's configuration stored in this manner. In particular, the documentation that can be embedded in a configuration file would be missing, and it should be the job of the configuration editor to display that.

An interesting possibility is provided by filesystems that support sub-files, i.e. nodes of the file system working both as directories and files at the same time. In that case, the documentation of an option could be conveniently available in the doc sub-file of the configuration option. Yet more interesting possibilities are provided by non-hierarchical file systems, where the documentation for an option could easily be accessed in multiple locations: /program/doc/config/option, /config/program/option/doc, /doc/program/config/option, and so on. Therefore the documentation would be accessible both in a documentation “tree” (there are no trees on a ‘setfs’, but the idiom shall suffice for now) as well as be conveniently found within the configuration “tree”, when editing the configuration through the file system with basic shell tools. (All these thrilling possibilities offered by advanced file systems remind me of Plan K from Kludgespace: the effort of trying to provide in Linux, with FUSE and other kludges, a poor approximation of the neat file system based design of Plan 9 – the logical next step from one of the fundamental principles of Unix.)

In any case, presently file systems aren't quite up to the task of storing individual variables in files. Just as important as the lack of convenient editing tools (or sub-files for shell tools), is the fact that most file systems need a lot of space even for small files. (Yes, I am aware of ReiserFS.) Therefore we turn to the next related scheme, where we have crammed all the variables (and hopefully some documentation) in a single file on the file system, still retaining a standard structure. This structure could, in fact, in principle be displayed as a file system by, for example, a FUSE module.
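
As a rough sketch of that last idea, assume the single file has already been loaded into nested Python dictionaries. The function name `as_paths` and the sample data are illustrative only; this is not a FUSE interface, merely the listing such a module might expose:

```python
def as_paths(tree, prefix=""):
    """Yield (path, value) pairs, one per leaf, the way a file system view might list them."""
    for name, value in tree.items():
        path = prefix + "/" + name
        if isinstance(value, dict):
            # A nested block plays the role of a directory.
            yield from as_paths(value, path)
        else:
            # A leaf plays the role of a small file holding one value.
            yield path, value


# Hypothetical configuration tree, modelled on a mount entry.
config = {"mount": {"device": "/dev/hda1", "point": "/", "type": "ext3"}}

for path, value in as_paths(config):
    print(path, "=", value)
```

Each printed line corresponds to what would be one tiny file on a real file system, which is exactly where the space overhead mentioned above comes from.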

Standard structural files. As already noted, perhaps the only real example in this category of standard structural configuration files is XML. But XML syntax clearly violates all of the premises I have set: It is not human-readable. It is not efficient. There is no variety of XML editors that would hide the awful syntax and provide usable access to the meat encased in layers of fat. Yes, XML structure can be parsed by a (complex) parsing library that does not much understand the actual contents of the file, and that is a good aspect. (Although programming language constructs have, in fact, been attempted in XML – what hasn't – with very poor results.) But the syntax does not lend itself to sedding and grepping for quick&dirty tasks, and I am not aware of widely available, simple, shell tools for processing XML. I would, however, welcome a structural shell along with structural sed and grep. But you don't need – or want – the inefficient XML for that.

So why have an awful and terribly inefficient syntax? Why not have a nice syntax, if you lose nothing by doing so? XML syntax is so inefficient and so unreadable, that nothing would be lost by simply storing the structure in a standard binary format that would be much faster for programs to load, and waste much less bandwidth or processing power (hence energy and eventually nature) for transmission or compression. Yes, you mindless XML fanatics, you who were born with silver bullets up your asses: you can have standard binary formats containing just the same data as XML files. XML is simply a terribly inefficient encoding of a tree structure. (The contents of a tag are essentially just another attribute with a special reserved name, and a structural value.) Indeed, given a variety of structural tools, including editors, sed, and grep, there would be no need to inefficiently encode data in a textual format, as everything would be able to understand the standard binary structure format just as they understand the ASCII values of characters.
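
The parenthetical remark about tag contents can be illustrated with a small sketch using Python's standard `xml.etree.ElementTree`. The reserved name `#content`, the function `to_tree`, and the `fs` attribute in the sample are assumptions made up for this illustration; mixed text-and-tag content is ignored for brevity:

```python
import xml.etree.ElementTree as ET

CONTENT = "#content"  # hypothetical reserved attribute name for a tag's contents


def to_tree(element):
    """Turn an XML element into (tag, attributes), with the contents as one more attribute."""
    attrs = dict(element.attrib)
    children = [to_tree(child) for child in element]
    if children:
        attrs[CONTENT] = children  # structural value: a list of sub-nodes
    elif element.text and element.text.strip():
        attrs[CONTENT] = element.text.strip()  # plain value
    return (element.tag, attrs)


doc = ET.fromstring(
    '<mount fs="ext3"><device>/dev/hda1</device><point>/</point></mount>'
)
print(to_tree(doc))
```

Once the data is in this form, nothing in it depends on angle brackets any more; the same nodes could be written out in a binary encoding or in any of the textual syntaxes discussed next.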

Yes, you can also encode the information in an XML file in as many different formats of text files as you can imagine. Whereas there unfortunately doesn't appear to be a widespread standardised alternative to XML syntax, at least three styles of syntax in multiple variants are widely in use, all better than the XML syntax: firstly there's the S-expression syntax, that unfortunately usually comes combined with a Turing-complete programming language, the aforementioned Lisp or Scheme. Then there are various braced syntaxes (including the one of CSS), and .INI. The last style, while in many senses my favourite, unfortunately does in most of its variants appear to have some limitations that I will come back to. There's also YAML, which actually seems to have some momentum behind it, but I find it too complex (and monoculturist).

As an example, let us have a look at all these syntaxes, along with a traditional *nix configuration file, for the task of configuring the mount point and mounting options for a device. That is, let us look at the question of converting /etc/fstab to another format. Of course, the rendition below is only one possibility of my arbitrary choosing for each of the formats, but should in any case provide a ‘feel’ for each of the formats. I will also not produce the full file (which at least in case of XML would normally have still more tags in it for describing the file), but only a fragment.

The “proprietary” original format:

# <file system> <mnt.pnt> <type>  <options>              <dump> <pass>
/dev/hda1       /         ext3    defaults,errors=remount-ro 0      1

XML:

<mount>
    <device>/dev/hda1</device>
    <point>/</point>
    <type>ext3</type>
    <errors>remount-ro</errors>
    <dump>0</dump>
    <pass>1</pass>
</mount>

S-expression:

(mount
    (device '/dev/hda1)
    (point '/)
    (type 'ext3)
    (errors 'remount-ro)
    (dump 0)
    (pass 1)
)

Braced:

mount{
    device = "/dev/hda1",
    point = "/",
    type = "ext3",
    errors = "remount-ro",
    dump = 0,
    pass = 1,
}

.INI:

[mount]
device = /dev/hda1
point = /
type = ext3
errors = remount-ro
dump = 0
pass = 1

YAML:
```
mounts:
  - device: /dev/hda1
    point: /
    type: ext3
    errors: remount-ro
    dump: 0
    pass: 1
```
(Unless we only allow mount point specifications in the file, it seems we need an explicit list of them all, and the dash begins an item on it.)

As you can see, the original is by far the “lightest” of the formats, but without the comment describing each (poorly separated) field, provides little help in understanding the contents. Next comes the .INI and on the surface the YAML example. The latter is actually more complex, however, thanks to the explicit dashed list. Since YAML reserves a lot of character sequences for its own purposes, the complexity can in reality be even higher than this example would indicate. I do not fancy dodging a zillion “occasionally reserved” characters in unexpected places, as is the case in most “ASCII markup” languages (including YAML, ReST) that try to have a special character for everything (markdown apparently being the sole case with at least a bit of moderation). I would not mind a simpler colon-separated and indented format, however.
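
To make the conversion from the original format concrete, here is a minimal sketch that names the six whitespace-separated fields of an fstab line and prints them as an .INI block. The function name `fstab_to_ini` is made up, and keeping bare flags such as `defaults` in a single `options` field is a choice of this sketch (the renditions above simply leave that flag out):

```python
FIELDS = ("device", "point", "type", "options", "dump", "pass")


def fstab_to_ini(line):
    """Render one six-field fstab line as an .INI-style [mount] block."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    entry = dict(zip(FIELDS, values))
    out = ["[mount]"]
    for name in FIELDS:
        if name != "options":
            out.append(f"{name} = {entry[name]}")
            continue
        flags = []
        for option in entry["options"].split(","):
            key, sep, value = option.partition("=")
            if sep:
                out.append(f"{key} = {value}")  # key=value option gets its own field
            else:
                flags.append(option)  # bare flag, e.g. "defaults"
        if flags:
            out.append("options = " + ",".join(flags))
    return "\n".join(out)


print(fstab_to_ini("/dev/hda1 / ext3 defaults,errors=remount-ro 0 1"))
```

The field names are the only knowledge about fstab that the sketch contains, which is precisely the information the original format leaves to a comment line.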

Back in the syntax lightness comparison, XML is the clear loser by a wide margin. There's a simple reason for this: XML syntax is based on the SGML (HTML) syntax, a syntax designed for marking up or tagging segments within a text document or other data. The syntax has been designed for situations where most of the content is data, although even there a TeX-style syntax can be far superior and less verbose. In configuration files, by contrast, a far greater proportion of the information is in the structure and attributes. Hence a markup syntax becomes far too heavy.

The difference between the variant of the braced format featured here and S-expressions isn't big, but there are less verbose variants of the braced format that wouldn't necessarily need the quotation marks. The variant featured here could in fact be parsed by Lua, and as I mentioned, programming language syntaxes tend to require syntactical elements not necessarily needed in configuration files, resulting in a strict and hence non-robust format. It is precisely the robustness that is great about .INI. Notice how you don't need to worry about terminating the mount block several lines after it begins, and how you don't need to separate the fields under it with commas. There's less for the novice user to worry about – to be afraid of – and hence less need for separate tools for editing the configuration file. But even for the advanced user it is more convenient to not have to worry about syntax errors.

As for the automatic processing premises set at the beginning of this posting, .INI and YAML, as well as S-expressions and the braced format, when taken as syntaxes for describing trees and removed from the context and extra syntax of the aforementioned programming languages, all have a standard(isable) structure that could be “understood” by various tools. However, only .INI, being largely line-based, can with relative ease be processed with existing and widely available quick&dirty tools. (The other formats do not enforce line-basedness so much.)
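
Both kinds of processing can be sketched on the .INI example from above. The first part uses Python's standard `configparser` module as one example of a general-purpose library; the second is a grep-like pass that relies on nothing but the line structure. The variable names are illustrative:

```python
import configparser

INI_TEXT = """\
[mount]
device = /dev/hda1
point = /
type = ext3
errors = remount-ro
dump = 0
pass = 1
"""

# Library access: the parser knows nothing about mounting,
# it only sees blocks and fields.
parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)
for block in parser.sections():
    for key, value in parser[block].items():
        print(f"{block}.{key} = {value}")

# Quick & dirty access: a grep-like filter over the lines.
matches = [line for line in INI_TEXT.splitlines() if line.startswith("device")]
print(matches)
```

Note that `configparser` implements one particular .INI dialect; other variants of the format differ in details such as comment characters and duplicate handling.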

.INI extensions. The basic .INI syntax is, however, rather limited, and would for some uses need to be extended – or semantics standardised – to support configuration “trees” deeper than one level, to support multiple blocks with the same name, and to be able to include fields with values possibly spanning multiple lines. For multiple levels it suffices, for example, for the block [foo.bar] to be specified to refer to a previous [foo] block, and to be a sub-block of that instead of a completely new one. Likewise, at the configuration file loading library level, I prefer that each [block] statement begins a new completely separate block, instead of appending to an existing one with the same name, as some variants of the format presently do. Other behaviours can be supported at a higher semantic level (and possibly described, or the number of blocks with the same name restricted, in something like XML DTDs).
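
A minimal sketch of a loader with these semantics might look as follows. The function name `parse_blocks` and the dictionary layout are assumptions of this sketch, as is the decision to treat a sub-block with an unknown parent as a top-level block:

```python
def parse_blocks(text):
    """Parse .INI text where every [name] starts a new block and
    [parent.child] attaches to the most recent block named [parent]."""
    top = []      # top-level blocks, in file order
    latest = {}   # full dotted name -> most recently opened block
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()
            parent_name, _, short = name.rpartition(".")
            current = {"name": short, "fields": {}, "children": []}
            parent = latest.get(parent_name) if parent_name else None
            # Unknown parent: treated as top-level here; a real loader
            # might prefer to report an error instead.
            (parent["children"] if parent else top).append(current)
            latest[name] = current
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current["fields"][key.strip()] = value.strip()
    return top


SAMPLE = """\
[mount]
device = /dev/hda1
[mount.options]
errors = remount-ro
[mount]
device = /dev/hda2
"""

for block in parse_blocks(SAMPLE):
    print(block["name"], block["fields"], [c["name"] for c in block["children"]])
```

The second [mount] starts a separate block rather than overwriting the first, and [mount.options] belongs to whichever [mount] most recently preceded it.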

Multi-line values for settings also demand extended syntax. One could, of course, settle for quotes, but I think there's a nicer alternative, more in line with the robustness of .INI. This choice of syntax is reminiscent of the so-called literate programming style, where the documentation of the source code forms the major part of the source file, with intermittent segments of source code. (Speaking of this and configuration documents, one possible approach for configuration files would be literate configuration in the obvious analogous style.) In applying this syntax to the embedding of scripts, such as key binding callbacks, in the configuration file, one might thus likewise speak of “configuration programming style”. The syntax I have in mind is to prefix each line containing data/code with a particular symbol (e.g. >), just like comment lines are prefixed with one (often #, although ; is seen in .INIs):

[binding]
key = A
callback = if something
         >     then do_something
         >     else do_something_else

I seem to prefer the indentation, but in an .INI-style syntax it should not be necessary. If it were, one could even do without the markers. Alternatively, it might be a bit cleaner to only allow data segments as complete blocks, so that blocks contain either only data or only variable settings:

[binding]
key = A

[binding.callback]
> if something
>     then do_something
>     else do_something_else

See how the syntax still remains quite robust in both cases: there's no need to close the code/data segment, and a missing symbol will only result in one uninterpretable or misinterpreted line, unless the line contains a block marker. Note that the callback code is not parsed by the configuration file parser. Instead, all the configuration file structure is contained in the beginnings of the lines, and everything after a special marker symbol, such as the equality and inequality signs, up to the end of the line, is taken to be data. Of course, when there's a lot of code or other data, the advantages of this kind of syntax over XML and other (better) tag-based formats gradually start to disappear.
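
A small sketch of a parser handling both variants is given below. The function name `parse_with_data` and the returned layout are assumptions of the sketch, as is the rule that exactly one space after the marker is treated as a separator and dropped:

```python
def parse_with_data(text, marker=">"):
    """Parse .INI text extended with marker-prefixed data lines.

    A data line continues the most recent field of the current block, or,
    if the block has no fields yet, is collected as the block's own data.
    """
    blocks = []
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            blocks.append({"name": line[1:-1], "fields": {}, "data": []})
            last_key = None
        elif not blocks:
            continue  # nothing to attach to before the first block
        elif line.startswith(marker):
            # Everything after the marker is opaque data, never parsed here.
            data = line[len(marker):]
            if data.startswith(" "):
                data = data[1:]
            if last_key is None:
                blocks[-1]["data"].append(data)
            else:
                blocks[-1]["fields"][last_key] += "\n" + data
        elif "=" in line:
            key, _, value = line.partition("=")
            last_key = key.strip()
            blocks[-1]["fields"][last_key] = value.strip()
    return blocks


INLINE = """\
[binding]
key = A
callback = if something
         >     then do_something
         >     else do_something_else
"""

print(parse_with_data(INLINE)[0]["fields"]["callback"])
```

Because the marker test comes before the test for an equality sign, a data line such as `> x = 1` stays data; only a line that has lost its marker can be mistaken for a field.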

Stay tuned. There is still other unfinished business with regard to scripting, namely the redundancy between scripting support and configuration files, when the configuration files are not simply scripts. That, however, I leave for another posting, given the relation to other issues I wish to discuss, and drifting away from our original subject of the merits and failures of different forms of configuration files. Here it should suffice to simply remark that the extra effort from the redundancy seems to be removable to a great extent by suitable tools – largely non-existent – and these are to be discussed.
