DISCLAIMER: This site is a mirror of original one that was once available at http://iki.fi/~tuomov/b/
Premises.
If we look at the usual ways of storing program and other
configuration data, the following general schemes can be found:
1) “proprietary” configuration files, both binary and text format;
2) scripting languages;
3) filesystems and databases;
4) text and binary files encoding a tree structure in some
“standard” format.
Schemes of the first type include all the traditional *nix configuration
files that populate /etc. The second scheme overlaps the first scheme
to some degree. Although there's less variety in syntax, the semantic
level remains highly “proprietary”.
Schemes of the third type include the Windows registry, as well as storing
all settings as individual files on a normal file system, one that can
efficiently store small files. Schemes of the fourth kind again bear some
relationship to the third. Examples of this kind of configuration storage
are unfortunately few besides XML, other formats either not being so
widespread or so standardised, or both.
I will argue for standard structural formats, but against XML in
particular. Indeed, this posting could have been subtitled “XML sucks,
Part 2”. My premises are these:
1) The data must be accessible by a great variety of widely available
interactive tools in an easily human-understandable format. Presently these
tools are the text editors: specialised editors for particular
configuration file formats do not come even close to text editors in
variety, and do not offer enough choice to fit everyone's tastes.
2) It should likewise be possible to automate operations on the data with
little “understanding” of it, both with widely available programming
libraries, and in the majority of cases, also by such “quick & dirty” shell
tools as sed and grep. I do not, however, confine attention to these
particular tools: although presently important, in the future some other
tools may take their position.
3) The data should be stored as efficiently as possible without compromising
the other premises.
The rationale for these premises is simple: 1) While an interactive program
may itself provide a way to access (at least part of) its configuration
data, it should not be necessary to be able to run it to alter its
configuration, which may itself even be what prevents the program from
running. Instead of writing a
new tool for every program's every configuration file, it is better to
concentrate on providing a variety of tools that can suit everyone's
tastes, and work with a range of programs, not all interactive.
2) When configuration files are understood in
a wide sense, there are often cases when being able to automate
operations speeds up the task a lot. Most of these tasks are “ad hoc” in
nature, and a seddable and greppable format suffices. However, for
“production” quality automation, it should be possible to obtain a more
abstract presentation of the data without wasting effort on writing parsers
or interpreters. 3) It should be self-evident that it is mindless to
waste resources if nothing is gained by it.
I will elaborate below on the effect of these premises on the
different schemes mentioned above.
Proprietary formats.
The premises above clearly exclude proprietary binary formats. But they also
exclude traditional *nix configuration files. It is true, owing to their
line-basedness, that these files are often easily processed with the
aforementioned sed, grep, and various other shell tools. However,
they do not possess a standard structure – and cannot possess a rich
one – that could be understood by higher-level general-purpose tools.
These files also often leave a lot to be desired on the part
of human-understandability: the most extreme example is perhaps given
by sendmail, but there are other quite bad examples. Some
of these files are, in fact, shell scripts in disguise, and thus
included in the next scheme.
Scripting languages.
Scripting language based configuration storage is very flexible in certain senses: repeated tasks in the configuration file can be automated, configuration can be fetched from other sources without the program specifically supporting this, and redundant work isn't needed in providing both scripting support and configuration files, and their documentation.
These very features, however, make these files very inflexible in other senses: the halting problem is our adversary here. It is theoretically impossible for programs to understand the configuration data in every file with code in a Turing-complete language. We can, of course, always try to push the unexplored, endless frontiers of the halting problem further and further out. But such hacks are not without significant cost, and it would be easier to simply limit the expressive power of the configuration language. This is especially so as the syntax of most programming languages is complex and strict enough to be daunting to quick&dirty tools and humans alike, and not only to novices. Lisp/Scheme are somewhat of an exception: the syntax is very simple. Unfortunately it remains potentially baffling, being even too simple and demanding strict placement of Loads of Infuriating Silly Parentheses.
That said, there are very good reasons for providing scripting support in programs. This in turn creates the temptation to simply use the same scripting language for configuration, to avoid redundant work. I have done that myself too, having switched from a proprietary configuration file format to Lua for the configuration of Ion. I have a wonderful excuse, already mentioned above: the effort of providing both scripting support and simpler configuration files, including documentation for both. Perhaps if there were more good tools for a good widespread configuration file format, there would be a higher willingness to do that extra work. Even better would be if there were good tools to take away the extra work. I will return to this topic.
File system storage.
Configuration data stored on a file system, or a database mountable as a file system, of course offers structural access even at the shell level, so in that sense this scheme is ideal. However, there presently aren't very good tools for a human-understandable “full” view of a program's configuration stored in this manner. In particular, the documentation that can be embedded in a configuration file would be missing, and it should be the job of the configuration editor to display that.
An interesting possibility is provided by filesystems that support
sub-files, i.e. nodes of the file system working both as directories
and files at the same time. In that case, the documentation of an
option could be conveniently available in the doc sub-file of the
configuration option. Yet more interesting possibilities are provided
by non-hierarchical file systems, where the documentation
for an option could easily be accessed in multiple locations:
/program/doc/config/option, /config/program/option/doc,
/doc/program/config/option, and so on. Therefore the documentation
would be accessible both in a documentation “tree” (there are no trees
on a ‘setfs’, but the idiom shall suffice for now) as well as
be conveniently found within the configuration “tree”, when editing
the configuration through the file system with basic shell tools.
(All these thrilling possibilities offered by advanced file systems
remind me of Plan K from Kludgespace: the effort of trying to
provide in Linux, with FUSE and other kludges, a poor
approximation of the neat file system based design of Plan 9
– the logical next step from one of the fundamental principles of Unix.)
In any case, presently file systems aren't quite up to the task of storing individual variables in files. Just as important as the lack of convenient editing tools (or sub-files for shell tools), is the fact that most file systems need a lot of space even for small files. (Yes, I am aware of ReiserFS.) Therefore we turn to the next related scheme, where we have crammed all the variables (and hopefully some documentation) in a single file on the file system, still retaining a standard structure. This structure could, in fact, in principle be displayed as a file system by, for example, a FUSE module.
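As a rough illustration of how a single structured file could be presented as a file system, here is a minimal Python sketch that flattens a nested configuration tree into path/value pairs. The function name flatten and the sample data are hypothetical, and this says nothing about how an actual FUSE module would be written; it only shows the mapping from blocks to directories and from settings to small files.

```python
def flatten(tree, prefix=""):
    """Yield (path, value) pairs, one per leaf, as a file system view might list them."""
    for name, value in tree.items():
        path = f"{prefix}/{name}"
        if isinstance(value, dict):
            # A nested block plays the role of a directory.
            yield from flatten(value, path)
        else:
            # A leaf setting plays the role of a small file.
            yield path, value


# Hypothetical sample data, loosely based on the fstab example in this posting.
config = {"mount": {"device": "/dev/hda1", "point": "/", "pass": 1}}
paths = dict(flatten(config))
```

Each key of paths is then a path such as /mount/device, and each value is what the corresponding file would contain.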
Standard structural files.
As already noted, perhaps the only real example in this category
of standard structural configuration files, is XML. But XML syntax
clearly violates all of the premises I have set: It is not human-readable.
It is not efficient. There is no variety of XML editors that would
hide the awful syntax, providing in a usable manner access to
the meat encased in layers of fat. Yes, XML structure can be parsed
by a (complex) parsing library that does not much understand the
actual contents of the file, and that is a good aspect. (Although,
programming language constructs have, in fact, been attempted in XML
– what wouldn't – with very poor results.) But the
syntax does not lend itself to sedding and grepping for
quick&dirty tasks, and I am not aware of widely available, simple,
shell tools for processing XML. I would, however, welcome
a structural shell along with structural sed and grep. But you
don't need – or want – the inefficient XML for that.
So why have an awful and terribly inefficient syntax? Why not have
a nice syntax, if you lose nothing by doing so? XML syntax is so
inefficient and so unreadable, that nothing would be lost by
simply storing the structure in a standard binary format that
would be much faster for programs to load, and waste much less
bandwidth or processing power (hence energy and eventually nature)
for transmission or compression. Yes, you mindless XML fanatics,
you who were born with silver bullets up your asses: you can have
standard binary formats containing just the same data as XML files.
XML is simply a terribly inefficient encoding of a tree structure.
(The contents of a tag are essentially just another attribute
with a special reserved name, and a structural value.) Indeed,
given a variety of structural tools, including editors, sed,
and grep, there would be no need to inefficiently encode data
in a textual format, as everything would be able to understand
the standard binary structure format just as they understand the
ASCII values of characters.
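The remark that the contents of a tag are just another attribute with a reserved name can be sketched in Python with the standard xml.etree.ElementTree module. The function name element_to_tree and the reserved key "contents" are arbitrary choices made for this illustration, not part of any standard.

```python
import xml.etree.ElementTree as ET


def element_to_tree(elem):
    """Turn an element into a plain dict: its attributes plus a reserved 'contents' entry."""
    node = dict(elem.attrib)
    children = [(child.tag, element_to_tree(child)) for child in elem]
    text = (elem.text or "").strip()
    # The contents are one more attribute under a reserved name:
    # a list of child nodes if there are any, otherwise the text.
    node["contents"] = children if children else text
    return node


xml_tree = element_to_tree(ET.fromstring(
    "<mount><device>/dev/hda1</device><pass>1</pass></mount>"
))
```

The resulting structure is an ordinary tree of names and values that could just as well be written out in a binary format or in any of the lighter text syntaxes.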
Yes, you can also encode the information in an XML file in as many different formats of text files as you can imagine. Whereas there unfortunately doesn't appear to be a widespread standardised alternative to XML syntax, at least three styles of syntax in multiple variants are widely in use, all better than the XML syntax: firstly there's the S-expression syntax, that unfortunately usually comes combined with a Turing-complete programming language, the aforementioned Lisp or Scheme. Then there are various braced syntaxes (including the one of CSS), and .INI. The last style, while in many senses my favourite, unfortunately does in most of its variants appear to have some limitations that I will come back to. There's also YAML, which actually seems to have some momentum behind it, but I find it too complex (and monoculturist).
As an example, let us have a look at all these syntaxes, along
with a traditional *nix configuration file, for the task of
configuring the mount point and mounting options for a device. That
is, let us look at the question of converting /etc/fstab to another
format. Of course, the rendition below is only one possibility of my
arbitrary choosing for each of the formats, but should in any case
provide a ‘feel’ for each of the formats. I will also not produce the
full file (which at least in case of XML would normally have still
more tags in it for describing the file), but only a fragment.
The “proprietary” original format:
# <file system> <mnt.pnt> <type> <options> <dump> <pass>
/dev/hda1 / ext3 defaults,errors=remount-ro 0 1
XML:
<mount>
  <device>/dev/hda1</device>
  <point>/</point>
  <type>ext3</type>
  <errors>remount-ro</errors>
  <dump>0</dump>
  <pass>1</pass>
</mount>
S-expression:
(mount
  (device '/dev/hda1)
  (point '/)
  (type 'ext3)
  (errors 'remount-ro)
  (dump 0)
  (pass 1)
)
Braced:
mount{
  device = "/dev/hda1",
  point = "/",
  type = "ext3",
  errors = "remount-ro",
  dump = 0,
  pass = 1,
}
.INI:
[mount]
device = /dev/hda1
point = /
type = ext3
errors = remount-ro
dump = 0
pass = 1
YAML:
mounts:
  - device: /dev/hda1
    point: /
    type: ext3
    errors: remount-ro
    dump: 0
    pass: 1
(Unless we only allow mount point specifications in the file, it seems we need an explicit list of them all, and the dash begins an item on it.)
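To make the conversion concrete, here is a minimal Python sketch that splits one fstab line into named fields and renders the result as an .INI-style block. The field names and the helper names parse_fstab_line and to_ini are assumptions made for this illustration. Unlike the renditions above, it keeps the whole options column as one field rather than splitting out errors.

```python
# Field names follow the comment header of the fstab example; they are an assumption.
FIELDS = ["device", "point", "type", "options", "dump", "pass"]


def parse_fstab_line(line):
    """Split one fstab line into named fields; comments and blank lines give None."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    return dict(zip(FIELDS, line.split()))


def to_ini(block, record):
    """Render a flat record as an .INI-style block."""
    lines = [f"[{block}]"]
    lines += [f"{key} = {value}" for key, value in record.items()]
    return "\n".join(lines)


record = parse_fstab_line("/dev/hda1 / ext3 defaults,errors=remount-ro 0 1")
ini_text = to_ini("mount", record)
```

The same record could be fed to a writer for any of the other syntaxes, since all of them encode the same small tree.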
As you can see, the original is by far the “lightest” of the formats, but without the comment describing each (poorly separated) field, provides little help in understanding the contents. Next comes the .INI and on the surface the YAML example. The latter is actually more complex, however, thanks to the explicit dashed list. Since YAML reserves a lot of character sequences for its own purposes, the complexity can in reality be even higher than this example would indicate. I do not fancy dodging a zillion “occasionally reserved” characters in unexpected places, as is the case in most “ASCII markup” languages (including YAML, ReST) that try to have a special character for everything (markdown apparently being the sole case with at least a bit of moderation). I would not mind a simpler colon-separated and indented format, however.
Back in the syntax lightness comparison, XML is the clear loser by a wide margin. There's a simple reason for this: XML syntax is based on the SGML (HTML) syntax, a syntax designed for marking up or tagging segments within a text document or other data. The syntax has been designed for situations where most of the content is data, although even there a TeX-style syntax can be far superior and less verbose. In configuration files, by contrast, a far greater proportion of the information is in the structure and attributes. Hence a markup syntax becomes far too heavy.
The difference between the variant of the braced format
featured here, and S-expressions isn't big, but there are less
verbose variants of the braced format, that wouldn't necessarily need
the quotation marks. The variant featured here could in fact be
parsed by Lua, and as I mentioned, programming language syntaxes
tend to require syntactical elements not necessarily needed in
configuration files, resulting in a strict and hence non-robust format.
It is precisely the robustness that is great about .INI. Notice how you
don't need to worry about terminating the mount block several lines after
where it begins, and how you don't need to separate the fields under it with
commas. There's less for the novice user to worry about – to be afraid of
– and hence less need for separate tools for editing the configuration file.
But even for the advanced user it is more convenient to not have to
worry about syntax errors.
As for the automatic processing premises set in the beginning of this posting, .INI, YAML, S-expressions and the braced format all have, as syntaxes for describing trees removed from the context and extra syntax of the aforementioned programming languages, a standard(isable) structure that could be “understood” by various tools. However, only .INI, being largely line-based, can with relative ease be processed with existing and widely available quick&dirty tools. (The other formats do not enforce line-basedness so much.)
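The kind of quick&dirty, line-by-line processing that a line-based format permits can be sketched in a few lines of Python. The function name grep_setting and the sample text are hypothetical. The point is only that the scanner needs no real parser: a block header switches it on or off, and everything else is a key and a value on one line.

```python
def grep_setting(text, block, key):
    """Return the values of `key` inside every [block], scanning line by line."""
    values = []
    inside = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # A block header switches the scanner on or off; nothing needs closing.
            inside = (line == f"[{block}]")
        elif inside and "=" in line:
            name, _, value = line.partition("=")
            if name.strip() == key:
                values.append(value.strip())
    return values


# Hypothetical sample with two blocks of the same name.
sample = """\
[mount]
device = /dev/hda1
point = /

[mount]
device = /dev/hda2
point = /home
"""
points = grep_setting(sample, "mount", "point")
```

Here points collects the mount point of each [mount] block in file order. The same job on a format that allows a record to be spread freely across lines would require tracking nesting and quoting.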
.INI extensions.
The basic .INI syntax is, however, rather limited, and would for
some uses need to be extended – or semantics standardised – to
support configuration “trees” deeper than one level, to support
multiple blocks with the same name, and to be able to include fields
with values possibly spanning multiple lines. For multiple levels it
suffices, for example, to specify that the block [foo.bar] refers to
a previous [foo] block, and is a sub-block of that instead
of a completely new one. Likewise, at the configuration file loading
library level, I prefer that each [block] statement begins a new
completely separate block, instead of appending to an existing one
with the same name, as some variants of the format presently do.
Other behaviours can be supported at a higher semantic level
(and possibly described, or the number of blocks with the same name
restricted, in something like XML DTDs).
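One possible reading of these two rules, sketched in Python: every [block] header opens a fresh block, and [foo.bar] attaches to the most recently opened [foo]. The function name parse_blocks and the shape of the returned data are assumptions made for this illustration, not a description of any existing library.

```python
def parse_blocks(text):
    """Parse .INI text into a list of top-level (name, block) pairs.

    Each block is {"settings": {...}, "children": [(name, block), ...]}.
    """
    top = []       # top-level blocks, in file order
    latest = {}    # dotted path -> most recently opened block with that path
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue
        if line.startswith("[") and line.endswith("]"):
            path = line[1:-1]
            parent_path, _, name = path.rpartition(".")
            # Every header starts a new, separate block, even if the name repeats.
            current = {"settings": {}, "children": []}
            if parent_path:
                # Attach to the latest block with the parent's path. A KeyError
                # here means a sub-block was named before its parent.
                latest[parent_path]["children"].append((name, current))
            else:
                top.append((name, current))
            latest[path] = current
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current["settings"][key.strip()] = value.strip()
    return top


blocks = parse_blocks("[foo]\na = 1\n[foo.bar]\nb = 2\n[foo]\na = 3\n")
```

In this sample the result holds two separate foo blocks, and the bar sub-block belongs to the first of them, since that was the latest [foo] when [foo.bar] appeared.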
Multi-line values for settings also demand extended syntax. One could,
of course, settle for quotes, but I think there's a nicer alternative,
more in line with the robustness of .INI. This choice of syntax
is reminiscent of the so-called literate programming style, where the
documentation of the source code forms the major part of the source file,
with intermittent segments of source code. (Speaking of this and
configuration documents, one possible approach for configuration
files would be literate configuration in the obvious analogous style.) In
applying this syntax to the embedding of scripts, such as key binding
callbacks, in the configuration file, one might thus likewise speak of
“configuration programming style”. The syntax I have in mind is to prefix
each line containing data/code with a particular symbol (e.g. >), just
like comment lines are prefixed with one (often #, although ; is seen
in .INIs):
[binding]
key = A
callback = if something
         > then do_something
         > else do_something_else
I seem to prefer the indentation, but in an .INI-style syntax it should not be necessary. If it were, one could even do without the markers. Alternatively, it might be a bit cleaner to only allow data segments as complete blocks, so that blocks contain either only data or only variable settings:
[binding]
key = A
[binding.callback]
> if something
> then do_something
> else do_something_else
See how the syntax still remains quite robust in both cases: there's no need to close the code/data segment, and a missing symbol will only result in one uninterpretable or misinterpreted line, unless the line contains a block marker. Note that the callback code is not parsed by the configuration file parser. Instead, all the configuration file structure is contained in the beginnings of the lines, and everything after a special marker symbol, such as the = and > signs, up to the end of the line, is taken to be data. Of course, when there's a lot of code or other data, the advantages of this kind of syntax over XML and other (better) tag-based formats gradually start to disappear.
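A minimal Python sketch of a loader for this marker-prefixed syntax might look as follows. The function name parse_with_data_lines is hypothetical. Storing the data of a pure data block under an empty key is an arbitrary choice for this illustration, as is joining continuation lines with newlines.

```python
def parse_with_data_lines(text):
    """Parse blocks whose values may continue on following lines prefixed with '>'."""
    blocks = []       # list of (name, settings) pairs, in file order
    settings = None
    last_key = None   # the setting that a '>' line would extend
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("["):
            settings = {}
            last_key = None
            blocks.append((line.strip("[]"), settings))
        elif settings is None:
            continue  # lines before any block header are ignored in this sketch
        elif line.startswith(">"):
            # Everything after the marker is opaque data; it is never parsed.
            data = line[1:].strip()
            key = last_key if last_key is not None else ""
            settings[key] = settings[key] + "\n" + data if key in settings else data
        elif "=" in line:
            name, _, value = line.partition("=")
            last_key = name.strip()
            settings[last_key] = value.strip()
    return blocks


bindings = parse_with_data_lines("""\
[binding]
key = A
callback = if something
> then do_something
> else do_something_else
""")
```

Only the first character of each line decides how the line is treated, so a data line may contain brackets or equals signs without confusing the loader.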
Stay tuned.
There is still other unfinished business with regard to scripting, namely the redundancy between scripting support and configuration files, when the configuration files are not simply scripts. That, however, I leave for another posting, given the relation to other issues I wish to discuss, and drifting away from our original subject of the merits and failures of different forms of configuration files. Here it should suffice to simply remark that the extra effort from the redundancy seems to be removable to a great extent by suitable tools – largely non-existent – and these are to be discussed.