Attributes of configuration languages
Software, particularly server software or the software running on network equipment, frequently requires configuration, often provided in the form of a configuration file. No particular standard for a configuration language has ever come to dominate, so the number of configuration file formats is almost as large as the number of pieces of software needing configuration.
Many of these configuration formats lack a formal specification of their syntax or semantics and are implemented as ad-hoc parsers inside the software that consumes them. Many of these formats also appear similar to other formats but with slight differences, as inspiration for how to design configuration languages flows from one influential piece of software to another. For example, no formal specification exists for INI files, but many applications have adopted an INI-like syntax, sometimes with notable application-specific variations. The BIND nameserver's configuration format seems also to have influenced many configuration formats now used by *nix server software, again with much subtle variation.
Herein, I attempt to analyse a large number of configuration languages and discern the properties and patterns that seem to pervade all of them. Rather than focusing on syntax, which is ultimately superficial, I will focus on the semantics and data model of a given language.
Attribute I: Ordering. The first attribute which can be used to classify configuration languages into fundamental categories is whether they are ordered. By ordered, I mean that the order of the items as they appear in a file is significant. Both ordered and unordered configuration languages are common.
For example, Apache's `httpd.conf` is an ordered language because the configuration items (“directives”) are processed in order, and changing the order of the directives can change the processing outcome. For example, the `RewriteRule` directive can be given multiple times and is processed in order. Since earlier rules can preempt later rules, changing the order of these directives would change how HTTP requests are handled.
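The effect of ordering on first-match rule processing can be sketched in a few lines of Python. This is only an illustration of the general idea: the rule list, the paths and the `rewrite` function are hypothetical, and the sketch does not attempt to model the actual semantics of `mod_rewrite`.

```python
import re

# Hypothetical rewrite rules as (pattern, replacement) pairs. The first rule
# whose pattern matches the path wins; later rules are never consulted.
RULES = [
    (r"^/blog/archive/(.*)$", r"/archive/\1"),
    (r"^/blog/(.*)$", r"/posts/\1"),
]


def rewrite(path, rules):
    """Return the path rewritten by the first matching rule, if any."""
    for pattern, replacement in rules:
        if re.match(pattern, path):
            return re.sub(pattern, replacement, path)
    return path


in_order = rewrite("/blog/archive/2019", RULES)
reversed_order = rewrite("/blog/archive/2019", list(reversed(RULES)))

print(in_order)        # /archive/2019
print(reversed_order)  # /posts/archive/2019
```

The same two rules produce different results depending purely on the order in which they are listed, which is what makes a language built around such directives ordered.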
By contrast, some other configuration languages are unordered. This includes, for example, anything consuming JSON, YAML or TOML as a configuration language, because JSON objects (and the equivalent YAML mappings and TOML tables) are unordered dictionaries. The Windows Registry is similarly unordered. INI files should be unordered, though there are certain to be custom variants which make them essentially ordered in some cases.
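As a small illustration, Python's standard `json` module can be used to show that two JSON objects which differ only in key order describe the same configuration, whereas two arrays which differ in element order do not. The key names here are made up for the example.

```python
import json

# Two objects with the same members in a different textual order.
a = json.loads('{"listen": 8080, "workers": 4}')
b = json.loads('{"workers": 4, "listen": 8080}')
same_objects = (a == b)

# Two arrays with the same elements in a different order.
c = json.loads('["first", "second"]')
d = json.loads('["second", "first"]')
same_arrays = (c == d)

print(same_objects, same_arrays)  # True False
```

Python dictionaries happen to remember insertion order, but equality between them ignores it, which matches the unordered data model of JSON objects.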
Some software may use Turing-complete imperative languages or even popular scripting languages to express configuration, in which case the language itself is necessarily ordered, though this does not necessarily imply the configuration built via its execution uses an ordered data model.
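A minimal sketch of this distinction, using Python itself as a hypothetical configuration script: the statements must run in sequence, yet the configuration they build is an unordered mapping. The keys and values are invented for illustration.

```python
# Hypothetical configuration script. The statements execute in order, and
# the second assignment to "port" depends on the first having run already.
config = {}
config["port"] = 8080
config["port"] = config["port"] + 1
config["name"] = "example"

# The resulting data model is an unordered mapping: it compares equal to the
# same keys and values written in any other order.
same = (config == {"name": "example", "port": 8081})
print(same)  # True
```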
Ordered | Unordered |
---|---|
Apache httpd | JSON |
nginx | YAML |
HAProxy | TOML |
S-Expressions | INI (usually) |
XML | Windows Registry |
OpenDDL | Filesystems |
Augeas | BIND(?) |
JunOS† | JunOS† |
| | Cisco IOS |
The rising popularity of JSON and languages essentially translatable to it appears to be driving a rise in the popularity of unordered representations. On the other hand, ordered representations appear to be popular with HTTP servers, where it is often desired to process requests according to a sequence of rules where the first matching rule wins, and where it may be inconvenient to have to specify all the rules in one place (e.g. as a JSON array).
Note that for the purposes of classifying languages, I consider the primary syntactic constructs one would be expected to use in a given language; this article classifies configuration languages according to how they are intended to be used and how they are actually used. It would be farcical to argue, for example, that JSON is actually ordered because one could write a JSON configuration file which uses only arrays and no objects at all. In other words, when I say a language is unordered, this does not imply that it has no capacity to express ordered sequences, simply that the core paradigm of the language as a means to express configuration is centred around unordered constructs.
Attribute II: Namedness. The second attribute which can be used to classify configuration languages is namedness. A first class of languages, which I will refer to as key-value or named languages, more or less require everything to be named, in contrast to a second class of languages, which I will call unnamed languages. For languages like JSON, while arrays can be used to avoid having to name things, this also imposes ordering. Where an unordered JSON object is used, items included must be named via presence of a key.
To illustrate this, consider the following simplified hypothetical configuration, which instantiates four worker threads in some daemon. Each worker thread is assigned to handle different kinds of work request as they come in.
worker-thread { handle a b; };
worker-thread { handle b c; };
worker-thread { handle b c; };
worker-thread { handle c d; };
This is an unordered construct; it doesn't matter in which order the worker threads are spawned during initialisation, and changing the order doesn't change the outcome. The remits of these workers overlap, and there is no obvious way to assign names to these instances based on their definition. If using a JSON representation, for example, one would have to either
- use arrays, which do not create a naming burden but which are also ordered (though this may well be inconsequential, it is a semantic distinction), or
- require a name to be assigned to each such instance.
This is a subtle and usually rather inconsequential point, since making an unordered construct ordered is unlikely to create any issues. It does, however, demonstrate that most unordered languages impose a naming burden on unordered constructs. As the example shows, this is not a requirement; the attribute of namedness is therefore orthogonal to the attribute of ordering.
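The two JSON options for the worker-thread example can be sketched as follows. The JSON layout, the `handle` key and the names `worker1` to `worker4` are assumptions made for illustration; the point is only that the array form carries an order the application must deliberately ignore, while the object form forces names to be invented.

```python
import json

# Option 1: an array. No names are needed, but the elements have an order.
as_array = json.loads("""
[
  {"handle": ["a", "b"]},
  {"handle": ["b", "c"]},
  {"handle": ["b", "c"]},
  {"handle": ["c", "d"]}
]
""")

# Option 2: an object. Unordered, but every worker needs an invented name.
as_object = json.loads("""
{
  "worker1": {"handle": ["a", "b"]},
  "worker2": {"handle": ["b", "c"]},
  "worker3": {"handle": ["b", "c"]},
  "worker4": {"handle": ["c", "d"]}
}
""")


def as_multiset(workers):
    """Normalise a collection of workers so that order no longer matters."""
    return sorted(tuple(w["handle"]) for w in workers)


reordered = list(reversed(as_array))
arrays_equal = (as_array == reordered)
multisets_equal = (as_multiset(as_array) == as_multiset(reordered))
same_workers = (as_multiset(as_object.values()) == as_multiset(as_array))

print(arrays_equal, multisets_equal, same_workers)  # False True True
```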
| | Ordered | Unordered |
---|---|---|
Named/KV | JunOS† | JSON, YAML, TOML, INI, Windows Registry, filesystems, Cisco IOS∆, JunOS† |
Unnamed | Apache httpd, nginx, [HAProxy?], S-Expressions, XML, OpenDDL, Augeas | Cisco IOS∆, BIND‡ |
Reasoning for classification of languages
- JSON
- JSON objects are explicitly unordered constructs. While one could use JSON arrays and not objects and get an ordered, unnamed language, this is not how JSON is intended to be used. Predominant usage focuses on objects with arrays being used for secondary purposes where ordering is desired or naming burden is undesirable. Primary use of JSON objects creates a naming burden, making this a named language.
- YAML, TOML
- See JSON.
- INI
- Pure INI syntax does not support any kind of ordered construct such as arrays, making it a pure unordered named language. It should be noted, however, that INI is very frequently adapted or extended to support things such as ordered constructs or arrays. For example, PHP's `php.ini` configuration file allows the same `extension=` key to be specified multiple times to load multiple extensions. Some software uses specific INI sections as ordered arrays instead of dictionaries, with one entry per line and no keys and just values. TOML can perhaps be viewed as a kind of formalisation of such an evolution.
- Windows Registry
- The Windows Registry is a filesystem-like construct; see Filesystems.
- Filesystems
- Files are not ordered in a directory. \*nix APIs such as `readdir` may return directory entries in an arbitrary order; tools such as `ls` sort this output, by default lexicographically, before displaying it. From time to time on \*nix systems, there has been a need to allow files in a directory to be “ordered”, for example init scripts, which are often given a numeric prefix to force a specific sort order, ensuring they are executed in the correct order: `00-foo`, `10-foo`. Thus, filesystems are unordered constructs which occasionally have ordering functionality crudely grafted on top. All files must have a name, making them named constructs.
- Augeas
- Augeas creates a filesystem-like hierarchy, but with the twist that a) multiple nodes at a given level can have the same name, b) the tree is ordered, and c) a node can have both children and a value. Where there are multiple children with the same name, they are disambiguated with an integer index, e.g. `foo[1]`, `foo[2]`, etc. Since `foo` will often be a type name rather than an instance name, this support for multiple nodes with the same name makes Augeas more of an unnamed language than a named one, though it can be used as either.
- Cisco IOS
- ∆The language is unordered and available commands provide no control over ordering. Where ordering is important, for example in IP access control lists where the first match wins, ordering is crudely grafted on top using BASIC-esque line number prefixes (`10 permit ...`, `20 deny ...`, etc.). Most kinds of configuration statement which can have multiple instances require names (e.g. interfaces, ACLs, DHCP pools) to enable them to be referred to from other configuration statements. Occasionally, a configuration statement which can have multiple instances might not accept a name if it doesn't need to be referenced from any other configuration statement and it can be uniquely identified (for example for the purposes of deleting it with the `no` verb) by listing all of its arguments. For example, in a `voice class sip-profiles` statement, `request ANY sip-header Contact modify "A" "B"` and `request ANY sip-header Contact modify "C" "D"` might both be configured. These statements perform a search-replace from A to B and from C to D respectively; such statements can be instantiated arbitrarily many times, yet have no name. Deleting either of these statements requires the `no` verb and typing out the statement to be negated in full; the statement is simply identified by its description, not any kind of name. Considering all this, I'd consider Cisco IOS “mostly, but not wholly, named”.
- JunOS
- †JunOS is something of a hybrid case because it can make the same syntactic constructs either ordered or unordered based on a schema. For example, `unit 0 { ... } unit 1 { ... }` is unordered, but `term A { ... } term B { ... }` is ordered. This is schema-dependent; an `insert` command can be used to move ordered items before or after one another, but cannot be applied to unordered items, which are always shown sorted (e.g. `unit 0` before `unit 1`) when configuration is printed. Most statements are unordered. As for naming, JunOS requires that any configuration statement that can be instantiated multiple times have a name; this is enforced at the schema definition level, making JunOS firmly named.
- Apache httpd
- The language is ordered; for example `mod_rewrite`'s `RewriteRule` is used multiple times to specify a list of rewrite rules. The rewrite rules are applied in order, and earlier rules can short-circuit evaluation.
- nginx
- TODO
- HAProxy
- TODO
- S-Expressions
- Since S-Expressions are composed entirely of ordered lists and scalar values, they are inherently ordered. They do not inherently impose any naming burden.
- XML
- XML is a document markup language and is neither suited nor appropriate as a configuration format; it is listed here only due to the sheer pervasiveness of such use despite this. As a document markup language, it annotates natural language, a linear, ordered construct, and thus is inherently ordered itself. It does not impose any inherent naming burden (note that things such as element names are considered types and not names for our purposes.)
- OpenDDL
- Both OpenDDL structures and arrays preserve the order of their children.
- BIND
- ‡BIND-like configuration formats do not necessarily create a naming burden, as shown above; however in practice, all BIND constructs are named. This is an application-specific schema design choice, so I've chosen to classify it as unnamed from a language perspective.
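To make the Filesystems entry above concrete, here is a minimal Python sketch of ordering being grafted onto an unordered directory by means of numeric filename prefixes. The file names are hypothetical; `os.listdir` is documented as returning entries in arbitrary order, so it is the consumer that imposes an order by sorting.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    # Create the files in an order unrelated to their intended run order.
    for name in ("10-network", "00-mounts", "20-daemon"):
        open(os.path.join(directory, name), "w").close()

    raw = os.listdir(directory)  # entry order is unspecified
    ordered = sorted(raw)        # the consumer imposes lexicographic order

print(ordered)  # ['00-mounts', '10-network', '20-daemon']
```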
Wait, I thought these were supposed to be orthogonal? Though we were able to conceive of a configuration item above which is naturally unordered and naturally nameless, in practice there appears to be a strong correlation between unordered and named languages, and between ordered and unnamed languages. I've listed BIND as Unnamed since its syntax, as the worker-thread example shows, easily lends itself to unordered and unnamed expression, but I believe that in practice all actual BIND directives require names anyway. Cases like the worker-thread example, where something can be instantiated multiple times yet has no natural name, appear rare, which probably explains why configuration languages largely haven't evolved with that use case in mind.
Unnamed multi-instance items. In the rare cases where unnamed multi-instance items are needed, some kind of secondary array syntax will usually suffice. By this I mean secondary language constructs, such as JSON arrays, which allow the expression of sequences or sets and which, unlike the primary language constructs, do not impose a naming burden. Generally a consequence of this will be that all such instances must be specified in one place. Though requiring this may actually aid comprehension in some contexts, in others it might hinder it, or make it harder to compose multiple modular configuration files dynamically. Consider for example an adaptation of our worker-thread example:
worker-threads = [
{ handle a b; }
{ handle b c; }
{ handle c d; }
];
If humans edit configuration files by asking questions like “what is the total list of all configured worker threads?”, this actually makes configuration files more readable. However, if the handlers A through D come from completely different modules with different functions, humans will commonly want to organise the configuration by module instead: each module registers its own worker thread, and a module is enabled or disabled simply by including, or not including, its configuration file, which automatically instantiates the needed worker threads. This is thus a tradeoff between two different aspects of configuration readability and maintainability.
However, since a configuration file parser could easily produce reports such as the total list of configured worker threads from a multi-file representation, optimising a format for the latter, modular design seems likely to be the preferable tradeoff, as the former end can still be accomplished anyway. Thus the forced use of array-like constructs where a naming burden is not desired might be considered suboptimal, though such constructs appear rare enough that this is not a major flaw.
Occasionally, configuration languages work around this limitation by defining merge rules between configuration documents, which may for example cause items in an array in one document to be appended to items in an array in the same position in another document. This can be done with bespoke configuration languages, or by defining custom merge semantics on top of an existing language such as JSON.
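One possible set of merge rules can be sketched in Python over already-parsed JSON-like documents. The rules chosen here (objects merge key by key, arrays in the same position are concatenated, anything else is overridden by the later document) are an assumption for illustration, not a description of any particular tool, and the document contents are hypothetical.

```python
def merge(base, overlay):
    """Merge two parsed configuration documents.

    Objects are merged key by key, arrays are concatenated, and for any
    other combination the value from the overlay document wins.
    """
    if isinstance(base, dict) and isinstance(overlay, dict):
        merged = dict(base)
        for key, value in overlay.items():
            merged[key] = merge(base[key], value) if key in base else value
        return merged
    if isinstance(base, list) and isinstance(overlay, list):
        return base + overlay
    return overlay


# Each hypothetical module ships its own file registering its own worker.
module_ab = {"worker-threads": [{"handle": ["a", "b"]}]}
module_cd = {"worker-threads": [{"handle": ["c", "d"]}], "log-level": "info"}

combined = merge(module_ab, module_cd)
print(combined["worker-threads"])
# [{'handle': ['a', 'b']}, {'handle': ['c', 'd']}]
```

With rules like these, including or omitting a module's file adds or removes its worker threads without any single file having to list them all. Note that the result depends on the order in which documents are merged, so the merge step itself reintroduces a form of ordering.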
Similarly, although languages in the (Ordered, Unnamed) quadrant above do not impose a naming burden, configuration schemas built on them may still choose to require names, and frequently do. Either choice is a feasible design when using S-Expressions or BIND-like syntax, for example:
(worker-thread (handle a b)) ;; unnamed
(worker-thread foo (handle a b)) ;; named
worker-thread { handle a b; } // unnamed
worker-thread foo { handle a b; } // named
In this regard, “unnamed” is perhaps a bit of a misnomer: practically every language in the Unnamed category has some facility for naming things, and will require names to be assigned some of the time. What “unnamed” refers to is that the syntax of the language does not systematically impose a naming burden.
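A rough sketch of how a schema, rather than the syntax, might decide whether a name is present: the function below interprets an already-parsed form, represented as nested Python lists, and treats a leading bare symbol as an optional name. The representation and the `read_worker` function are hypothetical and exist only to illustrate the two variants shown above.

```python
def read_worker(form):
    """Interpret a parsed form such as
    ["worker-thread", "foo", ["handle", "a", "b"]], where the name is optional.
    """
    head, *rest = form
    if head != "worker-thread":
        raise ValueError("not a worker-thread form")

    # A bare string directly after the head is taken to be the optional name.
    name = None
    if rest and isinstance(rest[0], str):
        name, *rest = rest

    # Collect the arguments of every (handle ...) child.
    handles = [arg for item in rest if item[0] == "handle" for arg in item[1:]]
    return name, handles


unnamed = read_worker(["worker-thread", ["handle", "a", "b"]])
named = read_worker(["worker-thread", "foo", ["handle", "a", "b"]])

print(unnamed)  # (None, ['a', 'b'])
print(named)    # ('foo', ['a', 'b'])
```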
Summary of languages.
Language | Named? | Ordered? |
---|---|---|
Apache httpd | No | Yes |
nginx | No | Yes |
HAProxy | No | Yes |
S-Expressions | No | Yes |
XML | No (attributes can be used to add naming information, but this is not required) | Yes |
OpenDDL | Optional (objects can be named to enable them to be symbolically cross-referenced) | Yes |
JSON | Yes | No |
YAML | Yes | No |
TOML | Yes | No |
INI | Yes | No |
Windows Registry | Yes | No |
Filesystem | Yes | No |
BIND | No (in language terms; schemas actually used mandate naming in practice) | No |
Augeas | No (in language terms; naming easily implemented on top) | Yes |
Cisco IOS | Mostly named, but some statements can be instantiated multiple times based on their unique set of arguments | No |
JunOS | Yes | Mostly unordered, but statements can be defined as ordered via schema |
Work in progress. Note that there may be corner cases with some of these languages that I am not aware of that would fundamentally change my classification of them; this remains a rough view. Also, if you know of a configuration language you think is interesting or unusual not listed here, let me know about it.