XML is almost always misused

XML was invented in 1996. No sooner had it been created than it was adopted for all manner of misconceived applications for which it was a poor choice.

It is no exaggeration to say that the vast majority of XML schemas I have ever seen constitute an inappropriate or misguided use of XML. Moreover, these misapplications betray a fundamental misunderstanding of what XML is in the first place.

XML is a markup language. It is not a data format. The designers of most XML schemas fail to appreciate this distinction and confuse XML with a data format. They were thus mistaken to adopt XML in the first place, because what they were really looking for was a data format.

Broadly speaking, XML excels at annotating corpuses of text with structure and metadata. If what you have in the first place is not a corpus of text, XML is unlikely to be a good choice.

With this in mind, there is a simple test for determining whether an XML schema is well designed: take an exemplary document in the proposed schema, and remove all tags and attributes from it. If what you have left over does not make sense (or is the empty string), either your schema is badly designed or you shouldn't be using XML at all.
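
The test is easy enough to mechanise. What follows is only a minimal sketch in Python using the standard library's xml.etree.ElementTree; the function name strip_markup and the sample prose document are my own illustrative assumptions, and the key-value sample is of the kind criticised in this article.

```python
import xml.etree.ElementTree as ET


def strip_markup(xml_text: str) -> str:
    """Return only the character data of a document, without tags or attributes."""
    root = ET.fromstring(xml_text)
    # itertext() yields every text node in document order; attributes are never included.
    text = "".join(root.itertext())
    # Collapse runs of whitespace so indentation in the source does not matter.
    return " ".join(text.split())


# A hypothetical prose document: text annotated with structure and metadata.
prose = '<p>XML is a <em>markup</em> language, not a <abbr title="d">data</abbr> format.</p>'

# A key-value "dictionary" smuggled into attributes.
key_value = '<root><item name="name" value="John" /><item name="city" value="London" /></root>'

prose_residue = strip_markup(prose)
key_value_residue = strip_markup(key_value)

print(repr(prose_residue))
print(repr(key_value_residue))
```

The prose document survives the test as a readable sentence; the key-value document leaves nothing behind at all.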

Here are some very frequently occurring examples of bad schema design:

  1. <root>
      <item name="name" value="John" />
      <item name="city" value="London" />
    </root>

    Here we have an example of a misguided and bizarre (yet frequently seen) attempt to express a simple key-value dictionary in XML. If we remove all tags and attributes, we have the empty string. Essentially, this document is, nonsensically, the semantic annotation of the empty string.

  2. <root name="John" city="London" />

    Even worse: not only are we still semantically annotating the empty string as a bizarre way of expressing a dictionary, but the “dictionary” is now directly encoded as the attributes of the root element. This makes the set of attribute names on the element open-ended and dynamic rather than fixed by the schema. Moreover, it demonstrates that all the author really wanted was a simple key-value syntax, yet they bizarrely chose XML anyway, forcing themselves to use a single empty element just as a pretext for the attribute syntax. I have seen such schemas far too often.

  3. <root>
      <item key="name">John</item>
      <item key="city">London</item>
    </root>

    This is a half-improvement, but now the keys are for some reason metadata and the values aren't, which is a very strange position to take on the nature of a dictionary. If we remove all tags and attributes, we lose half of our information.

The correct way to express a dictionary in XML is something like this:

<root>
  <item>
    <key>Name</key>
    <value>John</value>
  </item>
  <item>
    <key>City</key>
    <value>London</value>
  </item>
</root>

But if the people who made the strange decision to use XML as a data format, and then to use it to serialize a dictionary, chose this schema, they might realise that XML is unsuited to what they are doing and unergonomic. More often than not, when designers mistakenly choose XML for their application, they double down on the nonsense by using one of the forms above, so as to avoid confronting the fact that XML is a poor fit for their application.

The worst XML schema ever? Incidentally, the award for the worst XML schema I have ever seen goes to the autoprovisioning configuration file format for Polycom IP phones. These demand XML files loaded over TFTP, which, well, here's an extract from one:

<softkey
        softkey.feature.directories="0"
        softkey.feature.buddies="0"
        softkey.feature.forward="0"
        softkey.feature.meetnow="0"
        softkey.feature.redial="1"
        softkey.feature.search="1"

        softkey.1.enable="1"
        softkey.1.use.idle="1"
        softkey.1.label="Foo"
        softkey.1.insert="1"
        softkey.1.action="..."

        softkey.2.enable="1"
        softkey.2.use.idle="1"
        softkey.2.label="Bar"
        softkey.2.insert="2"
        softkey.2.action="..." />

This is not a sick joke. This is not something I made up:

  • Elements are just used as a pretext to attach attributes which themselves have hierarchical names.
  • If multiple instances of a given kind of thing need to be instantiated, you have to do so by using attribute names with indices in them.
  • Not only that, attributes beginning with softkey. need to be placed on an element <softkey/>, attributes beginning with feature. need to be placed on an element <feature/>, etc., despite this being completely redundant and serving no apparent purpose.
  • Finally, just as an extra “fuck you”, if you were hoping that the first component of the attribute name always matches the element name, nope! For example, up. attributes must be attached to <userpreferences/>. The mapping from attribute names to what elements they must be attached to is more or less completely arbitrary.
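
To see how much structure is being smuggled into those attribute names, here is a rough Python sketch that unfolds the dotted names into the nested dictionary they are evidently imitating. It is purely illustrative: the function name nest and the abbreviated extract are my own assumptions, and this says nothing about how the phones themselves parse the file.

```python
import xml.etree.ElementTree as ET


def nest(attributes):
    """Unfold attribute names like 'softkey.1.label' into nested dictionaries."""
    tree = {}
    for dotted_name, value in attributes.items():
        *parents, leaf = dotted_name.split(".")
        node = tree
        for part in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree


# Abbreviated from the extract above.
extract = (
    '<softkey softkey.feature.redial="1" '
    'softkey.1.enable="1" softkey.1.label="Foo" '
    'softkey.2.enable="1" softkey.2.label="Bar" />'
)

config = nest(ET.fromstring(extract).attrib)
print(config)
```

The hierarchy and the list of softkeys only appear once the attribute names have been parsed a second time, which is exactly the job the element structure was supposed to be doing.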

Documents vs. data. From time to time someone will do something really strange and compare XML and JSON, proving that they understand neither. XML is a document markup language; JSON is a structured data format, and to compare the two is to compare apples and oranges.

The documents vs. data concept is useful for understanding this. XML can be roughly analogised to a machine-intelligible document; though machine-readable, it is still embedded in the document metaphor, and is in this regard actually comparable to the generally machine-unintelligible PDF. This is distinct from data, which is independent of any particular representation of that data in the document metaphor.

To use an example, in XML the ordering of elements is significant, whereas in JSON the ordering of the key-value pairs inside objects is meaningless and undefined. If you want an unordered dictionary of key-value pairs, the actual ordering of the physical representation in the file is meaningless. You could nonetheless produce many different documents from that data, because a document has a concrete order to it; it is metaphorical paper, though unlike a physical printed document or a PDF, it is one without physical dimensions.
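
A small Python sketch of the difference, using only the standard library. The helper name entries is my own; the XML uses the item/key/value form given earlier in this article. Note that Python's dict happens to remember insertion order, but its equality comparison ignores it, which is the property that matters here.

```python
import json
import xml.etree.ElementTree as ET

# Two JSON texts that differ only in the order of their key-value pairs.
json_a = json.loads('{"Name": "John", "City": "London"}')
json_b = json.loads('{"City": "London", "Name": "John"}')
same_data = json_a == json_b  # dict equality does not consider order


def entries(xml_text):
    """Return the (key, value) pairs of each <item>, in document order."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("key"), item.findtext("value")) for item in root]


# Two XML documents that differ only in the order of their <item> elements.
xml_a = entries(
    "<root><item><key>Name</key><value>John</value></item>"
    "<item><key>City</key><value>London</value></item></root>"
)
xml_b = entries(
    "<root><item><key>City</key><value>London</value></item>"
    "<item><key>Name</key><value>John</value></item></root>"
)
same_document = xml_a == xml_b  # the child sequences differ

print(same_data, same_document)
```

The two JSON texts denote the same data; the two XML documents are different documents, and any equivalence between them is something the consuming program has to decide for itself.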

The example I gave of how to correctly represent a dictionary in XML necessarily gives an order to the elements in the dictionary, unlike a JSON representation. I can't choose not to express such an ordering; this linearity is inherent to the document metaphor and to XML. Some program interpreting this XML document might choose to disregard the ordering, but this is a moot point, as this is out of scope of the discussion of the format itself. Moreover, making the document viewable in a web browser by attaching a CSS stylesheet to it will show the elements of the dictionary in the given order, not in any other order.

In other words, a dictionary (a piece of structured data) can be converted into n different possible documents (XML, PDF, paper or otherwise), where n is the number of possible permutations of the elements in the dictionary, and this is before we consider other possible variables.
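
For a dictionary of k entries that count is k!. As a hedged sketch in Python: the render function is my own illustration, and the third entry, "Country", is a hypothetical addition so that the count is not trivially two.

```python
from itertools import permutations
from math import factorial
from xml.sax.saxutils import escape


def render(pairs):
    """Serialize (key, value) pairs, in the order given, as an XML document."""
    items = "".join(
        f"<item><key>{escape(key)}</key><value>{escape(value)}</value></item>"
        for key, value in pairs
    )
    return f"<root>{items}</root>"


# "Country" is a hypothetical third entry, added for illustration only.
data = {"Name": "John", "City": "London", "Country": "UK"}

# One distinct document per ordering of the same data.
documents = {render(order) for order in permutations(data.items())}

print(len(documents), factorial(len(data)))
```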

However, it also follows from this that if you want to convey pure data, a machine-intelligible document is an inefficient way to do it. It introduces a wholly unnecessary obfuscating metaphor, and code must be written to extract the original data from this document. There is little reason to use XML for anything not intended at some point or another to be directly formatted for human consumption as a document (say, by CSS or XSLT or both), because this is the primary (or only) reason to cling on to the document metaphor. Moreover, since XML has no notion of numbers (or booleans, or other data types), any numbers represented are just considered more text. The schema and its relation to the underlying data expressed must be known to recover the data, and to know when, contextually, some piece of text represents a number and should be converted to one, etc.
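
To make the point about types concrete, here is a minimal Python sketch. The config document, its element names and the CONVERTERS table are all hypothetical, invented purely for illustration; the table stands in for the out-of-band schema knowledge a consumer must have.

```python
import json
import xml.etree.ElementTree as ET

# JSON carries the distinction between numbers, booleans and strings itself.
from_json = json.loads('{"port": 8080, "enabled": true}')

# The same information in a hypothetical XML document: everything is text.
root = ET.fromstring("<config><port>8080</port><enabled>true</enabled></config>")
from_xml = {child.tag: child.text for child in root}

# Hypothetical schema knowledge, which the consumer has to supply separately.
CONVERTERS = {
    "port": int,
    "enabled": lambda text: text == "true",
}
typed = {tag: CONVERTERS[tag](text) for tag, text in from_xml.items()}

print(from_json)
print(from_xml)
print(typed)
```

Only after applying the converters does the XML yield the same data that the JSON parser produced directly.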

The process of recovering data from XML documents is thus not wholly dissimilar to the process of recovering data by OCR'ing scanned printed documents, say for example containing tables forming pages and pages of numerical data. Yes, you can do it, but it's suboptimal and should only be done if there's absolutely no alternative. The sane solution is simply to obtain a digital copy of the original data, not embedded in a document metaphor which conflates the data with a specific textual representation of it.

It doesn't, however, surprise me that businesses like XML, precisely because businesses understand the notion of (paper) documents and want to continue with a familiar metaphor that they understand. For the same reason, businesses overuse PDFs instead of more machine-friendly formats: they remain attached to the notion of a printed page with a particular physical size, even for documents highly unlikely ever to be printed (e.g. 8000-page PDFs of register documentation). In this regard, business use of XML is essentially a skeuomorphism. People understand the metaphorical idea of a printed page, with finite size, and they understand how to craft business processes out of printed documents. If this is your baseline, documents which are machine-intelligible and have no finite physical size (that is, XML documents) represent innovation relative to it, while retaining the comfortingly familiar document metaphor. However, this remains an impure and needlessly skeuomorphic representation of data. To date, the only XML schemas I have seen which I would actually consider a good use of XML are XHTML and DocBook.