Mitigating the Hetzner/Linode XMPP.ru MitM interception incident

(If you just want some recommendations for what to do, skip down to the Recommendations section below.)

Today, the operator of jabber.ru and xmpp.ru reported that their service had been successfully subjected to a man-in-the-middle attack via a combination of:

their hosting providers, Hetzner and Linode, intercepting traffic to their machines; and
the unauthorized issuance of a Domain Validation certificate for their service by an attacker.

It seems likely that this attack was orchestrated by the state of Germany (or Germany acting in concert with one or more other nation states). There are other possibilities; for example, both Hetzner and Linode might have decided to voluntarily comply with a wiretapping request from a foreign power that was not binding upon them, but this would reflect extremely badly on them, might well be illegal, and seems unlikely.

Detection. This attack could have been mitigated. It could also (potentially) have been detected:

The first way to detect this attack would be for the operators of the service to monitor Certificate Transparency logs to detect the issuance of certificates they did not request. There are some services which can do this for you, but we could probably still stand to have better tools here (e.g. tools which are good at notifying you only of certificates you didn't request).
The second way to detect this attack would be to periodically connect to the service and check that the public key used by the TLS server matches the one you expect.
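As a rough illustration of the first method, here is a minimal sketch of the filtering step such a tool needs: given certificates seen in CT logs and a record of the serial numbers you actually requested, report only the ones you didn't request. The LoggedCert shape, the function name and the sample values are all hypothetical; a real monitor would populate them from a CT log or a monitoring service, which this sketch doesn't attempt. It keys on serial number rather than a certificate hash, on the assumption that a precertificate and its final certificate share a serial number but not a hash.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedCert:
    """One certificate observed in a CT log (hypothetical, simplified shape)."""
    serial_hex: str   # certificate serial number, hex encoded
    issuer: str       # issuer name, kept only for the alert message
    dns_names: tuple  # DNS names the certificate covers


def unexpected_certs(observed, own_serials, watched_domains):
    """Return logged certificates for our domains that we did not request."""
    known = {s.lower() for s in own_serials}
    watched = {d.lower().rstrip(".") for d in watched_domains}

    def covers_watched(name):
        # Treat "*.example.org" like "example.org" for matching purposes.
        name = name.lower().rstrip(".")
        if name.startswith("*."):
            name = name[2:]
        return any(name == d or name.endswith("." + d) for d in watched)

    return [
        cert for cert in observed
        if cert.serial_hex.lower() not in known
        and any(covers_watched(n) for n in cert.dns_names)
    ]


if __name__ == "__main__":
    # Made-up sample data: one certificate we requested, one we did not,
    # and one for an unrelated domain.
    observed = [
        LoggedCert("03a1", "Example CA", ("jabber.ru", "*.jabber.ru")),
        LoggedCert("04b2", "Example CA", ("xmpp.ru",)),
        LoggedCert("05c3", "Example CA", ("example.org",)),
    ]
    for cert in unexpected_certs(observed, {"03A1"}, {"jabber.ru", "xmpp.ru"}):
        print("unrequested certificate:", cert.serial_hex, cert.issuer)
```

The hard part in practice is keeping the list of your own serial numbers complete, for example by having the ACME client record each one at issuance; otherwise every renewal becomes a false alarm.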

Both of these detection methods have some issues and potential gaps:

CT is optional. A certificate issued by a legitimate CA isn't necessarily logged to a CT log. Surprisingly, CT logging is still not a requirement of the CA/Browser Forum Baseline Requirements (which set the rules all CAs must follow).

What forces CAs to log certificates in CT is that web browsers now reject certificates unless they contain a cryptographic proof that they were logged to a CT log. Some CAs will sell you a certificate that isn't logged to CT (e.g. for “privacy” reasons) if you request one. Browsers may reject this, but there are many other kinds of client application (most of them, in fact) which don't check for or require a certificate to contain a proof of having been logged. So in this hypothetical scenario, the adversary could have procured an unlogged certificate.
Selective MitM. Trying to detect the MitM by probing the service could be worked around by detecting which connections are probing connections and not MitM'ing them. At a minimum it would be necessary to do something like perform the probe through Tor to prevent it from being trivially identified; however, this probably isn't perfect either. TLS stacks can be easily fingerprinted, as things like the order that TLVs are listed in give telltale signs of which TLS implementation is being used; it's quite likely that a service could have some level of success distinguishing between connections made by a real XMPP client and a probe agent. There's a million signals that could potentially give away that a connection isn't a “real” connection. An adversary could also just target specific persons it knows it is interested in intercepting (i.e., only MitM traffic on a whitelist, rather than exempting traffic on a blacklist of known probes). This therefore probably can't be considered too reliable either.

So neither of these detection methods seems particularly reliable.
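Imperfect as it is, the second method is cheap to try. The following is a minimal sketch of such a probe using only the Python standard library, under some loudly stated assumptions: it pins the SHA-256 hash of the whole leaf certificate rather than the public key (the standard library can't easily extract the key), so the pin list must be updated on every renewal; and it assumes the service offers a direct-TLS port, since it does not speak XMPP STARTTLS. The function names are my own, and nothing here addresses the selective MitM problem described above.

```python
import hashlib
import socket
import ssl


def cert_fingerprint(der_cert: bytes) -> str:
    """SHA-256 fingerprint (lowercase hex) of a DER-encoded certificate."""
    return hashlib.sha256(der_cert).hexdigest()


def matches_pin(der_cert: bytes, expected_fingerprints) -> bool:
    """True if the certificate's fingerprint is one we expect to see."""
    expected = {fp.lower() for fp in expected_fingerprints}
    return cert_fingerprint(der_cert) in expected


def fetch_der_cert(host: str, port: int, timeout: float = 10.0) -> bytes:
    """Fetch the leaf certificate presented by a direct-TLS endpoint."""
    context = ssl.create_default_context()
    # Chain validation is deliberately disabled: a MitM certificate issued
    # by a trusted CA would pass it anyway, so the pin is the actual check.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)


def probe(host: str, port: int, expected_fingerprints) -> bool:
    """Connect once and report whether the presented certificate is pinned."""
    return matches_pin(fetch_der_cert(host, port), expected_fingerprints)
```

Calling something like probe("xmpp.example", 5223, pins) from a scheduled job, ideally from several vantage points, and alerting on False would be the obvious usage; the hostname and port there are placeholders.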

Mitigation. The second area of consideration is mitigation, in which the unauthorized issuance of TLS certificates is prevented from happening in the first place. The entire point of a TLS certificate is, of course, to prevent a man-in-the-middle attack. The fundamental problem here is that the “Domain Validation” model by which CAs validate control of a domain name is ironically itself vulnerable to man-in-the-middle attacks, especially if an attacker can intercept not just some but all traffic to a victim site (as happened in this case).

Some years ago I authored ACME-CAA (RFC 8657), now implemented by Let's Encrypt, which can mitigate this in some circumstances. The basic idea is that you can configure a DNS record which specifies that only a specific account of a specific CA is authorised to issue certificates for a domain. Thus simply using the same CA isn't enough; you must gain access to the same account at the CA. With Let's Encrypt, this means gaining access to the ACME private key used to request a certificate. Based on what we know about the attack, it would have been prevented by deploying this extension.
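To make the idea concrete, here is a minimal sketch of what such a record looks like and of the check a CA conceptually performs against it. The function names, the CA domain example.net and the account URI are placeholders chosen for illustration; the parser is deliberately simplified and is not a complete implementation of the CAA syntax, so treat it as a model of the logic rather than of any CA's actual code.

```python
def acme_caa_value(ca_domain, account_uri, validation_methods=()):
    """Build the value of a CAA "issue" property pinned to one ACME account."""
    parts = [ca_domain, f"accounturi={account_uri}"]
    if validation_methods:
        parts.append("validationmethods=" + ",".join(validation_methods))
    return "; ".join(parts)


def parse_issue_value(value):
    """Split an issue value into (issuer domain, parameter dict). Simplified."""
    domain, *params = [part.strip() for part in value.split(";")]
    parsed = {}
    for param in params:
        if not param:
            continue
        tag, _, val = param.partition("=")
        parsed[tag.strip().lower()] = val.strip()
    return domain.lower(), parsed


def issuance_permitted(issue_values, ca_domain, account_uri):
    """Would this CA, acting for this account, be allowed to issue?"""
    if not issue_values:
        # No issue property at all: CAA places no restriction on issuance.
        return True
    for value in issue_values:
        domain, params = parse_issue_value(value)
        if domain != ca_domain.lower():
            continue
        pinned = params.get("accounturi")
        # Without an accounturi parameter, any account at this CA may issue.
        if pinned is None or pinned == account_uri:
            return True
    return False


if __name__ == "__main__":
    value = acme_caa_value(
        "example.net", "https://example.net/account/1234", ["dns-01"]
    )
    # Render the property as a zone file line for a placeholder domain.
    print(f'example.com. IN CAA 0 issue "{value}"')
```

In this model, an attacker who obtains an account of their own at the same CA fails the accounturi comparison, which is the property that matters against a MitM of the validation traffic.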

There are a fair number of caveats here, which are explained in full in the RFC and my deployment advice (recommended reading). The RFC is a lot more readable than most, so flipping through it is highly recommended for those interested in deploying ACME-CAA. Some caveats are as follows:

You do need to deploy DNSSEC for this to work, otherwise the DNS requests made by a CA can simply be intercepted.

Anyone who can get control of your DNSSEC signing key can also overcome this hurdle. So for example, a nation-state might simply serve a wiretap order on a hosting company like Hetzner or Linode, and serve a similar order on your DNS service provider.

It should be noted that it is possible to run a DNSSEC-secured DNS zone without giving your signing keys to anybody else; in this case the DNS hosting provider has no power to compromise the zone, so this seems like the best deployment strategy.
An adversary might be able to successfully compel your domain name registrar, or the TLD registry, to change the DNSSEC signing keys registered for a domain. This at least has the potential to be a “noisy” operation, and due to the nature of DNS caching, it may be hard for an adversary to prevent a recurring probe from detecting the change of key (unlike the selective MitM of a TLS connection discussed under Detection above).
An adversary might be able to successfully compel your CA to mis-issue a certificate.
You remain vulnerable to third party CAs which screw up or break the rules. The CA/Browser Forum Baseline Requirements now require CAA records to be checked by CAs, but a third party CA might mess up and issue a certificate anyway even if it's not listed by a domain's CAA record as authorised to issue certificates for the domain. Because logging to CT logs isn't a requirement, such certificates may never even be detected.

This is not an exhaustive list of caveats and you should refer to the RFC for the full details. Nonetheless, deploying ACME-CAA can offer a real level of mitigation here. It increases the number of hurdles for an attacker, especially when you spread different services around different jurisdictions. The game here is jurisdictional arbitrage and utilising the relative difficulty of international cooperation between adversarial powers. For example, if we assume that this incident is the product of coercion on the part of the German state, it doesn't necessarily follow that this adversary would be able to also coerce Let's Encrypt, for instance. Increasing the cost of attacks and the risk of them being detected also discourages nation-state adversaries, particularly as they are often loath to have attention drawn to their espionage activities.

What would a perfect attacker do? While the core aspects of this attack may have been readily mitigated with technologies which were available but undeployed, it also has highlighted some serious gaps in the TLS infrastructure as it is deployed today.

In this particular incident, the adversary was slapdash, and let their illicit certificate expire. In this regard, they are less than a perfect adversary; but we should expect attacks such as these to get better, not worse, and to become more frequent, as nation-states become more frustrated by the presence of cheap and easy encryption.

As such, it's useful to consider what a more competent nation-state adversary (whom I'll name Mallory for our purposes) would do. Here, I'll assume that Mallory can do anything other than actually compromise the victim machine itself or its operator (which is not a good assumption, but bear with it):

1. Mallory would take advantage of the fact that CAs aren't required to log certificates to Certificate Transparency logs, and request an unlogged certificate from a CA.
2. Mallory would compel the hosting provider to MitM all traffic going to the victim machine.
3. Mallory would use TLS stack fingerprinting and source IPs to heuristically identify traffic likely to be of interest and exclude traffic likely made to probe if the victim service has been compromised.
4. Mallory would use the MitM to trick the CA into thinking Mallory is the legitimate controller of the victim domain.
5. If the domain uses ACME-CAA with DNSSEC, this attack is foiled, so Mallory would attempt to compel the DNS hosting provider (if it holds the DNSSEC signing key, which it may not). If Mallory fails to do so, Mallory might try and coerce the domain registrar or TLD registry, but might have difficulty preventing this from being detected (if anybody is monitoring it, which usually isn't the case).
6. Mallory also might attempt to coerce the CA itself.

Holes in the TLS infrastructure. This hypothetical attack illustrates the following holes in the present state of the public TLS infrastructure, ordered in descending order of severity (in my opinion):

Lack of CT logging enforcement by non-web TLS clients. Point (1) here is interesting, even if the actual attacker here did not take advantage of it. Web browsers now require that certificates issued by CAs include cryptographic proof of having been logged to a Certificate Transparency log, so while you can legitimately get an unlogged certificate from a CA, it's not so useful on the public web.

However, it turns out there is a degree of inconsistency here. While web browsers enforce this requirement, most other TLS clients don't, and will accept an unlogged certificate. This probably includes most XMPP clients, as this CT validation is something that needs to be specially implemented. It is not something that is enabled “automatically” just by linking your application against OpenSSL.

This means that a large amount of internet infrastructure which supposedly benefits from the security provided by contemporary TLS actually receives a lower standard of protection than a web browser does. Such infrastructure is easier to exploit because it will accept an unlogged TLS certificate, which CAs are legitimately allowed to issue.

This seems to me a highly undesirable state of affairs, and thought should be given to how CT proof enforcement can be enabled by default in the future. Software which is liable to be used for highly sensitive communications (such as XMPP clients) should also consider reviewing how they are currently handling this issue and consider adding support for CT enforcement.
Lack of requirement to log certificates. The CA/Browser Forum Baseline Requirements, which is the set of industry rules CAs are required to follow (lest they be given the CA death penalty), does not require certificates to be logged to CT logs. This is arguably a surprising omission and I would argue this should be required.

Of course, in reality, this omission is surely not accidental; there are too many companies which think hiding a hostname from a CT log is a meaningful form of security by obscurity, and I assume it's the advocacy of these organisations that has kept CT from being mandatory. But it is certainly something I would advocate for, and I would hope the CA/Browser Forum reconsiders this in the future.

In any case, this issue can be rendered moot by universal deployment of the technical enforcement of CT logging, as per the above paragraph.
Need for more CT monitoring services. This is not really a gap in the TLS infrastructure as such, as anyone could create such a service, but the public would benefit from more CT monitoring services. The current major service available for CT alerting is SSLMate's Cert Spotter, but its pricing probably puts it out of reach for many services, especially those operated on a voluntary basis.

It should be emphasised that any CT monitoring solution needs to think through how to avoid false alarms and only alert when observing unusual or seemingly unauthorized certificate issuance.
Lack of DNSSEC transparency. Although the idea has been mulled in the past, there is currently no deployed cryptographic infrastructure for detecting changes to the DNSSEC keys configured for a domain. This renders domain name registrars and TLD registries vulnerable to coercion or compromise to change the DNSSEC keys registered for a domain, which allows the protection which ACME-CAA can offer to be undermined.

It would be highly desirable to see a transparency solution for DNSSEC zone keys ensuring that all changes are publicly visible. This transparency solution would only need to log changes to a zone's keys (DS records), not all records in a zone. (The latter is also possible but I wouldn't consider it essential, and organisations who have grown accustomed to the contents of DNS zones being un-enumerable would inevitably complain.)

Recommendations

With all this in mind, here are some recommendations for various parties:

I operate a service which I am concerned may be targeted by a nation-state. What should I do?
