Writing a KVM hypervisor VMM in Python

Linux's Kernel Virtual Machine (KVM) is a hypervisor built into the Linux kernel. Whereas in the past Xen proved a more popular hypervisor for multi-tenant clouds — being the original basis of AWS EC2, for example — over the years Linux's own hypervisor has matured and grown in popularity, and has now eclipsed Xen as the hypervisor of choice for many clouds. AWS eventually switched to KVM (though it's unclear to me whether they still use it or a custom hypervisor these days), and Google Cloud has used KVM from the beginning.

There are probably many reasons for this, but two in particular come to mind:

  • The greater ease of administration offered by KVM. Since KVM virtual machines are created and maintained by ordinary processes on Linux, a VM essentially just looks like an ordinary long-running process, and can be launched, managed and terminated accordingly. By comparison, Xen requires the reconfiguration of the system's boot process to boot Xen before Linux, and requires its virtual machines to be managed using Xen-specific commands.

  • The higher flexibility of KVM. In particular, KVM features a clean separation between the hypervisor and the VMM, which is discussed below. This allows the same KVM hypervisor to power many different VMMs, which has unlocked a lot of innovation in the VMM space.

Background

What is a VMM? A system for hosting virtual machines traditionally comprises a hypervisor and a Virtual Machine Monitor (VMM). Though it is perhaps now common for the sum of these parts to be collectively referred to as a “hypervisor”, the distinction between these two parts is a useful and significant one. The distinction is as follows:

[A block diagram showing a block labelled 'Hypervisor' at the bottom, with one block 'VMM' above it and three identical blocks labelled 'VM' also above it, so that they are sitting on top of it. Arrows show I/O requests being made from the VMs to the hypervisor, and then forwarded from the hypervisor to the VMM. Conversely, I/O responses from the VMM are shown returning to the hypervisor and being forwarded back to the VMs.]
An environment for hosting virtual machines

A hypervisor is responsible for isolating the different virtual machines from one another and from the host operating system, but (assuming a proper separation of hypervisor and VMM) does not itself handle the service requests made by guest VMs, such as I/O requests. Instead, such requests are trapped and passed to a VMM for handling. The VMM handles the request however it sees fit, then provides a response to the hypervisor, indicating that it can now continue executing the VM.

This has the interesting consequence that a hypervisor itself can be more or less free of policy, allowing the nature of how all I/O is handled for a VM to be defined by the VMM, not the hypervisor.

For example, modern x86 hardware-assisted virtualisation systems are more or less designed to enable guest VMs to run as though they are running on bare metal. Part of emulating the appearance of bare metal is, of course, simulating the various different I/O devices a real x86 PC would have — right down to the floppy disk controller. Rather than having the hypervisor provide simulation of these I/O devices, a VM's I/O devices are implemented by a VMM.

The separation of I/O-related concerns into the VMM therefore has the desirable effect of reducing the scope of the hypervisor, the correctness of which is security-critical, allowing for a smaller TCB; moreover, it also has the interesting effect that you have the option of using different VMMs with the same hypervisor. For example, one VMM might try to emulate a traditional PC with floppy disk controllers and so on; another VMM might be focused solely on providing the devices necessary to allow Linux to run, without trying to “look like” an ordinary PC from a guest perspective.

The way that I/O requests are forwarded from a guest VM to the hypervisor, and from the hypervisor to the VMM, will inevitably vary by architecture and hypervisor, but as a typical example, a VMM might inform the hypervisor of ranges of guest-physical memory space corresponding to MMIO-based I/O devices whose accesses should trap to the VMM. When the guest performs a load or store to/from this memory space, the hypervisor traps and forwards the memory request to the VMM; the VMM then processes the request and (for a load) returns the result of the load operation to the hypervisor. In this way, all I/O devices are simulated by the VMM — whether a virtio block device, an emulation of a 20-year-old network controller which was once popular, an emulated optical drive, USB controller or SCSI HBA. This also means that radically different personalities of machine can be built on top of a generic, policy-free hypervisor simply by implementing a different VMM.

VMMs and KVM. This brings us to KVM. The Linux kernel's KVM hypervisor functionality features a clean separation between hypervisor and VMM. What makes KVM interesting in particular is that a KVM VMM is simply an ordinary Linux process. Launching a hardware-virtualised VM on Linux can be done simply by spawning a VMM process, and that VM then runs for as long as the process runs. In this way, VMs can be spawned, managed and terminated just like ordinary Linux processes.

The most popular VMM for use with KVM is qemu, which can function both as an emulator and as a VMM for the KVM hypervisor. qemu began as an emulator; since an emulator needs to emulate I/O devices as well as a CPU, qemu has ended up with support for simulating a prodigious array of I/O devices. Since a VMM exists largely to implement I/O devices, this made qemu a natural basis for extension to function as a KVM VMM. The only real difference is how the compute (that is, non-I/O) component is provided: whether by software emulation in userspace or by asking KVM to run code under hardware virtualisation. Indeed, when launching qemu, the only difference between running it as an emulator and running it as a KVM VMM is the -enable-kvm flag. There is otherwise little perceptible difference from a usage perspective, other than performance.

A VMM such as qemu begins by directing the hypervisor to create a hardware VM; after that, I/O requests are funnelled by the hypervisor to qemu and responded to, in much the same way that a network daemon handles application requests and provides responses. KVM passes I/O requests to the process which asked for the VM to be created, which effectively supervises the VM. (Purely for purposes of illustration, there's really no reason you couldn't make a VMM which proxies all I/O requests from the hypervisor as HTTP requests to some endpoint. This is obviously a terrible idea and would perform awfully, but it illustrates that, after initialisation, the VMM largely just functions as a server processing requests.)

Writing a VMM for KVM

Design of KVM. It has sometimes been remarked that Linux, or particularly its userspace ABI, is not so much designed as something that grows (or perhaps congeals). There are many aspects of the Linux userspace interface that leave something to be desired (for example, epoll). For this reason, I was quite positively surprised when I started investigating the syscall interface between qemu as a VMM and Linux's KVM hypervisor, as KVM's design in this regard turns out to be surprisingly clean and easy to use.

You might expect that implementing a VMM more or less requires use of a systems language such as C or Rust. Interestingly, this turns out not to be the case. In actuality, the KVM API turns out to be simple enough to use that it is entirely feasible to write a VMM for Linux in Python.

So I decided to do that, just for the hell of it.

The KVM API. We begin using KVM by opening the device /dev/kvm. That's very easy to do:

kvm_fd = os.open('/dev/kvm', os.O_RDWR | os.O_CLOEXEC)

Incidentally, you don't have to be root to use KVM. This is another beneficial product of the hypervisor-VMM separation; since all I/O is handled by our VMM, which is simply a normal userspace process, creating a VM doesn't give us access to anything we don't already have access to. Thus, assuming the permissions on /dev/kvm haven't been changed, non-privileged processes are free to use KVM arbitrarily.

Once we have this device open (the KVM FD), most KVM operations are simply ioctl calls. For example, we can query the version of the KVM API:

api_ver = fcntl.ioctl(kvm_fd, KVM_GET_API_VERSION)

(Here, KVM_GET_API_VERSION is a constant value found in Linux headers. I've gone to the trouble of defining these in Python, but for brevity, won't show those definitions here.)
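
For illustration, though, here is a minimal sketch of what such definitions can look like, along with the imports the snippets above assume. The values follow the kernel's _IO() macro and the KVMIO magic number 0xAE from linux/kvm.h; verify them against your own headers rather than trusting this sketch.

import fcntl
import os

KVMIO = 0xAE

def _IO(ioctl_type, nr):
    # _IO() marks an ioctl which transfers no data: the direction and size
    # fields of the ioctl number are zero, leaving only the magic and number.
    return (ioctl_type << 8) | nr

KVM_GET_API_VERSION = _IO(KVMIO, 0x00)
KVM_CREATE_VM       = _IO(KVMIO, 0x01)
KVM_CREATE_VCPU     = _IO(KVMIO, 0x41)
KVM_RUN             = _IO(KVMIO, 0x80)

Operations which pass a structure instead use the related _IOR()/_IOW() macros, which also encode the structure's size; we'll see those a little later.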

There are various operations we can perform on the KVM device, but most of them just allow us to query the capabilities of the hypervisor on the current hardware. The interesting operation is KVM_CREATE_VM, which creates a new file descriptor, a VM FD. This means that after opening /dev/kvm, we can create arbitrarily many virtual machines, each represented by its own additional FD. The actual call is incredibly simple:

vm_fd = fcntl.ioctl(kvm_fd, KVM_CREATE_VM, 0)

—and now you have a VM.

There is one more FD we need before we can do anything interesting, which is a VCPU FD. A virtual machine can, of course, have multiple virtual CPUs attached to it, so each VCPU FD represents one virtual CPU in a virtual machine. These are created using the KVM_CREATE_VCPU operation, which is also very easy to use:

vcpu_idx = 0
vcpu_fd = fcntl.ioctl(vm_fd, KVM_CREATE_VCPU, vcpu_idx)

Here, vcpu_idx is just an ordinal identifier for the VCPU we can use to distinguish it.

Notice that the file descriptors created here form a logical tree of objects, which looks like this:

KVM Master FD (/dev/kvm)
  KVM VM FD
    KVM VCPU FD
    ...
  ...

Once one or more VCPUs have been created, we can actually cause the hypervisor to begin executing the VM. For something of such significance, the call is remarkably simple:

fcntl.ioctl(vcpu_fd, KVM_RUN, 0)

There are some interesting things going on here. Firstly, this call is synchronous; it doesn't return until the hypervisor traps back to the VMM due to some event occurring that requires VMM handling.

Secondly, if we want to attach multiple virtual CPUs to a VM, the VMM does this simply by spinning up multiple threads and VCPU FDs, and calling KVM_RUN on each of those threads for the corresponding VCPU FD. In other words, multi-VCPU VMs are literally implemented (from the VMM's perspective) using completely standard userspace threading primitives. This continues the theme of KVM allowing VMs to be managed as ordinary processes; it's a pleasant surprise just how “ordinary” the KVM API is.
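
As a minimal sketch of what that looks like in Python (handle_exit and num_vcpus here are hypothetical stand-ins for the VMM's own exit handling and configuration, not part of the KVM API):

import threading

def run_vcpu(vcpu_fd):
    while True:
        fcntl.ioctl(vcpu_fd, KVM_RUN, 0)  # blocks until this VCPU exits back to the VMM
        handle_exit(vcpu_fd)              # hypothetical: service whatever caused the exit

for vcpu_idx in range(num_vcpus):
    vcpu_fd = fcntl.ioctl(vm_fd, KVM_CREATE_VCPU, vcpu_idx)
    threading.Thread(target=run_vcpu, args=(vcpu_fd,), daemon=True).start()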

The need for a separate object for each VCPU (rather than, say, allowing a VMM to just call KVM_RUN on the same VM FD from multiple threads) arises from the fact that each VCPU has its own register state and other state that needs to be tracked when control is returned to the VMM. For example, each VCPU stores the saved values of its architectural registers (so on x86, EAX, EBX, etc.) while it is not currently executing.

Suppose we have created one VM and one VCPU within that VM. There are some more operations we now need to invoke before we can usefully call KVM_RUN:

  • Operations to set up the VM:

    • KVM_CREATE_PIT2 and KVM_CREATE_IRQCHIP: Previously, I mentioned that the hypervisor implements almost no I/O devices itself and instead leaves that to the VMM. There are some small exceptions to this, most notably the interrupt controllers (and the PIT timer), which are emulated by the hypervisor itself, presumably for performance reasons. Invoking these operations on the VM FD tells KVM that we want to use the hypervisor's emulation of these devices.

    • KVM_SET_USER_MEMORY_REGION: This single operation is very important, as it allows us to map memory into the VM's guest physical address space. As you might expect, all VCPUs in a KVM VM share the same address space. The arguments to this operation include the desired guest physical address, the amount of memory, and a pointer to the memory in the VMM to be used to back it. In other words, providing guest-physical memory to the guest is as simple as allocating memory in the VMM in an ordinary manner and passing KVM a pointer to it. (Obviously, the memory does need to be page-aligned, so you would need to use mmap rather than malloc, but this is the only real complication. This is again a remarkably “normal” way of doing things.)

      This also means that we, the VMM, have direct access to the memory mapped into the VM, allowing us to easily simulate “DMA”, etc.

      The argument structure passed to this ioctl looks like this:

      struct kvm_userspace_memory_region {
        uint32_t slot;
        uint32_t flags;           /* e.g. READONLY */
        uint64_t guest_phys_addr;
        uint64_t memory_size;     /* bytes */
        uint64_t userspace_addr;  /* start of the userspace allocated memory */
      };

      You can configure multiple ranges of memory, each identified by a different “slot number” passed in this call. (A sketch of this call in Python appears just after this list.)

    • KVM_SET_TSS_ADDR: This is an odd call apparently required only due to a quirk of Intel's hardware virtualisation extensions. See the KVM documentation here for details.

  • Operations to get and set the architectural state of the VCPU, such as KVM_(GET|SET)_(REGS|SREGS|FPU|MSR|LAPIC|CPUID2|GUEST_DEBUG):

    • REGS controls the architectural GPRs (on x86, EAX, EBX, etc.) and the instruction pointer; using it is also sketched after this list.
    • SREGS controls the x86 segment registers.
    • FPU controls the FPU registers.
    • MSR controls the x86 model-specific registers (MSRs). Despite the name, most of these aren't actually model-specific. x86 MSRs are numbered registers which can only be accessed in kernel mode, and are often used by kernels to control CPU operation. Thus, if we want to be able to virtualise arbitrary operating systems, we need to simulate MSRs just as real hardware provides them. Intel/AMD hardware virtualisation extensions provide special support for this, and this KVM API lets us configure it.
    • LAPIC allows us to control the state of the hypervisor-simulated x86 local interrupt controller which should be used when we start executing the VCPU.
    • CPUID2 allows us to configure what the x86 cpuid instruction reports when a guest VM executes it.
    • GUEST_DEBUG allows us to configure the CPU's architectural debug registers.
  • The KVM_IRQ_LINE operation on a VM FD, which allows us to flip IRQs on and off on the hypervisor-simulated interrupt controllers.
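
To make a couple of these operations concrete, here's a rough sketch (not the toy VMM's actual code) of registering 16 MiB of guest RAM with KVM_SET_USER_MEMORY_REGION and then pointing the VCPU's instruction pointer at a purely illustrative address with KVM_GET_REGS/KVM_SET_REGS. It assumes the KVMIO constant and imports defined earlier, plus the vm_fd and vcpu_fd from above; the ioctl numbers come from the kernel's _IOW()/_IOR() macros and the structure sizes in current headers, so as before, check them against your own linux/kvm.h.

import ctypes
import mmap
import struct

def _IOW(ioctl_type, nr, size):
    # Ioctl whose argument is copied from userspace into the kernel.
    return (1 << 30) | (size << 16) | (ioctl_type << 8) | nr

def _IOR(ioctl_type, nr, size):
    # Ioctl whose argument is copied from the kernel back to userspace.
    return (2 << 30) | (size << 16) | (ioctl_type << 8) | nr

# struct kvm_userspace_memory_region is 32 bytes (u32, u32, u64, u64, u64);
# struct kvm_regs is 144 bytes (18 x u64: rax..r15, rip, rflags).
KVM_SET_USER_MEMORY_REGION = _IOW(KVMIO, 0x46, 32)
KVM_GET_REGS               = _IOR(KVMIO, 0x81, 144)
KVM_SET_REGS               = _IOW(KVMIO, 0x82, 144)

# Allocate page-aligned backing memory in the VMM and hand it to KVM as
# 16 MiB of guest-physical RAM at guest-physical address 0, in slot 0.
MEM_SIZE = 16 * 1024 * 1024
guest_mem = mmap.mmap(-1, MEM_SIZE)
backing_addr = ctypes.addressof(ctypes.c_char.from_buffer(guest_mem))
region = struct.pack('=IIQQQ', 0, 0, 0, MEM_SIZE, backing_addr)
fcntl.ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, region)

# Read the VCPU's registers, point RIP at an illustrative entry point and
# write them back. (Bit 1 of RFLAGS must always be set on x86.)
regs_buf = bytearray(144)
fcntl.ioctl(vcpu_fd, KVM_GET_REGS, regs_buf)
regs = list(struct.unpack('18Q', regs_buf))
regs[16] = 0x1000   # rip
regs[17] = 0x2      # rflags
fcntl.ioctl(vcpu_fd, KVM_SET_REGS, struct.pack('18Q', *regs))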

The above operations are basically all you need to write a KVM VMM. My own toy Python VMM is expressed solely in terms of the operations above, save for a few operations to query hardware capabilities at startup.

Memory-mapped region. There's really only one slight peculiarity of the KVM API which we need to take care of to make a working VMM. The KVM hypervisor exposes a special region of shared memory, the “run structure”, to the VMM. This allows the hypervisor and VMM to pass information more efficiently than if the VMM had to make repeated syscalls to modify VM state. In order to get access to this region of shared memory, we simply need to call mmap on the VCPU FD:

run_base = kvmapi.mmap(-1, map_length,
                       mmap.PROT_READ | mmap.PROT_WRITE,
                       mmap.MAP_SHARED, vcpu_fd, 0)

Since run_base is a raw pointer value, we need to do a bit of wrangling to get access to the region in Python, which I don't show here.
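
Alternatively (and purely as a sketch, not necessarily what the toy VMM actually does), Python's standard mmap module can perform the mapping itself, which avoids the raw pointer entirely. The length to map is queried with KVM_GET_VCPU_MMAP_SIZE, which is _IO(KVMIO, 0x04) in current kernel headers:

import mmap

KVM_GET_VCPU_MMAP_SIZE = _IO(KVMIO, 0x04)

map_length = fcntl.ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0)
run = mmap.mmap(vcpu_fd, map_length,
                mmap.MAP_SHARED,
                mmap.PROT_READ | mmap.PROT_WRITE)

The resulting run object supports the buffer protocol, so its fields can be read and written with struct.unpack_from and struct.pack_into.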

The run structure facilitates bidirectional transfer of information. We can set various fields in it before calling KVM_RUN; when KVM_RUN returns, the hypervisor will have written information to it indicating why control was passed back to the VMM. There are a variety of exit reason codes, but some interesting ones are:

  • KVM_EXIT_IO, meaning the guest VM has executed an old-fashioned x86 port I/O instruction (inb, outb, etc.)

    When KVM_RUN returns with this exit reason, it means the hypervisor is asking us to handle this request and provide the result of the operation. Thus the VMM can completely control the handling of all port I/O requests.

  • KVM_EXIT_MMIO, meaning the guest VM has attempted to access a memory address which hasn't been mapped to a guest-physical memory allocation using KVM_SET_USER_MEMORY_REGION. This works much the same as KVM_EXIT_IO; the VMM handles the request, writes the result to the run structure (e.g. for a read operation, the value read), and re-enters the VCPU using KVM_RUN. The kernel assumes such accesses are intended for some MMIO device implemented by the VMM, hence the name; however, the VMM can handle such an event however it likes.

A Python VMM. The above provides a pretty much complete picture of what is needed to write a VMM against the KVM API. There are some aspects which are slightly hairy to implement in Python, like handling mmap, but even this is not that hard to do. Ultimately the job of a VMM can be summarised by the following pseudocode:

Create VM and VCPU objects
Set initial VCPU state (registers, etc.)

Enter the VCPU
Loop forever:
  Handle the request which caused the VCPU to exit
    (usually a load/store of a MMIO address)
  Reenter the VCPU
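
With the pieces from earlier (including the mmap'd run object), the core of that loop might look roughly like the following in Python. This is a sketch: the exit_reason field sits at byte offset 8 of struct kvm_run in current kernel headers, KVM_EXIT_IO and KVM_EXIT_MMIO are 2 and 6 respectively, and the handler functions are hypothetical placeholders for the VMM's device emulation.

KVM_EXIT_IO   = 2
KVM_EXIT_MMIO = 6

while True:
    fcntl.ioctl(vcpu_fd, KVM_RUN, 0)

    # The hypervisor has trapped back to us; find out why.
    exit_reason, = struct.unpack_from('=I', run, 8)

    if exit_reason == KVM_EXIT_IO:
        handle_port_io(run)    # hypothetical: emulate a port I/O device
    elif exit_reason == KVM_EXIT_MMIO:
        handle_mmio(run)       # hypothetical: emulate an MMIO device
    else:
        raise Exception(f'unhandled exit reason {exit_reason}')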

My toy Python VMM was hacked together over a day or two, and successfully boots OVMF (which is the build of the reference implementation of UEFI intended for use in VMs; this is what qemu gives you if you ask for UEFI support) and Linux. It has a basic framebuffer which uses SDL2 as well as a basic virtio-scsi implementation. Even Debian's install ISO boots correctly in text mode, although it fails once it tries to install GRUB, probably due to OVMF trying to access some non-existent I/O device to update UEFI variables. No network devices are implemented.

You can find the code here.

Of course, nothing about this is intended for serious use, and the I/O devices implemented are just barely enough to get OVMF and Linux working, but hopefully people will find it interesting as a demonstration of just how accessible the KVM API actually is.

Ecosystem of KVM VMMs. In actuality, there's a plethora of KVM-based VMMs available in the Linux ecosystem nowadays. It appears that the good design of the KVM API, which allows it to be used by an ordinary process, and its clean separation of hypervisor and VMM concerns have unlocked a lot of innovation in the VMM space:

  • Besides qemu, many alternative VMMs emphasise security by using memory-safe languages, such as the rust-vmm project (a set of Rust crates for building VMMs), Amazon's Firecracker (also Rust) or Google's gVisor (which demonstrates that a KVM VMM can even be written in Go). These have the advantage of implementing a VMM's I/O devices in a memory-safe language. Note that since a KVM VMM is just an ordinary userspace process, it can also be locked down and sandboxed in the same way as any other userspace process in case it gets compromised (for example, due to a vulnerability in the implementation of an emulated I/O device). Many of these VMMs do this to provide defence in depth.

  • Some of these VMMs also emphasise lightweight operation and are the basis of “micro-VMs”1. This demonstrates that the KVM API itself is very lightweight, with costs comparable to those of an ordinary process; where a VM is considered “heavy”, that weight is really added by the VMM or, even more likely, the guest operating system.

Playing with the demo. You can find basic usage instructions for Nix and non-Nix users in the README. If you don't have an OVMF image to hand, I've provided a pre-built image here (courtesy of Nixpkgs). Here's a video of it booting:

[Video of the VMM booting.]

While the basic structure of the KVM API is the same for different architectures, the details necessarily differ in many areas from architecture to architecture; thus, this will only work on x86-64 machines.


1. Micro-VMs are basically the product of a desire to offer something container-like on Linux which can nonetheless serve as a safe security boundary for multi-tenant operation.

The inability of Linux to offer safe container-based isolation for multi-tenant environments is a notable product of Linux's technical debt in this area. Since protected memory architectures obviously provide all the necessary hardware support to facilitate isolation, there is no reason in principle why containers cannot be fully and safely isolated under one kernel. In fact, competitors to the Linux ecosystem have successfully done so. For example, Illumos zones, which are Illumos's equivalent of containers, are designed to be secure in multi-tenant environments, even to the point where they were the basis for a container-based public multi-tenant cloud compute platform run by Joyent. Different tenants were scheduled in separate zones running under the same Illumos kernel successfully for many years, as Illumos's zones feature was designed from the outset to offer the standard of assurance necessary for safe multi-tenancy.

By comparison, this assurance has historically been lacking for Linux containers, which are not actually a distinct feature in terms of kernel functionality but are simply an assemblage of different kinds of namespacing. As the container ecosystem on Linux has become more developed, the more obscure aspects of system state requiring namespacing have become namespaced, but this was not always the case; for example, originally Linux containers could not safely contain a root (UID 0) user inside a container, whereas this is now possible due to the addition of user namespaces. As Bryan Cantrill once remarked, Linux isn't designed. This gradual and haphazard accumulation of different pieces of container-related functionality, until something resembling an actually effective isolation solution can be assembled in userspace from the pieces, feels illustrative of the differences between the approaches taken by Linux and by other *nix OSes (hence why the good design of the KVM API has been so surprising to me). While Linux containers now seem to offer something actually secure even for multi-tenancy — as demonstrated by the willingness of cloud CI providers to run user CI code in containers with access to UID 0 — the hyperscalers appear more reluctant to trust this isolation, hence their turn instead to micro-VMs to offer an isolation barrier with container-like properties.