Running QEMU/KVM Virtual Machines in Unprivileged LXD Containers
Introduction
Sometimes you only have your laptop or a very limited environment but need to test fairly complex setups that involve multiple hosts and network segments. For some purposes containers are good enough, and LXD allows creating containers that provide abstractions similar to full virtual machines. There are other scenarios, however, where real virtual machines are needed.
Containers and Kernel Isolation
In a simplified view, containers can be described as processes isolated via Linux kernel mechanisms. The mechanisms mentioned below can be combined to create different varieties of isolation for containers in general - there is no strict definition of what should or should not be enabled (a quick illustration of the user namespace mechanism follows the list):
- 7 kernel namespaces (mount, pid, user, ipc, network, uts, cgroup);
- cgroups - process grouping, resource limits (CPU, RAM, network, block I/O), device special file access (mknod, open), freezing tasks;
- capabilities (e.g. CAP_SYS_ADMIN is required to issue the mount(2) system call);
- LSM (AppArmor);
- seccomp - system call filtering using BPF by system call number and arguments;
- file system isolation: a container usually gets its own root file system. The root file system is a process property which can be changed using pivot_root(2) and is inherited by child processes;
- process limits set via prlimit(2).
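As a quick illustration of the user namespace mechanism, an unprivileged user can create one and see uid 0 inside mapped to their own uid outside. A minimal sketch using unshare(1) from util-linux:

```
# Create a new user namespace and map uid 0 in it to the invoking
# (unprivileged) uid; then show the effective uid and the mapping.
unshare --user --map-root-user sh -c 'id -u; cat /proc/self/uid_map'
# Typical output: "0" followed by a single mapping line such as
# "0 1000 1" (uid 0 inside -> your uid outside, length 1).
```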
QEMU/KVM
QEMU is a userspace program to run virtual machines - an emulator. KVM is a kernel module that allows userspace processes to utilize Intel (VT-x) or AMD (AMD-V) virtualization technologies, present in almost every modern CPU, to avoid some of the overhead associated with full emulation done by QEMU (memory management, interrupts, timers etc.). QEMU can utilize KVM through the ioctl(2) interface that KVM provides via the character special file /dev/kvm. This is what people call “QEMU/KVM” or just “KVM”.
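A quick way to check that KVM is usable on a given host (a sketch; kvm-ok comes from the cpu-checker package on Ubuntu):

```
lsmod | grep '^kvm'    # expect kvm plus kvm_intel or kvm_amd
ls -l /dev/kvm         # the character special file QEMU talks to via ioctl(2)
kvm-ok                 # prints whether KVM acceleration can be used
```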
There are other technologies that remove even more emulation overhead:
- vhost is used to move the user-space device emulation implementation of the data plane into kernel space (vhost, vhost_net, vhost_scsi, vhost_vsock kernel modules) and avoid system call overhead. vhost_vsock specifically allows a guest to use special sockets to communicate with a hypervisor (host) more efficiently and with fewer modifications than with serial devices;
- vhost-user is used to move the user-space device emulation implementation of the data plane to a different userspace application. This is mainly used for user-space driver implementations that bypass the kernel stack completely - DPDK- or SPDK-based applications are a good example. For networking this is used in OVS or Snabb to speed up packet processing by avoiding extra context switches and mode switches for interrupt processing. For storage the problem is similar: fast NVMe devices generate so many interrupts that the CPU becomes a bottleneck. Such technologies utilize memory locking and huge pages, with dedicated threads isolated from load balancing by the kernel scheduler, to process ring buffers.
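For example, here is how a guest can be attached to an existing tap device with the in-kernel vhost-net data plane enabled (a minimal sketch; the memory size, image path and tap device name are assumptions, and tap0 is expected to exist already):

```
# vhost=on moves virtio-net queue processing into the vhost_net module.
qemu-system-x86_64 -enable-kvm -m 2048 -cpu host \
  -drive file=guest.qcow2,if=virtio \
  -netdev tap,id=net0,ifname=tap0,script=no,downscript=no,vhost=on \
  -device virtio-net-pci,netdev=net0
```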
Unprivileged Containers
With the above in mind it might seem like we absolutely need a privileged container to run virtual machines. This is not entirely true because there are different kinds of privileges. In the Linux world there are privileged processes, as capabilities(7) mentions:
For the purpose of performing permission checks, traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero). Privileged processes bypass all kernel permission checks, while unprivileged processes are subject to full permission checking based on the process’ credentials (usually: effective UID, effective GID, and supplementary group list).
The LXC security page is fairly clear and coherent with that definition, but for containers:
Privileged containers are defined as any container where the container uid 0 is mapped to the host’s uid 0. In such containers, protection of the host and prevention of escape is entirely done through Mandatory Access Control (apparmor, selinux), seccomp filters, dropping of capabilities and namespaces.
Unprivileged containers are safe by design. The container uid 0 is mapped to an unprivileged user outside of the container and only has extra rights on resources that it owns itself.
It is clear that in an unprivileged container a separate user namespace is created, while in a privileged container it is not (in other words, CLONE_NEWUSER is either passed or not passed to the clone(2) or unshare(2) system calls).
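This is easy to observe from inside a running container: /proc/self/uid_map shows how container uids map onto host uids (the container name c1 is an assumption, and the exact host uid range depends on /etc/subuid):

```
lxc exec c1 -- cat /proc/self/uid_map
# Unprivileged container - uid 0 maps to a high, unprivileged host range:
#        0     100000      65536
# A privileged container would show the identity mapping instead:
#        0          0 4294967295
```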
How does that help with creating virtual machines?
Virtual Machines in Unprivileged Containers
- clone(2) or fork(2) can be used to create new processes, and the exec family of calls can be used to execute new binaries, in unprivileged containers just fine - and a QEMU binary falls into that category;
- QEMU/KVM expects a few kernel modules to be loaded, mainly kvm, kvm_intel or kvm_amd, and tap. Performance-wise, depending on your setup, the vhost, vhost_net, vhost_scsi and vhost_vsock modules are also useful. If VFIO needs to be used, at least vfio and vfio-pci are needed as well, but this requires host sysfs access, which is a bit more involved (you may also need to load and control a hardware device driver via its own character special files);
- QEMU/KVM needs access to a number of character special files: /dev/kvm, /dev/net/tun, /dev/vhost-net, /dev/vhost-scsi and /dev/vhost-vsock. For VFIO, /dev/vfio/vfio and potentially other driver-specific character special files are needed;
- the libvirt daemon manages QEMU processes, and they go through a daemonization procedure to stay running even if libvirtd exits. Libvirt uses some kernel functionality, including the bridge module and cgroups.
Other than the VFIO-related modules and character special files, or customizations required for functionality such as huge pages, there is not a lot to enable.
Character special files can be used to run module-specific operations via the generic ioctl(2) interface. Some ioctls require certain capabilities(7) to be present, but this is module-specific - there has to be special code in a kernel module to enforce that. There is no explicit access control besides file permissions or, if present, LSM-based mandatory access control (e.g. AppArmor).
For that reason, provided that the modules are loaded by a privileged user before a container starts or at its runtime, there are no barriers to running accelerated virtual machines in containers. With LXD, both the module loading and the device files can be configured per container, as sketched below.
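A hedged sketch of per-container configuration, assuming a container named c1 (add kvm_intel or kvm_amd to the module list depending on your CPU):

```
# Ask LXD to load the required modules before the container starts:
lxc config set c1 linux.kernel_modules kvm,vhost,vhost_net,vhost_scsi,vhost_vsock
# Pre-create the character special files inside the container:
lxc config device add c1 kvm unix-char path=/dev/kvm
lxc config device add c1 tun unix-char path=/dev/net/tun
lxc config device add c1 vhost-net unix-char path=/dev/vhost-net
lxc config device add c1 vhost-scsi unix-char path=/dev/vhost-scsi
lxc config device add c1 vhost-vsock unix-char path=/dev/vhost-vsock
```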
Outdated Requirements
There are some outdated requirements with regard to character special files which should be ignored:
- /dev/kqemu
- /dev/rtc
- /dev/hpet

kqemu is long gone; the same goes for rtc and hpet.
Breaking Ground
LXD supports a number of useful ways to configure containers, including preseeding cloud-init network-data or user-data for Ubuntu cloud images, whose container templates have the necessary instrumentation. The following functionality will be used:
- LXD profiles;
- cloud-image templates and cloud-init network-data and user-data;
- LXD-side pre-loading of kernel modules before containers are started;
- ability to pre-create character special files for containers;
- storage pools.
Below is the template that can be used to configure an LXD profile:
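A sketch of such a profile, assembled from the module and device lists above; the profile name, the cloud-init package list and the exact device set are illustrative assumptions (add kvm_intel or kvm_amd to linux.kernel_modules depending on your CPU, and the VFIO entries if you need device passthrough):

```
name: vm-host
description: Unprivileged container that can run QEMU/KVM guests
config:
  linux.kernel_modules: kvm,vhost,vhost_net,vhost_scsi,vhost_vsock
  user.user-data: |
    #cloud-config
    packages:
      - qemu-kvm
      - libvirt-daemon-system
      - libvirt-clients
devices:
  kvm:
    type: unix-char
    path: /dev/kvm
  tun:
    type: unix-char
    path: /dev/net/tun
  vhost-net:
    type: unix-char
    path: /dev/vhost-net
  vhost-scsi:
    type: unix-char
    path: /dev/vhost-scsi
  vhost-vsock:
    type: unix-char
    path: /dev/vhost-vsock
```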
Basic virt-host-validate checks pass when used with the profile above.
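A possible way to apply it end to end (the image alias and container name are assumptions):

```
lxc profile create vm-host
lxc profile edit vm-host < vm-host.yaml
lxc launch ubuntu:20.04 vmhost1 -p default -p vm-host
# Once cloud-init has finished installing packages, validate from inside:
lxc exec vmhost1 -- virt-host-validate qemu
```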