Virtiofs is a shared file system that lets virtual machines access a directory tree on the host. Unlike existing approaches, it is designed to offer local file system semantics and performance.
File systems can be classified as local or remote. A local file system can only be mounted by one host at a time. Processes accessing a local file system simultaneously can rely on strong coherency because all accesses go through a single mount.
Remote file systems such as NFS allow multiple hosts to mount a file system simultaneously. Offering strong coherency is expensive due to the communication overhead between the client and the server. As a result, remote file systems often offer weaker coherency than local file systems.
Virtualization allows multiple virtual machines (VMs) to run on a single physical host. Although VMs are isolated and run separate operating system instances, their proximity on the physical host allows for fast shared memory access. Both the semantics and the performance of communication between co-located VMs differ from the networking model for which remote file systems were designed.
Existing shared file systems for VMs use remote file systems such as NFS, 9P, or custom RPC protocols with similar characteristics. It is possible to design a new file system that takes advantage of the proximity of VMs to achieve semantics and performance more like local file systems. This is desirable both for performance and for application compatibility.
The most basic use case for shared file systems for VMs is the ability to share a directory between the VM and the hypervisor. This is a common requirement during VM provisioning when files must be available to the VM for installation. Another common requirement is ad-hoc sharing of directories since it is more convenient and performant than copying files between the VM and hypervisor over the network.
During test and development it can be easier to work with a directory tree than a disk image file. Files edited on the hypervisor are immediately visible to the VM through a shared file system. VMs can even boot from a shared root file system, making the development and test cycle extremely quick since no disk images need to be built and files do not need to be extracted after testing.
It is expected that virtualization management tools will offer the ability to share directories between the hypervisor and VMs using virtiofs.
It is good practice to hide network storage and distributed storage systems from VMs. This has security benefits since VMs do not require access to storage networks or credentials. It also becomes possible to roll out new backend storage systems without reconfiguring all VMs.
It is expected that cloud management infrastructure will configure virtiofs for VMs so that backend storage systems like Ceph, NFS, or GlusterFS can be accessed.
Lightweight VMs and container VMs require fast VM access to container images or root file systems on the hypervisor with minimal memory footprint. Applications running inside the VM may depend on local file system semantics and be incompatible with 9P or other remote file systems.
It is expected that lightweight VM and container VM management tools will share files using virtiofs with tuning to reduce the VM memory footprint.
The following figure shows the components involved in virtiofs:
The virtiofsd file system daemon runs on the hypervisor and handles FUSE protocol requests from the VM. The Linux FUSE protocol provides a vocabulary of file system operations. FUSE was chosen because it is a mature protocol that closely models the Linux VFS, making it possible to provide the semantics of a local file system. The FUSE protocol evolves alongside the Linux source code without a lengthy standardization process, making it suitable for rapidly developing new features.
File I/O is performed on behalf of the VM by virtiofsd using system calls. An underlying file system on the hypervisor performs the file I/O. The underlying file system can be a local file system or a remote file system.
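As a rough illustration of how a file system daemon might service reads and writes with ordinary system calls, the following Python sketch uses positional I/O on a host file descriptor. The names `serve_read` and `serve_write` are hypothetical and are not taken from virtiofsd; a real daemon would also validate the request and translate a FUSE file handle into a descriptor before reaching this point.

```python
import os
import tempfile


def serve_read(fd: int, offset: int, size: int) -> bytes:
    """Read up to `size` bytes at `offset` from an open host descriptor.

    pread() does not move the descriptor's file offset, so several
    requests can be served from the same descriptor independently.
    """
    return os.pread(fd, size, offset)


def serve_write(fd: int, offset: int, data: bytes) -> int:
    """Write `data` at `offset` and return the number of bytes written."""
    return os.pwrite(fd, data, offset)


if __name__ == "__main__":
    # The underlying file system here is whatever backs the temp directory.
    with tempfile.TemporaryDirectory() as shared_dir:
        fd = os.open(os.path.join(shared_dir, "example"),
                     os.O_RDWR | os.O_CREAT, 0o600)
        try:
            serve_write(fd, 0, b"hello world")
            print(serve_read(fd, 6, 5))
        finally:
            os.close(fd)
```

Because the daemon only issues system calls, the same code path works whether the underlying file system on the hypervisor is local or remote.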
The VM can memory map contents of files through the DAX Window that virtiofs provides. Mappings are set up through FUSE requests to virtiofsd, which then communicates with QEMU to establish the memory mapping for the VM. This interaction with QEMU is necessary because KVM uses the virtual memory of the QEMU process for memory address translation. The VM can remove mappings in a similar fashion.
The virtiofs VIRTIO device is implemented in QEMU but the VM communicates directly with the vhost-user device backend in virtiofsd for most operations. This allows virtiofsd to run as a separate process from QEMU and with its own sandboxing.
The FUSE client inside the VM must support the virtiofs protocol extensions and implement the virtiofs VIRTIO device specification.
QEMU implements the virtiofs VIRTIO device specification and delegates most operations to a vhost-user device backend so that the file system can execute as a separate process.
virtiofsd is a vhost-user device backend that implements the file system operations.
Although virtiofs uses FUSE as the protocol, it does not function as a new transport for existing FUSE applications. It is not possible to run existing FUSE file systems unmodified because virtiofs has a different security model and extends the FUSE protocol.
Existing FUSE file systems trust the client because it is the kernel. There would be no reason for the kernel to attack the file system since the kernel already has full control of the host. In virtiofs the client is the untrusted VM and the file system daemon must not trust it. Therefore, virtiofsd uses a hardened FUSE implementation that does not trust the client.
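To make the idea of not trusting the client concrete, the sketch below checks the length field of a FUSE request header against the bytes the client actually supplied before using it. The 40-byte layout follows `struct fuse_in_header` from the Linux FUSE ABI; the little-endian byte order, the `max_request_size` limit, and the names `parse_request` and `BadRequest` are assumptions made for illustration and do not describe the actual virtiofsd code.

```python
import struct

# struct fuse_in_header: len, opcode (u32), unique, nodeid (u64),
# uid, gid, pid (u32), followed by 4 trailing bytes that are skipped here.
# Little-endian byte order is an assumption of this sketch.
FUSE_IN_HEADER = struct.Struct("<IIQQIII4x")


class BadRequest(Exception):
    """Raised when a request from the untrusted client is malformed."""


def parse_request(buf: bytes, max_request_size: int = 1 << 20):
    """Return (header fields, payload) or raise BadRequest.

    `max_request_size` is a hypothetical limit chosen for this example.
    """
    if len(buf) < FUSE_IN_HEADER.size:
        raise BadRequest("buffer is shorter than the request header")
    length, opcode, unique, nodeid, uid, gid, pid = \
        FUSE_IN_HEADER.unpack_from(buf)
    if length < FUSE_IN_HEADER.size:
        raise BadRequest("claimed length is smaller than the header")
    if length > len(buf):
        raise BadRequest("claimed length exceeds the supplied buffer")
    if length > max_request_size:
        raise BadRequest("request is larger than the configured limit")
    fields = {"len": length, "opcode": opcode, "unique": unique,
              "nodeid": nodeid, "uid": uid, "gid": gid, "pid": pid}
    # Only the bytes covered by the validated length are handed on.
    return fields, buf[FUSE_IN_HEADER.size:length]
```

Every later step that indexes into the payload can then rely on the validated length rather than on a value the VM controls.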
The DAX Window is an extension to the FUSE protocol that supports memory mapping the contents of files. The virtiofs VIRTIO device implements this as a shared memory region exposed through a PCI BAR. This feature is virtualization-specific and is not available outside of virtiofs.
Additional FUSE protocol extensions are expected for future virtiofs features.
The virtiofs VIRTIO device is the interface through which the VM and the hypervisor communicate. Each virtiofs device exports one shared file system and is identified by a name called a tag. Multiple virtiofs devices can be added to a VM and mounted by tag.
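As a small illustration of mounting by tag, the sketch below builds the command a Linux guest would typically run, `mount -t virtiofs <tag> <mountpoint>`. The helper name `mount_command`, the example tags, and the mount points are hypothetical. The 36-byte limit reflects the size of the tag field in the VIRTIO device configuration as we understand it and should be treated as an assumption here.

```python
# Assumed size of the tag field in the device configuration space.
MAX_TAG_BYTES = 36


def mount_command(tag: str, mountpoint: str) -> list[str]:
    """Return the argv for mounting the virtiofs device named `tag`.

    The command is only constructed, not executed, because running it
    requires root privileges and a virtiofs device inside the guest.
    """
    encoded = tag.encode("utf-8")
    if not encoded or len(encoded) > MAX_TAG_BYTES or b"\0" in encoded:
        raise ValueError(f"invalid virtiofs tag: {tag!r}")
    return ["mount", "-t", "virtiofs", tag, mountpoint]


if __name__ == "__main__":
    # Two devices attached to the same VM, each mounted by its own tag.
    for tag, where in [("rootfs", "/mnt/root"), ("scratch", "/mnt/scratch")]:
        print(" ".join(mount_command(tag, where)))
```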
The size of the DAX Window can be configured depending on available VM address space and memory mapping requirements. Best performance is achieved when file contents are fully mapped, eliminating the need for communication with virtiofsd for file I/O. A small DAX Window can be used but incurs more memory mapping setup/removal overhead. DAX can be disabled completely, resulting in operation similar to remote file systems where every operation requires communication.
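The trade-off between window size and mapping overhead can be sketched with a toy model. The class below treats the DAX Window as a fixed number of equally sized slots and counts how often a mapping must be set up or removed as file ranges are accessed. The least-recently-used eviction policy, the 2 MiB range size, and the name `DaxWindowModel` are assumptions chosen for illustration; they are not claimed to match any real implementation.

```python
from collections import OrderedDict


class DaxWindowModel:
    """Toy model of a DAX Window holding fixed-size file range mappings."""

    def __init__(self, window_size: int, range_size: int = 2 * 1024 * 1024):
        self.range_size = range_size
        self.slots = window_size // range_size
        self.mapped = OrderedDict()  # (inode, range index), oldest first
        self.setups = 0    # mapping setup requests sent to the daemon
        self.removals = 0  # mapping removal requests sent to the daemon

    def access(self, inode: int, offset: int) -> None:
        if self.slots == 0:
            raise ValueError("window too small to hold a single range")
        key = (inode, offset // self.range_size)
        if key in self.mapped:
            # Already mapped: no communication with the daemon is needed.
            self.mapped.move_to_end(key)
            return
        if len(self.mapped) >= self.slots:
            # Window is full: drop the least recently used mapping.
            self.mapped.popitem(last=False)
            self.removals += 1
        self.mapped[key] = None
        self.setups += 1


def scan_twice(window_ranges: int, file_ranges: int = 8) -> tuple[int, int]:
    """Scan a file sequentially twice; return (setups, removals)."""
    size = 2 * 1024 * 1024
    model = DaxWindowModel(window_ranges * size, size)
    for _ in range(2):
        for index in range(file_ranges):
            model.access(inode=1, offset=index * size)
    return model.setups, model.removals
```

In this model a window large enough for the whole file needs one setup per range and nothing afterwards, while a window half that size has to set up a mapping on every access of the second pass as well.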
The virtiofs design is not specific to a hypervisor or file system daemon implementation. It is possible to create alternative implementations based on the virtiofs VIRTIO device specification.
The hypervisor must implement the virtiofs VIRTIO device and typically delegates the file system operations to a vhost-user device backend. It is also possible to forgo vhost-user and emulate the virtiofs device directly inside the hypervisor, although this may result in poor isolation and security.
The vhost-user device backend implements the bulk of virtiofs. It is possible to use the vhost-user device backend with any hypervisor that supports the vhost-user interface.