Using the NVMe server with a Linux guest VM
===========================================

This tutorial will show you how the L4Re NVMe driver can be used to make an
NVMe namespace or a GPT partition on an NVMe namespace accessible in the form
of a VIRTIO block device to a Linux VM.

We will use our knowledge from the previous tutorials, namely the
:doc:`linuxvm` tutorial and, to a degree, the :doc:`hwpassthrough` tutorial.
You are kindly reminded to get familiar with the concepts and procedures
presented in them before moving on with this one.

As NVMe technology is bound to systems with PCIe, we will be using the AMD64
platform (also known as x86_64), where PCIe is most common. This means that
certain steps from the above tutorials do not apply verbatim, but need to be
modified for the AMD64/x86_64 use case. In most situations we will leave this
as an exercise to the reader, but we will bring your attention to important
differences in configuration.

The Big Picture
---------------

In the following we will run a simple Linux VM in *uvmm*. The Linux VM will be
provided with a VIRTIO block device by the L4Re NVMe server. The VIRTIO device
will be backed by an NVMe namespace. The VM will be able to mount the VIRTIO
block device and use it just as any other block device.

The whole scenario will run in QEMU/KVM for the sake of a unified user
experience and reusability of this tutorial.

Building everything
-------------------

You first need to check out and build the sources for the following components
as described in the :doc:`linuxvm` tutorial:

- Fiasco
- L4Re
- Linux
- Busybox

Without going into the details described in the previous tutorials, the
building is done like this:

::

   [somedir] $ cd fiasco
   [somedir/fiasco] $ make B=../build-fiasco-x86_64
   [somedir/fiasco] $ cd ../build-fiasco-x86_64
   [somedir/build-fiasco-x86_64] $ make config   # save defaults for x86_64
   [somedir/build-fiasco-x86_64] $ make -j 6
   [somedir/build-fiasco-x86_64] $ cd ../l4
   [somedir/l4] $ make B=../build-x86_64
   [somedir/l4] $ cd ../build-x86_64
   [somedir/build-x86_64] $ make config          # save defaults for x86_64
   [somedir/build-x86_64] $ make -j 6
   [somedir/build-x86_64] $ cd ../linux
   [somedir/linux] $ mkdir ../build-linux-x86_64
   [somedir/linux] $ make O=../build-linux-x86_64 defconfig
   [somedir/linux] $ make O=../build-linux-x86_64 -j 6
   [somedir/linux] $ cd ../busybox
   [somedir/busybox] $ mkdir ../build-busybox-x86_64
   [somedir/busybox] $ make O=../build-busybox-x86_64 defconfig
   [somedir/busybox] $ make O=../build-busybox-x86_64 menuconfig   # select "Build static binary (no shared libs)"
   [somedir/busybox] $ cd ../build-busybox-x86_64
   [somedir/build-busybox-x86_64] $ make install -j 6

In this case we are assuming that both the host and the target architecture
are x86_64.

Preparing the root file system
------------------------------

In this step you will make changes to the installed Busybox file system that
will later allow you to see mounted file systems in the running Linux VM and
list the discovered VIRTIO block devices.

::

   [somedir/build-busybox-x86_64] $ mkdir _install/proc
   [somedir/build-busybox-x86_64] $ mkdir _install/etc
   [somedir/build-busybox-x86_64] $ cat >_install/etc/fstab <<EOF
   none  /proc  proc  defaults  0  0
   EOF
   [somedir/build-busybox-x86_64] $ mkdir -p ../ramdisk
   [somedir/build-busybox-x86_64] $ cd _install
   [somedir/build-busybox-x86_64/_install] $ find . | cpio -H newc -o | gzip > ../../ramdisk/ramdisk-x86_64.cpio.gz
   [somedir/build-busybox-x86_64/_install] $ cd ../..
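If you want to verify that the additions actually ended up in the archive, you
can list its contents. This is an optional sanity check using only standard
*zcat* and *cpio*; it should print at least *./etc/fstab* and *./proc*:

::

   [somedir] $ zcat ramdisk/ramdisk-x86_64.cpio.gz | cpio -t | grep -E 'fstab|proc'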
Creating an empty VIRTIO block image
------------------------------------

Create an empty 10 MB file to hold an NVMe namespace image for QEMU:

::

   [somedir] $ mkdir images
   [somedir] $ dd if=/dev/zero of=images/disk.img bs=1M count=10

You can tweak the *count* argument to change the size of the image in MB
according to your liking.

Configuring L4Re
----------------

Now it is time to create the configuration files for your scenario. First of
all, create a directory that will contain all configuration:

::

   [somedir] $ mkdir conf

We will now proceed to create two configuration files in this directory:

- a vbus configuration file for passing the NVMe hardware to the NVMe server
- a ned script to start the scenario

Vbus configuration file
~~~~~~~~~~~~~~~~~~~~~~~

Create *somedir/conf/nvme.vbus* like this:

.. code:: lua

   Io.add_vbusses
   {
     nvme = Io.Vi.System_bus(function ()
       Property.num_msis = 2;
       PCI0 = Io.Vi.PCI_bus(function ()
         pci_hd = wrap(Io.system_bus():match("PCI/storage"));
       end);
     end);
   };

When processed by *io*, this configuration will create a vbus capability
called *nvme* and, for the sake of simplicity, make all PCI storage devices
found on the host system accessible via this capability. Note that this
assignment is a little too generous (it would match all non-NVMe storage
devices too), but it will do for your scenario; keep in mind that it is
possible to match the real hardware exactly.

In addition, this configuration specifies that the *nvme* vbus will support up
to 2 MSI interrupts. One is for the NVMe admin queue pair and one is for the
I/O queue pair of each active NVMe namespace. In general you want this limit
to be large enough to accommodate all your NVMe namespaces with connected
clients plus the admin queue pair. If MSI interrupts are not configured or not
available, the NVMe server will fall back to using legacy interrupts.

Writing the ned script
~~~~~~~~~~~~~~~~~~~~~~

Create *somedir/conf/uvmm-nvme.ned* as follows:

.. code:: lua

   package.path = "rom/?.lua";

   local L4 = require "L4";
   local vmm = require "vmm";

   local l = L4.default_loader;

   local nvmecl = l:new_channel();

   local io_busses =
   {
     nvme = 1,
   }

   vmm.start_io(io_busses, "rom/nvme.vbus -v");

   l:start(
     {
       caps = {
         vbus = io_busses.nvme,
         cl1 = nvmecl:svr()
       },
       log = { "nvme", "g" },
     },
     "rom/nvme-drv" ..
     " -v" ..
     " --client cl1" ..
     " --device MYNVMECTLSN:n1" ..
     " --ds-max 5"
   );

   vmm.start_vm({
     id = 1,
     mem = 128,
     mon = false,
     rd = "rom/ramdisk-x86_64.cpio.gz",
     fdt = "rom/virt-pc.dtb",
     bootargs = "console=ttyS0 earlyprintk=1 rdinit=/bin/sh",
     kernel = "rom/bzImage",
     log = L4.Env.log,
     ext_caps = {
       qdrv = nvmecl,
     },
   });

The above ned script first starts *io* via the *vmm* convenience wrapper and
passes it the vbus configuration file *nvme.vbus*. This instructs *io* to
create a vbus called *nvme* according to the configuration file.

It then starts the NVMe server, which goes by the name *nvme-drv*. Its *vbus*
capability is the client capability for the *nvme* vbus as created by *io*.
*cl1* is a server capability for the NVMe server's client that will eventually
be handed over to *uvmm* for use by the Linux VM. The client is assigned a
VIRTIO block device identified by the controller's serial number (in our case
this is *MYNVMECTLSN*) and the number of the NVMe namespace, in this case
namespace 1. If you run this tutorial on bare metal, make sure to change the
serial number to match your hardware.

Note that the *device* command line argument can instead identify a GPT
partition by its GUID or label. In that case the client will be able to access
only that particular partition.
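As a rough sketch of that variant, the *nvme-drv* start from the ned script
above could pass a partition identifier instead of a namespace. The GUID used
below is a made-up placeholder (our empty image does not carry a GPT yet);
substitute the GUID or label of a partition that actually exists on your
namespace:

.. code:: lua

   -- Sketch only: export a single GPT partition instead of a whole namespace.
   -- The GUID below is a placeholder; a partition label works as well.
   l:start(
     {
       caps = {
         vbus = io_busses.nvme,
         cl1 = nvmecl:svr()
       },
       log = { "nvme", "g" },
     },
     "rom/nvme-drv" ..
     " -v" ..
     " --client cl1" ..
     " --device 12345678-ABCD-EF01-2345-6789ABCDEF01" ..  -- partition GUID (placeholder)
     " --ds-max 5"
   );

Apart from the *device* argument, the rest of the nvme-drv start and the ned
script stays the same.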
Finally, *uvmm* is started using the *vmm* convenience wrapper. In contrast to
the previous tutorials (which were for the aarch64 architecture), the
arguments are tailored for running on x86_64. This shows in the device tree
used (*virt-pc.dtb*) and the Linux image name (*bzImage*). Also note that the
*-i* command line argument is not used here.

The *virt-pc.dtb* device tree contains a node that allows a VIRTIO block
device to be assigned to the VM when a capability named *qdrv* is defined. We
define it using the *nvmecl* capability, which is the client capability for
the NVMe server client.

Specifying boot modules
~~~~~~~~~~~~~~~~~~~~~~~

Add an entry to *somedir/l4/conf/modules.list* with the following content in
order to inform the build system and the boot loader of all the files that
form your scenario:

::

   entry uvmm-nvme
   kernel fiasco -serial_esc
   roottask moe rom/uvmm-nvme.ned
   module uvmm
   module io
   module nvme-drv
   module nvme.vbus
   module l4re
   module ned
   module virt-pc.dtb
   module ramdisk-x86_64.cpio.gz
   module[shell] echo $SRC_BASE_ABS/pkg/uvmm/configs/vmm.lua
   module uvmm-nvme.ned
   module[uncompress] bzImage

Mind the device tree and Linux boot image names specific to x86_64.

Creating Makeconf.boot
~~~~~~~~~~~~~~~~~~~~~~

You'll now need to configure the build system. Go to *somedir/l4/conf* and
copy *Makeconf.boot.example* to *Makeconf.boot* (assuming it does not exist
yet). Then edit *Makeconf.boot* and make sure that:

- somedir/build-fiasco-x86_64
- somedir/ramdisk
- somedir/conf
- somedir/build-linux-x86_64/arch/x86/boot/

are all included in the definition of *MODULE_SEARCH_PATH*. The individual
paths need to be absolute and separated by colons. An example definition of
*MODULE_SEARCH_PATH* might look as follows. Just make sure to change *SOMEDIR*
according to your environment:

::

   SOMEDIR = /absolute/path/to/your/somedir
   MODULE_SEARCH_PATH = $(SOMEDIR)/build-fiasco-x86_64:$\
                        $(SOMEDIR)/ramdisk:$(SOMEDIR)/conf:$(SOMEDIR)/build-linux-x86_64/arch/x86/boot

At the same time, make sure that *QEMU_OPTIONS* for amd64 are defined in the
following fashion:

::

   QEMU_OPTIONS-amd64  = -cpu host -enable-kvm -m 512 -display none
   QEMU_OPTIONS-amd64 += -drive if=none,file=$(SOMEDIR)/images/disk.img,format=raw,id=D1 \
                         -device nvme,drive=D1,serial=MYNVMECTLSN

Running the scenario
~~~~~~~~~~~~~~~~~~~~

If the above steps were performed correctly, you should be able to start the
scenario like this:

::

   [somedir] $ cd build-x86_64
   [somedir/build-x86_64] $ make E=uvmm-nvme qemu

After a short while the Linux VM will have booted into its interactive shell.
Let us first explore the output produced by the NVMe server, though. It will
be interleaved with output from other components, but when collected it should
look like this:

::

   nvme | NVMe[]: NVMe driver says hello.
   nvme | NVMe[]: Starting device discovery.
   nvme | NVMe[]: ICU info: features=80000000 #Irqs=24, #MSIs=2
   nvme | NVMe[]: MSI info: vector=0x80000000 addr=fee00000, data=4022
   nvme | NVMe[]: Device: interrupt : 80000000 trigger: 1, polarity: 0
   nvme | NVMe[]: All devices scanned.
   nvme | Serial Number: MYNVMECTLSN
   nvme | Model Number: QEMU NVMe Ctrl
   nvme | Firmware Revision: 1.0
   nvme | Controller ID: 0
   nvme | SGL Support: yes
   nvme | Number of Namespaces: 256
   nvme | NVMe[]: MSI info: vector=0x80000001 addr=fee00000, data=4023
   nvme | Making NSID 1 visible to clients
   nvme | libblock[]: No GUID partition header found.
   nvme | NVMe[]: Capability 'svr' not found. No dynamic clients accepted.
   nvme | Registering dataspace from 0x0 with 131072 KiB, offset 0x0
   nvme | PORT[0x23090]: DMA guest [0-7ffffff] local [1200000-91fffff] offset 0
   nvme | register client: host IRQ: 42a
   nvme | Resetting device
   nvme | shm=[0-7ffffff] local=[1200000-91fffff] desc=[1f20000-1f21000] (0x3120000-0x3121000)
   nvme | shm=[0-7ffffff] local=[1200000-91fffff] avail=[1f21000-1f21206] (0x3121000-0x3121206)
   nvme | shm=[0-7ffffff] local=[1200000-91fffff] used=[1f21240-1f21a46] (0x3121240-0x3121a46)
   nvme | VQ[0x230f0]: num=256 d:0x3120000 a:0x3121000 u:0x3121240

This suggests that the NVMe server discovered the NVMe controller emulated by
QEMU. Note that the serial number matches the one passed to QEMU on its
command line. Also note that the server is using MSI interrupts. The
controller reports 256 NVMe namespaces, but only namespace 1 is used; it is
the one backed by the disk image passed to QEMU. The disk image we prepared
has no GPT partitions, which is in line with what the NVMe server reports.
Finally, we see that a client registered with the server and shared VIRTIO
queue and buffer memory with it.

When we now look at the Linux VM, we find that it has detected the VIRTIO
block device which corresponds to namespace 1 on the NVMe controller:

::

   ~ # dmesg | grep virtio_blk
   [    0.606814] virtio_blk virtio1: 1/0/0 default/read/poll queues
   [    0.776591] virtio_blk virtio1: [vda] 20480 512-byte logical blocks (10.5 MB/10.0 MiB)

Linux will designate this VIRTIO block device as */dev/vda*:

::

   ~ # mount -a
   ~ # mount
   rootfs on / type rootfs (rw,size=45176k,nr_inodes=11294)
   none on /proc type proc (rw,relatime)
   devtmpfs on /dev type devtmpfs (rw,nosuid,relatime,size=45200k,nr_inodes=11300,mode=755)
   ~ # ls /dev/vda
   /dev/vda

You can now create a file system on this device:

::

   ~ # mkfs.ext2 /dev/vda
   Filesystem label=
   OS type: Linux
   Block size=1024 (log=0)
   Fragment size=1024 (log=0)
   2560 inodes, 10240 blocks
   512 blocks (5%) reserved for the super user
   First data block=1
   Maximum filesystem blocks=262144
   2 block groups
   8192 blocks per group, 8192 fragments per group
   1280 inodes per group
   Superblock backups stored on blocks:
           8193

And mount it:

::

   ~ # mkdir /mnt
   ~ # mount /dev/vda /mnt
   ~ # mount
   rootfs on / type rootfs (rw,size=45176k,nr_inodes=11294)
   none on /proc type proc (rw,relatime)
   devtmpfs on /dev type devtmpfs (rw,nosuid,relatime,size=45200k,nr_inodes=11300,mode=755)
   /dev/vda on /mnt type ext2 (rw,relatime)

Conclusion
----------

This tutorial has taught you how to configure an L4Re scenario that exports an
NVMe namespace (or a GPT partition on it) as a VIRTIO block device to a Linux
VM. The Linux VM sees this device as */dev/vda* and can use it just like any
other block device.