Passing lots of PCIe devices to a KVM guest

I was experimenting to see if there are any practical limits to passing a large number (>64) of PCIe devices to a QEMU/KVM guest in Ubuntu 18.04. My understanding is that I should be able to approach 256 (minus slots used by emulated devices), but I wanted to see whether I’d hit a practical limit before that. Ultimately I was able to demonstrate 160 passthrough devices to a single guest w/o hitting a hard limitation.

Host Setup

My system is an NVIDIA DGX-2 with 10 Mellanox ConnectX-5 controllers. It is running Ubuntu 18.04 (4.15 GA kernel) and version 4.6 of the Mellanox OFED drivers. (I don’t know that the OFED drivers are necessary, but I had to install the OFED stack to configure firmware VFs, so I left them in place.) I’m passing the following arguments to the kernel to configure the IOMMU:

intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1 iommu=pt
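
If you need these to persist across reboots, the usual Ubuntu route is GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by an update-grub and a reboot. Roughly:

# /etc/default/grub (merge with any options already there)
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1 iommu=pt"

$ sudo update-grub
$ sudo reboot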

Configuring VFs in Mellanox Firmware

By default, the ConnectX-5 adapters in this system had VFs disabled in firmware. I enabled 16 VFs per device using the mstconfig tool. The following shell snippet is a quick hack to do this for every (non-virtual) Mellanox device in the system:

$ sudo mst start
$ lspci | grep Mellanox | grep -v "Virtual Function" | cut -d' ' -f1 | \
while read bsf; do \
  sudo mstconfig -y -d $bsf set SRIOV_EN=1 NUM_OF_VFS=16; \
done
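
To double check what the firmware will apply after the next reset, mstconfig also has a query mode; something like this (86:00.0 being the PF behind enp134s0f0) should show the SRIOV settings:

$ sudo mstconfig -d 86:00.0 query | grep -E "SRIOV_EN|NUM_OF_VFS"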

Instantiating the VFs in Linux

Once the firmware was updated – and I’d rebooted so that it took effect – I needed to create the VF devices in Linux. In order to make these devices persistent across reboots, I installed the sysfsutils package and wrote a script to generate a config file for it:

$ cat gensysfscfg.sh
#!/bin/sh

# Emit a sysfsutils config line requesting 8 VFs for each Mellanox PF.
for path in /sys/class/net/enp*; do
    # Only consider interfaces bound to mlx5_core.
    if [ "$(basename $(readlink $path/device/driver))" != "mlx5_core" ]; then
        continue
    fi
    # Skip anything that isn't a PF (only the PFs expose mlx5_num_vfs).
    if [ ! -f $path/device/mlx5_num_vfs ]; then
        continue
    fi

    if [ ! -f $path/device/sriov_numvfs ]; then
        echo "Error: $path/device/sriov_numvfs does not exist" 1>&2
        continue
    fi

    echo "$(echo $path | cut -d/ -f3-)/device/sriov_numvfs = 8"
done
$ ./gensysfscfg.sh | sudo tee /etc/sysfs.d/mlnx-vfs.conf
class/net/enp134s0f0/device/sriov_numvfs = 8
class/net/enp134s0f1/device/sriov_numvfs = 8
class/net/enp184s0/device/sriov_numvfs = 8
class/net/enp189s0/device/sriov_numvfs = 8
class/net/enp225s0/device/sriov_numvfs = 8
class/net/enp230s0/device/sriov_numvfs = 8
class/net/enp53s0/device/sriov_numvfs = 8
class/net/enp58s0/device/sriov_numvfs = 8
class/net/enp88s0/device/sriov_numvfs = 8
class/net/enp93s0/device/sriov_numvfs = 8
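
(These are just ordinary sysfs attributes, so for a one-off experiment you can skip sysfsutils entirely and write the value directly, e.g. for a single PF:)

$ echo 8 | sudo tee /sys/class/net/enp134s0f0/device/sriov_numvfs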

To start, I only used 8 of the 16 VFs configured in firmware. I then restarted sysfsutils for the new settings to take effect:

$ sudo service sysfsutils restart

That caused a flurry of kernel messages, and a pile of new VF network devices appeared under /sys/class/net.
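
A quick way to confirm they all showed up on the host; with 10 PFs at 8 VFs each, this should come to 80:

$ lspci | grep Mellanox | grep -c "Virtual Function"
80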

Creating the Guest

I used the uvtool package to create an initial working guest and configured it from there. Note that uvtool automatically copies in your ssh public key so that you can ssh into the guest, so be sure to generate one ahead of time. Sorry for the lack of command output here; I’m just going by memory:

$ sudo apt install uvtool
$ uvt-simplestreams-libvirt --verbose sync arch=amd64 release=bionic
$ ssh-keygen
$ uvt-kvm create test
$ virsh console test
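
(Also from memory: uvt-kvm can wait for the guest to finish coming up and then ssh straight in, which saves watching the console.)

$ uvt-kvm wait test
$ uvt-kvm ssh test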

Because I wanted to use a UEFI guest, I also configured that (but this shouldn’t be necessary):

$ virsh shutdown test
$ sudo apt install ovmf
$ virsh edit test

I then added the loader & nvram elements as shown below:

  <os>
    <type arch='x86_64' machine='pc-i440fx-bionic'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/test_VARS.fd</nvram>
    <boot dev='hd'/>
  </os>

EDIT: Previously there was a “cp /usr/share/OVMF/OVMF_VARS.fd /var/lib/libvirt/qemu/nvram/test_VARS.fd” command included above, but it was pointed out to me that this is not necessary – libvirt will automagically copy the template for you.
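
Once the guest is running again, a quick way to double check that it really did boot via OVMF is to look for the EFI sysfs tree inside it:

$ uvt-kvm ssh test ls -d /sys/firmware/efi
/sys/firmware/efi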

Finally, let’s pass through the devices. I created the following script to find all of the Mellanox virtual functions and generate <hostdev> entries for them:

#!/bin/sh

# Print a libvirt <hostdev> passthrough stanza for the given PCI address
# components (domain, bus, slot, function).
emit_snippet() {
    domain="$1"
    bus="$2"
    slot="$3"
    function="$4"

    echo "    <hostdev mode='subsystem' type='pci' managed='yes'>"
    echo "      <source>"
    echo "        <address domain='$domain' bus='$bus' slot='$slot' function='$function'/>"
    echo "      </source>"
    echo "    </hostdev>"
}

# Walk the mlx5 netdevs and keep only the VFs (the PFs are the ones that
# expose mlx5_num_vfs), then emit a <hostdev> stanza for each VF's PCI address.
for path in /sys/class/net/enp*; do
    if [ "$(basename $(readlink $path/device/driver))" != "mlx5_core" ]; then
        continue
    fi
    if [ -f $path/device/mlx5_num_vfs ]; then
        continue
    fi
    dev="$(basename $path)"
    bsf="$(basename $(readlink /sys/class/net/$dev/device))"
    domain="$(echo $bsf | cut -d: -f1)"
    bus="$(echo $bsf | cut -d: -f2)"
    slotfunction="$(echo $bsf | cut -d: -f3)"
    slot="$(echo $slotfunction | cut -d. -f1)"
    function="$(echo $slotfunction | cut -d. -f2)"

    emit_snippet 0x${domain} \
                 0x${bus} \
                 0x${slot} \
                 0x${function}
done
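
For illustration, a VF at a (hypothetical) host address of 0000:86:00.2 comes out of emit_snippet looking like this:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x86' slot='0x00' function='0x2'/>
      </source>
    </hostdev>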

I inserted the output of that into the <devices> section of my guest XML.

$ virsh edit test
Domain test XML configuration edited.

## Insert the new <hostdev> lines in <devices>

$ sudo virsh define /etc/libvirt/qemu/test.xml
Domain test defined from /etc/libvirt/qemu/test.xml
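
Before booting, a quick count of the <hostdev> elements is a cheap sanity check; with 10 PFs at 8 VFs each it should come to 80:

$ virsh dumpxml test | grep -c "<hostdev"
80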

I started up the guest, which took on the order of 10 seconds:

$ virsh start test; virsh console test

Watching the console, everything seemed to be going OK until:

[   20.145675] mlx5_core 0000:01:04.0: firmware version: 16.25.1020
[   20.152070] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[   20.765671] mlx5_core 0000:01:05.0: firmware version: 16.25.1020
[   21.373804] mlx5_core 0000:01:06.0: firmware version: 16.25.1020
[   21.974761] mlx5_core 0000:01:07.0: firmware version: 16.25.1020
[   22.591561] mlx5_core 0000:01:08.0: firmware version: 16.25.1020
[   23.203812] mlx5_core 0000:01:09.0: firmware version: 16.25.1020
[   23.816257] mlx5_core 0000:01:0a.0: firmware version: 16.25.1020
[   24.424466] mlx5_core 0000:01:0b.0: firmware version: 16.25.1020
[   25.048853] mlx5_core 0000:01:0c.0: firmware version: 16.25.1020
[   25.667828] mlx5_core 0000:01:0d.0: firmware version: 16.25.1020
[   26.274705] mlx5_core 0000:01:0e.0: firmware version: 16.25.1020
[   26.857318] Interrupt reservation exceeds available resources
[   26.894088] mlx5_core 0000:01:0f.0: firmware version: 16.25.1020
[   27.490493] mlx5_core 0000:01:0f.0: mlx5_start_eqs:733:(pid 161): failed to create async EQ -28
[   27.504074] mlx5_core 0000:01:0f.0: Failed to start pages and async EQs
[   27.930859] mlx5_core 0000:01:0f.0: mlx5_load_one failed with error code -28
[   27.936661] mlx5_core: probe of 0000:01:0f.0 failed with error -28
[   27.938305] mlx5_core 0000:01:10.0: firmware version: 16.25.1020
[   28.531192] mlx5_core 0000:01:10.0: mlx5_start_eqs:733:(pid 161): failed to create async EQ -28

The guest completed boot, but not all of the devices were available:

$ uvt-kvm ssh test ls /sys/class/net | wc -l
39

Looking at the guest’s /proc/interrupts file, the problem appeared to be that we only had 1 vcpu and all of its interrupt vectors were already in use. So, I bumped up the number of guest vcpus in the XML to 4.
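
If you haven’t tweaked this before, it’s just the vcpu element in the domain XML, e.g.:

  <vcpu placement='static'>4</vcpu>

With 4 vcpus, things looked better until: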

[   34.852454] mlx5_core 0000:01:1b.0: firmware version: 16.25.1020
[   35.489797] mlx5_core 0000:01:1c.0: firmware version: 16.25.1020
[   36.130494] mlx5_core 0000:01:1d.0: firmware version: 16.25.1020
[   36.252079] systemd-udevd invoked oom-killer: gfp_mask=0x14002c0(GFP_KERNEL|__GFP_NOWARN), nodemask=(null), order=0, oom_score_adj=0
[   36.257492] systemd-udevd cpuset=/ mems_allowed=0
[   36.259687] CPU: 0 PID: 166 Comm: systemd-udevd Not tainted 4.15.0-66-generic #75-Ubuntu
[   36.263015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[   36.266320] Call Trace:
[   36.267390]  dump_stack+0x63/0x8e
[   36.268635]  dump_header+0x71/0x285

Yeah, uvtool’s default of 512M of memory isn’t nearly enough for this, so I bumped the guest XML up to 8G:

  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>

After that, things booted fine. I took a look inside the guest. Do we see all the interfaces?

ubuntu@test:~$ lspci | grep Mellanox | wc -l
80
ubuntu@test:~$ ls -d /sys/class/net/en* | wc -l
81

That’s all of them, plus the built-in interface. Only 2 physical NICs are wired up, so we should see link on 2×8 VFs plus the built-in, i.e. 17 in total:

ubuntu@test:/sys/class/net$ for iface in en*; do \
  sudo ip link set dev $iface up; \
  sudo ethtool $iface | grep "Link detected: yes"; \
done | wc -l
17

Can we do more?

I went back and updated the sysfsutils config to expose all 16 VFs per device. And yes, we can do more:

$ lspci | grep Mellanox | wc -l
160
$ ls -d /sys/class/net/en* | wc -l
161

Note that I did have to bump the guest vcpu count above 4 (I chose 16). In theory, we could go even further by raising the firmware VF counts above 16.
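
In case it helps anyone reproduce this, the 16-VF change is essentially a rerun of the earlier steps with a bigger number. One sketch (note that sriov_numvfs can’t go straight from one non-zero value to another, so either zero it out first or just reboot):

$ ./gensysfscfg.sh | sed 's/= 8$/= 16/' | sudo tee /etc/sysfs.d/mlnx-vfs.conf
$ sudo reboot
## then regenerate the <hostdev> snippets, virsh edit them into the guest,
## and raise <vcpu> in the guest XML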