Passing lots of PCIe devices to a KVM guest

I was experimenting to see if there are any practical limits to passing a large number (>64) of PCIe devices to a QEMU/KVM guest in Ubuntu 18.04. My understanding is that I should be able to approach 256 (minus slots used by emulated devices), but I wanted to see if there were any practical limitations. Ultimately I was able to demonstrate 160 passthrough devices to a single guest without hitting a hard limitation.

Host Setup

My system is an NVIDIA DGX-2 system with 10 Mellanox ConnectX-5 controllers. It is running Ubuntu 18.04 (4.15 GA kernel) and version 4.6 of the Mellanox OFED drivers. (I don’t know that the OFED drivers are necessary, but I had to install the OFED stack to configure firmware VFs, so I left them.) I’m passing the following arguments to the kernel to configure the IOMMU:

intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1 iommu=pt
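One way to confirm the arguments took effect after a reboot is to look for them in /proc/cmdline. The following is only a minimal sketch; has_kernel_arg is a hypothetical helper name of mine, not part of any tool:

```shell
#!/bin/sh
# has_kernel_arg CMDLINE ARG: succeed if ARG appears as a whole
# space-separated word in CMDLINE.
has_kernel_arg() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use on the host (not run here):
#   for arg in intel_iommu=on iommu=pt; do
#       has_kernel_arg "$(cat /proc/cmdline)" "$arg" || echo "missing: $arg"
#   done
```

Matching whole words matters here, since a plain substring search for iommu=on would also match inside intel_iommu=on.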

Configuring VFs in Mellanox Firmware

By default, the ConnectX-5 adapters in this system had VFs disabled in firmware. I activated 16 VFs per device using the mlxconfig tool. The following shell code is a quick hack to do this for every (non-virtual) Mellanox device in the system:

$ sudo mst start
$ lspci | grep Mellanox | grep -v "Virtual Function" | cut -d' ' -f1 | \
while read bsf; do
  sudo mstconfig -y -d "$bsf" set SRIOV_EN=1 NUM_OF_VFS=16
done

Instantiating the VFs in Linux

Once the firmware was updated – and I’d rebooted so that it took effect – I needed to create the VF devices in Linux. In order to make these devices persistent across reboots, I installed the sysfsutils package and wrote a script to generate a config file for it:

$ cat gensysfscfg.sh
#!/bin/sh

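for path in /sys/class/net/enp*; do
    # Only consider interfaces bound to the mlx5_core driver.
    if [ "$(basename "$(readlink "$path/device/driver")")" != "mlx5_core" ]; then
        continue
    fi
    # Skip devices without mlx5_num_vfs (i.e. anything that is not a PF).
    if [ ! -f "$path/device/mlx5_num_vfs" ]; then
        continue
    fi

    if [ ! -f "$path/device/sriov_numvfs" ]; then
        echo "Error: $path/device/sriov_numvfs does not exist" 1>&2
        continue
    fi

    # Emit the path relative to /sys, which is what sysfsutils expects.
    echo "$(echo "$path" | cut -d/ -f3-)/device/sriov_numvfs = 8"
done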
$ ./gensysfscfg.sh | sudo tee /etc/sysfs.d/mlnx-vfs.conf
class/net/enp134s0f0/device/sriov_numvfs = 8
class/net/enp134s0f1/device/sriov_numvfs = 8
class/net/enp184s0/device/sriov_numvfs = 8
class/net/enp189s0/device/sriov_numvfs = 8
class/net/enp225s0/device/sriov_numvfs = 8
class/net/enp230s0/device/sriov_numvfs = 8
class/net/enp53s0/device/sriov_numvfs = 8
class/net/enp58s0/device/sriov_numvfs = 8
class/net/enp88s0/device/sriov_numvfs = 8
class/net/enp93s0/device/sriov_numvfs = 8

To start, I only used 8 of the 16 configured in firmware. I then restarted sysfsutils for it to take effect:

sudo service sysfsutils restart

That produced a lot of kernel messages, and new devices appeared under /sys/class/net.
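A quick sanity check at this point is to add up sriov_numvfs across all interfaces. This is a rough sketch under my own assumptions: count_vfs is a hypothetical helper, and the optional root argument exists only so it can be pointed at a fake tree for testing:

```shell
#!/bin/sh
# count_vfs [SYSFS_ROOT]: sum sriov_numvfs across all network devices.
# SYSFS_ROOT defaults to /sys.
count_vfs() {
    root="${1:-/sys}"
    total=0
    for f in "$root"/class/net/*/device/sriov_numvfs; do
        # Skip the unexpanded glob and interfaces without SR-IOV support.
        [ -f "$f" ] || continue
        total=$((total + $(cat "$f")))
    done
    echo "$total"
}
```

With the config file above, I'd expect count_vfs to report 10 × 8 = 80 on this system, assuming no other SR-IOV devices have VFs enabled.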

Creating the Guest

I used the uvtool package to create an initial working guest and configured it from there. Note that uvtool will copy in your ssh keys automatically so you can ssh into the guest, so be sure to generate those ahead of time. Sorry for the lack of command output here, I’m just going by memory:

$ sudo apt install uvtool
$ uvt-simplestreams-libvirt --verbose sync arch=amd64 release=bionic
$ ssh-keygen
$ uvt-kvm create test
$ virsh console test

Because I wanted to use a UEFI guest, I also configured that (but this shouldn’t be necessary):

$ virsh shutdown test
$ sudo apt install ovmf
$ virsh edit test

I then added the loader & nvram elements as shown below:

  <os>
    <type arch='x86_64' machine='pc-i440fx-bionic'>hvm</type>                   
    <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/test_VARS.fd</nvram>
    <boot dev='hd'/>
  </os>

EDIT: Previously there was a “cp /usr/share/OVMF/OVMF_VARS.fd /var/lib/libvirt/qemu/nvram/test_VARS.fd” command included above, but it was pointed out to me that this is not necessary – libvirt will automagically copy the template for you.

Finally, let’s pass through the devices. I created the following script to look for all Mellanox virtual functions and generate <hostdev> entries for them:

#!/bin/sh

# Print one libvirt <hostdev> element for the given PCI address.
emit_snippet() {
    domain="$1"
    bus="$2"
    slot="$3"
    function="$4"

    echo "    <hostdev mode='subsystem' type='pci' managed='yes'>"
    echo "      <source>"
    echo "        <address domain='$domain' bus='$bus' slot='$slot' function='$function'/>"
    echo "      </source>"
    echo "    </hostdev>"
}

for path in /sys/class/net/enp*; do
    if [ "$(basename $(readlink $path/device/driver))" != "mlx5_core" ]; then
	continue
    fi
    if [ -f $path/device/mlx5_num_vfs ]; then
	continue
    fi
    dev="$(basename $path)"
    bsf="$(basename $(readlink /sys/class/net/$dev/device))"
    domain="$(echo $bsf | cut -d: -f1)"
    bus="$(echo $bsf | cut -d: -f2)"
    slotfunction="$(echo $bsf | cut -d: -f3)"
    slot="$(echo $slotfunction | cut -d. -f1)"
    function="$(echo $slotfunction | cut -d. -f2)"

    emit_snippet 0x${domain} \
                 0x${bus} \
                 0x${slot} \
                 0x${function}
done
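
As an aside, the address-splitting steps in the loop above can also be done with shell parameter expansion instead of repeated cut calls. This is only a sketch; split_bdf is a hypothetical helper name, not something the script above uses:

```shell
#!/bin/sh
# split_bdf BDF: print "domain bus slot function", each with a 0x prefix,
# for a PCI address such as 0000:86:00.1.
split_bdf() {
    bdf="$1"
    domain="${bdf%%:*}"    # text before the first ':'
    rest="${bdf#*:}"       # drop "domain:"
    bus="${rest%%:*}"      # text before the next ':'
    slotfn="${rest#*:}"    # "slot.function"
    echo "0x$domain 0x$bus 0x${slotfn%.*} 0x${slotfn#*.}"
}
```

For example, split_bdf 0000:86:00.1 prints 0x0000 0x86 0x00 0x1, which are the four values emit_snippet takes.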

I inserted the output of the generator script into the <devices> section of my guest XML.

$ virsh edit test
Domain test XML configuration edited.

## Insert the new <hostdev> lines in <devices>

$ sudo virsh define /etc/libvirt/qemu/test.xml
Domain test defined from /etc/libvirt/qemu/test.xml

I started up the guest, which took on the order of 10 seconds:

$ virsh start test; virsh console test

Watching the console, everything seemed to be going OK until:

[   20.145675] mlx5_core 0000:01:04.0: firmware version: 16.25.1020
[   20.152070] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[   20.765671] mlx5_core 0000:01:05.0: firmware version: 16.25.1020
[   21.373804] mlx5_core 0000:01:06.0: firmware version: 16.25.1020
[   21.974761] mlx5_core 0000:01:07.0: firmware version: 16.25.1020
[   22.591561] mlx5_core 0000:01:08.0: firmware version: 16.25.1020
[   23.203812] mlx5_core 0000:01:09.0: firmware version: 16.25.1020
[   23.816257] mlx5_core 0000:01:0a.0: firmware version: 16.25.1020
[   24.424466] mlx5_core 0000:01:0b.0: firmware version: 16.25.1020
[   25.048853] mlx5_core 0000:01:0c.0: firmware version: 16.25.1020
[   25.667828] mlx5_core 0000:01:0d.0: firmware version: 16.25.1020
[   26.274705] mlx5_core 0000:01:0e.0: firmware version: 16.25.1020
[   26.857318] Interrupt reservation exceeds available resources
[   26.894088] mlx5_core 0000:01:0f.0: firmware version: 16.25.1020
[   27.490493] mlx5_core 0000:01:0f.0: mlx5_start_eqs:733:(pid 161): failed to create async EQ -28
[   27.504074] mlx5_core 0000:01:0f.0: Failed to start pages and async EQs
[   27.930859] mlx5_core 0000:01:0f.0: mlx5_load_one failed with error code -28
[   27.936661] mlx5_core: probe of 0000:01:0f.0 failed with error -28
[   27.938305] mlx5_core 0000:01:10.0: firmware version: 16.25.1020
[   28.531192] mlx5_core 0000:01:10.0: mlx5_start_eqs:733:(pid 161): failed to create async EQ -28

The guest completed boot, but not all of the devices were available:

$ uvt-kvm ssh test ls /sys/class/net | wc -l
39

Looking at the guest’s /proc/interrupts file, the problem appeared to be that we only had 1 vcpu, and all of its interrupt vectors were in use. So, I bumped up the number of guest vcpus in the XML to 4, and things looked better until:

[   34.852454] mlx5_core 0000:01:1b.0: firmware version: 16.25.1020
[   35.489797] mlx5_core 0000:01:1c.0: firmware version: 16.25.1020
[   36.130494] mlx5_core 0000:01:1d.0: firmware version: 16.25.1020
[   36.252079] systemd-udevd invoked oom-killer: gfp_mask=0x14002c0(GFP_KERNEL|__GFP_NOWARN), nodemask=(null), order=0, oom_score_adj=0
[   36.257492] systemd-udevd cpuset=/ mems_allowed=0
[   36.259687] CPU: 0 PID: 166 Comm: systemd-udevd Not tainted 4.15.0-66-generic #75-Ubuntu
[   36.263015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[   36.266320] Call Trace:
[   36.267390]  dump_stack+0x63/0x8e
[   36.268635]  dump_header+0x71/0x285

Yeah, uvtool’s default of 512M of memory isn’t nearly enough for this, so I bumped the guest XML up to 8G:

  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
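
The KiB figure is just 8 GiB converted by hand. A tiny sketch of the arithmetic, in case you want a different size (gib_to_kib is a hypothetical helper name):

```shell
#!/bin/sh
# gib_to_kib N: convert N GiB to KiB (1 GiB = 1024 * 1024 KiB).
gib_to_kib() {
    echo $(( $1 * 1024 * 1024 ))
}

gib_to_kib 8   # prints 8388608
```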

After that, things booted fine. I took a look inside the guest. Do we see all the interfaces?

ubuntu@test:~$ lspci | grep Mellanox | wc -l
80
ubuntu@test:~$ ls -d /sys/class/net/en* | wc -l
81

That’s all of them, plus the built-in. Only 2 physical nics are wired up, so we should see links on 2×8 VFs, and of course the built-in:

ubuntu@test:/sys/class/net$ for iface in en*; do
  sudo ip link set dev $iface up
  sudo ethtool $iface | grep "Link detected: yes"
done | wc -l
17

Can we do more?

I went back and updated the sysfsutils config to expose all 16 VFs per device. And yes, we can do more:

$ lspci | grep Mellanox | wc -l
160
$ ls -d /sys/class/net/en* | wc -l
161

Note that I did have to bump up guest vcpus to > 4 (I chose 16). In theory, we could go even further by bumping up the firmware VF counts > 16.

Bike MS 2019

I’m riding Bike MS Colorado as part of Team Left Hand again this year. We’ll do 100 miles from Denver to Fort Collins, finishing over the Horsetooth Reservoir dam. The next day, we’ll do 76 miles to return to Denver. I’d appreciate your support – even just $5 would be awesome. Donate here.

Here’s what it looked like last year:

Deploying Ubuntu OpenStack to ARM64 servers

At Canonical, we’ve been doing work to make sure Ubuntu OpenStack deploys on ARM servers as easily as on x86. Whether you have Qualcomm 2400 REP boards, Cavium ThunderX boards, HiSilicon D05 boards, or other Ubuntu Certified server hardware, you can go from bare metal to a working OpenStack in minutes!

The following tutorial will walk you through building a simple Ubuntu OpenStack setup, highlighting any ARM-specific caveats along the way.

Note: very little here is actually ARM specific – you could just as easily follow this to set up an x86 OpenStack.

Juju and MAAS

Ubuntu OpenStack is deployed using MAAS and Juju. If you’re unfamiliar with these tools, let me give you a quick overview.

MAAS is a service that manages clusters of bare-metal servers in a manner similar to cloud instances. Using the web interface, or its API, you can ask MAAS to power on one or more servers and deploy an OS to them, ready for login. In this tutorial, we’ll be adding your ARM servers to a MAAS cluster, so that Juju can deploy and manage them via the MAAS API.

Juju is a workload orchestration tool. It takes definitions of workloads, called bundles, and realizes them in a given cloud environment. In this case, we’ll be deploying Ubuntu’s openstack-base bundle to your MAAS cloud environment.

Hardware Requirements

A minimal Ubuntu OpenStack setup on ARM comprises:

  • 5 ARM server nodes for your MAAS cluster. 4 of these will be used to run OpenStack services, the 5th will operate a Juju controller that manages the deployment.
    • Each system needs to have 2 disks (the second is for ceph storage).
    • Each system needs to have 2 network adapters. To keep this simple, it’s best if the network adapters are identically configured (same NICs, and if plug-in NICs are used, same slots).
    • Each node should be configured to PXE boot by default. If you have one of these systems, check out the “MAAS Notes” section on the Ubuntu wiki for tips:
  • 1 server to run the MAAS server
    • CPU architecture doesn’t matter.
    • Install this server with Ubuntu Server 16.04. A clean “Basic” installation is recommended.
    • >= 10GB of free disk space.
    • >= 2GB of RAM
  • 1 client system for you to use to execute juju and openstack client commands to initiate, monitor and test out the deployment.
    • Make sure this is a system that can run a web browser, so you can use it to view the Juju, MAAS and OpenStack GUIs.
    • Ubuntu 16.04 is recommended (that’s what we tested with).

Network Layout

Again for simplicity, this tutorial will assume that everything (both NICs of each ARM server, ARM server BMCs, MAAS server, client system and your OpenStack floating IPs) is on the same flat network (you’ll want more segregation in a production deployment). Cabling should look like the following figure:

We’re using a 10.228.66.0/24 network throughout this tutorial. MAAS will provide a DHCP server for this subnet, so be sure to deactivate any other DHCP servers to avoid interference.

Network Planning

Since all of your IPs will be sharing a single subnet, you should prepare a plan in advance for how you want to split up the IPs to avoid accidental overlap. For example, with our 10.228.66.0/24 network, you might allocate:

  • 10.228.66.1 – Gateway
  • 10.228.66.2 to 10.228.66.20 – Static IPs (MAAS Server, client system, ARM Server BMCs, etc.)
  • 10.228.66.21 to 10.228.66.50 – MAAS node IP pool (IPs MAAS is allowed to assign to your ARM Server nodes)
  • 10.228.66.51 to 10.228.66.254 – OpenStack floating IP pool (for your OpenStack instances)
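
If you want to double-check a plan like this for accidental overlap, the comparison is easy to script. The following is a minimal sketch; ip_to_int and ranges_overlap are hypothetical helper names of mine, and the ranges are treated as inclusive:

```shell
#!/bin/sh
# ip_to_int A.B.C.D: print the IPv4 address as a single integer.
# The body runs in a subshell so the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# ranges_overlap START1 END1 START2 END2: succeed if the two inclusive
# ranges share at least one address.
ranges_overlap() {
    s1=$(ip_to_int "$1"); e1=$(ip_to_int "$2")
    s2=$(ip_to_int "$3"); e2=$(ip_to_int "$4")
    [ "$s1" -le "$e2" ] && [ "$s2" -le "$e1" ]
}

# Example: the static range and the MAAS node pool from the plan above.
if ranges_overlap 10.228.66.2 10.228.66.20 10.228.66.21 10.228.66.50; then
    echo "overlap"
else
    echo "no overlap"
fi
```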

OK. Let’s get to it.

MAAS Server Installation

On the MAAS Server, run the following command sequence to install the latest version of MAAS:

sudo apt-add-repository ppa:maas/stable -y
sudo apt update
sudo apt install maas -y

Once MAAS is installed, run the following command to set up an admin username and password:

ubuntu@maas:~$ sudo maas createadmin
Username: ubuntu
Password: 
Again: 
Email: ubuntu@example.org
Import SSH keys [] (lp:user-id or gh:user-id): lp:<lpuserid>

Using a web browser from the client system, connect to the MAAS web interface. It is at http://<MAAS Server IP addr>/MAAS :

Log in with the admin credentials you just created. Under the image sources, select arm64 in “Architectures” and click “Update Selection”. Wait for the images to download and sync. After all images are synced, click “Continue”.

Import one or more ssh keys. You can paste them in, or easily import from Launchpad or GitHub:

 

After basic setup, go to the “Subnets” tab and click the subnet address:

Provide the correct “Gateway IP” and “DNS” address for your subnet:

Next, go to the “Subnets” tab and click the untagged VLAN:

Select “Provide dhcp” in the “Take action” pulldown:

Many ARM servers (other than X-Gene/X-Gene 2 systems and Cavium ThunderX CRBs) require the 16.04 HWE kernel, so we need to configure MAAS to use it by default. Go to the “Settings” tab and select “xenial (hwe-16.04)” as the Default Minimum Kernel Version for Commissioning, then click “Save”:

Enlisting and Commissioning Nodes

In order for MAAS to manage your ARM Server nodes, they need to be first enlisted into MAAS, then commissioned. To do so, power on the node, and allow it to PXE boot from the MAAS server. This should cause the node to appear with a randomly generated name on the “Nodes” page:

Click on the Node name, and select “Commission” in the “Take action” menu. This will begin a system inventory process after which the node’s status will become “Ready”.

Repeat for all other nodes.

Testing out MAAS

Before we deploy OpenStack, it’d be good to first demonstrate that your MAAS cluster is functioning properly.

From the “Nodes” page in the MAAS UI, select a node and choose the “Deploy” action in the “Take action” pulldown:

When status becomes “Deployed”, you can ssh into the node with username “ubuntu” and your ssh private key. You can find a node’s IP address by clicking the node’s hostname and looking at the Interfaces tab:

Now ssh to that node with username “ubuntu” and the ssh key you configured in MAAS earlier:

All good? OK – release the node back to the cluster via the MAAS UI, and let’s move on to deploying OpenStack!

Deploying OpenStack

Download the bundle .zip file from https://jujucharms.com/openstack-base/50 to your client system and extract it:

ubuntu@jujuclient$ sudo apt install unzip -y
ubuntu@jujuclient$ unzip openstack-base.zip

The following files will be extracted:

  • bundle.yaml: This file defines the modeling and placement of services across your OpenStack cluster
  • neutron-ext-net, neutron-tenant-net: scripts to help configure your OpenStack networks
  • novarc: script to set up your environment to use OpenStack

Next, install the Juju client from the snap store to your client system:

ubuntu@jujuclient$ sudo snap install juju --classic

Then, configure Juju to use your MAAS environment, as described here.

After configuring Juju to use your MAAS cluster, run the following command on the Juju client system to instantiate a Juju controller node:

ubuntu@jujuclient$ juju bootstrap maas-cloud maas \
--bootstrap-constraints arch=arm64

Where “maas-cloud” is the cloud name you assigned in the “Configure Juju” step. Juju will automatically select a node from the MAAS cluster to be the Juju controller, and deploy the node. You can monitor this progress via the MAAS web interface and the console of the bootstrap node.

Now, deploy the OpenStack bundle:

  • Locate the bundle.yaml file extracted from the openstack-base.zip archive.
  • Open the bundle.yaml file in a text editor, and locate the data-port setting for the neutron-gateway service.
  • Change the data-port setting as appropriate for the systems in your MAAS cluster. This should be the name of the connected NIC interface on your systems that is not configured by MAAS (see the diagram in the “Network Layout” section of this post). For example, if you have a cluster of systems like the one shown in the MAAS Interfaces tab screenshot above, you would want to set data-port to either br-ex:enp1s0 or br-ex:enP4p1s0 (enaqcom8070i0 is the one configured by MAAS). For more information, see the “Port Configuration” section in the neutron-gateway charm docs.
  • Execute:
ubuntu@jujuclient$ juju deploy bundle.yaml
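
If you’d rather script the data-port edit than open an editor, something like the following could be run before juju deploy. This is a sketch under assumptions: set_data_port is a hypothetical helper, it relies on GNU sed’s -i option, and it assumes the setting sits on a single line of the form “data-port: br-ex:<iface>” in bundle.yaml (check your copy first):

```shell
#!/bin/sh
# set_data_port FILE IFACE: rewrite every "data-port:" line in FILE so that
# it maps the br-ex bridge to IFACE, preserving the line's indentation.
set_data_port() {
    sed -i "s|^\( *data-port:\).*|\1 br-ex:$2|" "$1"
}

# Typical use (not run here):
#   set_data_port bundle.yaml enp1s0
```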

You can monitor the status of your deployment using the juju “status” command:

ubuntu@jujuclient$ juju status

Note: The deployment is complete once juju status reports all units, other than ntp, as “Unit is ready”. (The ntp charm has not yet been updated to report status, so ntp units will not report a “Unit is ready” message).
Note: You can also view a graphical representation of the deployment and its status using the juju gui web interface:

ubuntu@jujuclient$ juju gui
GUI 2.10.2 for model "admin/default" is enabled at:
 https://10.228.66.11:17070/gui/u/admin/default
Your login credential is:
 username: admin
 password: d954cc41130218e590c62075de0851df

Troubleshooting Deployment

If the neutron-gateway charm enters a “failed” state, it may be because you have entered an invalid interface for the data-port config setting. You can change this setting after deploying the bundle using the juju config command, and asking the unit to retry:

ubuntu@jujuclient$ juju config neutron-gateway data-port=br-ex:<iface>
ubuntu@jujuclient$ juju resolved neutron-gateway/0

If the device name is not consistent between hosts, you can specify the same bridge multiple times with MAC addresses instead of interface names. The charm will loop through the list and configure the first matching interface. To do so, specify a list of macs using a space delimiter as seen in the example below:

ubuntu@jujuclient$ juju config neutron-gateway data-port="br-ex:<MAC> br-ex:<MAC> br-ex:<MAC> br-ex:<MAC>"
ubuntu@jujuclient$ juju resolved neutron-gateway/0
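
With more than a couple of hosts, building that space-delimited value by hand gets error prone. Here is a minimal sketch of a helper that assembles it; build_data_port is a hypothetical name, and the MAC addresses in the usage comment are made up:

```shell
#!/bin/sh
# build_data_port BRIDGE MAC...: print "BRIDGE:MAC BRIDGE:MAC ..." as one
# space-delimited string.
build_data_port() {
    bridge="$1"; shift
    out=""
    for mac in "$@"; do
        # Add a separating space only after the first entry.
        out="${out:+$out }$bridge:$mac"
    done
    echo "$out"
}

# Typical use (not run here); note the quotes, so that the whole list is
# passed to juju as a single value:
#   juju config neutron-gateway \
#       data-port="$(build_data_port br-ex 52:54:00:00:00:01 52:54:00:00:00:02)"
```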

Testing it Out

See the “Ensure it’s working” section of the following document to complete a sample configuration and launch a test instance:
https://jujucharms.com/u/dannf/openstack-base

(^ This is a fork of the main charm docs w/ some corrections pending merge).

Finally, you can access the OpenStack web interface at:
http://<ip of openstack-dashboard>/horizon
To obtain the openstack-dashboard ip address, run:

ubuntu@jujuclient$ juju run --unit openstack-dashboard/0 'unit-get public-address'

Login as user ‘admin’ with password ‘openstack’

Many thanks to Sean Feole for helping draft this guide, and to Michael Reed & Ike Pan for testing it out 🙂

arm64 trusty images now work on GICv3 hosts

Ubuntu 14.04 originally shipped with a 3.13-based kernel, which didn’t yet support booting as a KVM guest on GICv3 systems. This meant you could only boot trusty instances on GICv2-based hosts. However, thanks to our Foundations team, Ubuntu 14.04 (‘trusty’)/arm64 images have switched to using the Ubuntu HWE kernel. This means you can now run Ubuntu 14.04 (‘trusty’) KVM guests on GICv3 ARM64 hosts, such as those based on Cavium ThunderX, HiSilicon Hip07 and Qualcomm Centriq.

Ubuntu cloud images are available here. See this page for info on how to boot cloud images directly in QEMU. Note: these images will also just work with the Newton release of OpenStack on Ubuntu – more on that later.