Configuring SR-IOV for Mellanox adapters

SR-IOV is a virtualization technique that allows a physical PCI-E device to expose multiple virtual functions. These virtual functions appear as normal PCI-E devices and can be passed through to virtual machines, giving the VMs direct hardware access while the underlying physical device remains shared.

In my case, I needed to enable SR-IOV on Mellanox InfiniBand adapters in order to build an HPC cluster on top of VMs. Having direct InfiniBand access inside the VMs is crucial for applications and libraries such as MPI: it lets them fully utilize hardware capabilities like RDMA, high bandwidth, and low latency, which a (para-)virtualized Ethernet adapter cannot offer.

In this post, I will walk through how to set up SR-IOV for Mellanox InfiniBand adapters. The same steps apply to Mellanox Ethernet adapters; in that case you can skip the subnet manager part below.

Overview

Let’s first get an overview of what needs to be done. Enabling SR-IOV is not only about touching the OFED driver; it also involves the BIOS, the adapter firmware, and the subnet manager. To be specific, we need to

  • Enable virtualization in the subnet manager
  • Enable I/O virtualization in BIOS and OS
  • Enable SR-IOV in adapter firmware
  • Create virtual functions with the driver

In my setup, I used a ConnectX-5 VPI card and UFM 6.1.0 as the subnet manager. However, the steps should work for the last few generations of Mellanox adapters and recent versions of opensm/UFM. Now let’s go through these steps one by one.

Enabling virtualization in the subnet manager

An InfiniBand network has a subnet manager (SM) that coordinates the entire subnet. We need to enable virtualization in the SM so that it recognizes messages from virtual adapters and can correctly add them to the subnet. Edit the SM configuration file and change the following setting to 2.

virt_enabled 2

For opensm the path should be /etc/opensm/opensm.conf and for UFM the path is /opt/ufm/files/conf/opensm/opensm.conf. Note that since MLNX OFED 4.4 and UFM 6.0, virt_enabled is set to 2 by default. If you changed this value, restart your SM for the change to take effect.
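For example, if opensm is managed as a systemd service (an assumption; adjust to however your SM actually runs), the change and restart could look like this:

$ sed -i 's/^virt_enabled .*/virt_enabled 2/' /etc/opensm/opensm.conf
$ grep virt_enabled /etc/opensm/opensm.conf
virt_enabled 2
$ systemctl restart opensm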

Enabling I/O virtualization in BIOS and OS

We must enable I/O virtualization in the BIOS; this is needed so that you can pass the device through to VMs. On Intel CPUs, this feature is called Intel VT-d; on AMD CPUs it’s called AMD-Vi. Go into your BIOS and enable the feature corresponding to your CPU.

We also need to add the IOMMU-related kernel boot parameters so the OS supports I/O virtualization. In my case I’m using an Intel CPU with grub as the bootloader, so I ran the following commands. For AMD CPUs, please consult the Linux kernel IOMMU documentation on what parameters to pass.

$ echo "intel_iommu=on iommu=pt" >> /etc/default/grub
$ grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
$ reboot

Once these are all done, you should see the following kernel message.

$ dmesg | grep IOMMU
[ 0.000000] DMAR: IOMMU enabled
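You can also double-check that the parameters actually made it onto the kernel command line:

$ tr ' ' '\n' < /proc/cmdline | grep iommu
intel_iommu=on
iommu=pt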

Enabling SR-IOV in the firmware

By default, SR-IOV is disabled in the Mellanox adapter firmware, and the feature is only exposed to the driver after it’s enabled at the firmware level. First, we start the Mellanox Software Tools (MST) driver so that we can configure the firmware. This creates device nodes under /dev/mst. If you have multiple Mellanox cards, make sure you know which one you want to configure by looking at the PCI-E bus ID (0000:3b:00.0 in this case).

$ mst start
$ mst status
/dev/mst/mt4119_pciconf0 - PCI configuration cycles access.
                           domain:bus:dev.fn=0000:3b:00.0 addr.reg=88 data.reg=92
                           Chip revision is: 00

For me the one I need to configure is /dev/mst/mt4119_pciconf0. Here I enabled SR-IOV and set the maximum number of virtual functions (VFs) to 127, which is the hardware limit. Note that it is fine to set this maximum in the firmware even if you won’t need all of them; you can always control how many VFs to actually create later in the driver.

$ mlxconfig -d /dev/mst/mt4119_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=127
$ reboot

After reboot, you can confirm your settings by querying the MST device.

$ mst start
$ mlxconfig -d /dev/mst/mt4119_pciconf0 q | grep SRIOV_EN
         SRIOV_EN                            True(1)
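Similarly, you can verify the maximum number of VFs (the exact output formatting may differ between mlxconfig versions):

$ mlxconfig -d /dev/mst/mt4119_pciconf0 q | grep NUM_OF_VFS
         NUM_OF_VFS                          127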

Enabling SR-IOV in the driver

We finally come to the last step in the whole process: initializing the VFs. Assuming we only want a single VF, we can do

$ echo 1 > /sys/class/infiniband/mlx5_0/device/mlx5_num_vfs
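At this point the VF should show up as its own PCI-E device next to the physical function. On my ConnectX-5 the output looks roughly like this (the exact device names will vary on your system):

$ lspci -D | grep Mellanox
0000:3b:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
0000:3b:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]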

Then we configure this VF with

$ echo Follow > /sys/class/infiniband/mlx5_0/device/sriov/0/policy
$ echo 11:22:33:44:77:66:77:90 > /sys/class/infiniband/mlx5_0/device/sriov/0/node
$ echo 11:22:33:44:77:66:77:91 > /sys/class/infiniband/mlx5_0/device/sriov/0/port
$ echo 0000:3b:00.1 > /sys/bus/pci/drivers/mlx5_core/unbind
$ echo 0000:3b:00.1 > /sys/bus/pci/drivers/mlx5_core/bind

Here we need to assign a GUID to both the node (i.e., the device) and the port, and we need to rebind the OFED driver for the new GUIDs to take effect. Note that you need to do this on every boot, and we have only configured a single VF! Repeating this for many VFs would be tedious, so I wrote a script to initialize the VFs automatically; you can find it on GitHub. I made it run on boot so the VFs are always set up for my VMs.
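To give an idea of what such a script does, here is a minimal sketch. The VF count, GUID prefix, and PCI bus address below are placeholders from my setup, so adapt them to yours.

#!/bin/bash
# Minimal sketch: create NUM_VFS VFs and assign each a unique GUID on boot.
# DEV, NUM_VFS, GUID_PREFIX, and PF_BUS are placeholders; adjust for your setup.
DEV=mlx5_0
NUM_VFS=4
GUID_PREFIX="11:22:33:44:77:66:77"
PF_BUS="0000:3b:00"

echo $NUM_VFS > /sys/class/infiniband/$DEV/device/mlx5_num_vfs

for i in $(seq 0 $((NUM_VFS - 1))); do
    # Assign node and port GUIDs derived from the prefix.
    echo Follow > /sys/class/infiniband/$DEV/device/sriov/$i/policy
    printf '%s:%02x\n' $GUID_PREFIX $((2 * i))     > /sys/class/infiniband/$DEV/device/sriov/$i/node
    printf '%s:%02x\n' $GUID_PREFIX $((2 * i + 1)) > /sys/class/infiniband/$DEV/device/sriov/$i/port
    # Rebind the VF so the new GUIDs take effect (here, VF i lives at function i+1).
    echo $PF_BUS.$((i + 1)) > /sys/bus/pci/drivers/mlx5_core/unbind
    echo $PF_BUS.$((i + 1)) > /sys/bus/pci/drivers/mlx5_core/bind
done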

Conclusion

In this post, I described the necessary steps to enable SR-IOV for Mellanox InfiniBand adapters. It involves several steps, but I believe the logic is clear. The setup may take some time, but you only need to do it once (with the automatic VF initialization). I really enjoy using SR-IOV as it provides great performance for MPI applications running in the VMs. I hope you find this post useful and are able to successfully utilize SR-IOV for whatever you might be doing.
