docs/use-cases/using-SRIOV-and-kata.md
Single Root I/O Virtualization (SR-IOV) enables splitting a physical device into virtual functions (VFs). Virtual functions enable direct passthrough to virtual machines or containers. For Kata Containers, we enabled a Container Network Model (CNM) plugin and made the necessary changes in the runtime to detect virtual functions in a container's network namespace, so that SR-IOV can be used for network-based devices.
To create a network with associated VFs, which can be passed to
Kata Containers, you must install an SR-IOV Docker plugin. The
created network is based on a physical function (PF) device. The network can
serve up to n containers, where n is the number of VFs associated with the
PF.
To install the plugin, follow the plugin installation instructions.
To set up your host for SR-IOV, the following must be true:
CONFIG_VFIO_NOIOMMU must be disabled in the host kernel
configuration. To disable it, you must rebuild your host system's kernel.
The NIC driver must be enabled in your guest kernel configuration (e.g. mlx5 for a Mellanox NIC). All the modules need to be compiled as built-in instead of loadable.
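The built-in requirement for the guest NIC driver can be checked against a kernel config file. The following is a minimal sketch: is_builtin is a hypothetical helper, and CONFIG_MLX5_CORE is used only as an example option for a Mellanox NIC. Substitute the option that matches your driver and the path of your own guest kernel config.

```shell
# Hypothetical helper: succeed only if OPTION is built in (=y) in CONFIG_FILE.
# A loadable module (=m) or an unset option counts as a failure.
is_builtin() {
    # $1: path to a kernel config file
    # $2: option name, for example CONFIG_MLX5_CORE
    grep -q "^$2=y\$" "$1"
}

# Example (the guest config path is an assumption):
#   is_builtin /path/to/guest/.config CONFIG_MLX5_CORE || echo "driver is not built in"
```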
The following is an example of how to use lspci to check if your NIC supports
SR-IOV.
$ lspci | grep -i -F ethernet
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 03)
...
$ #sudo required below to read the card capabilities
$ sudo lspci -s 01:00.0 -v | grep SR-IOV
Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
If your card does not report this capability, then it does not support SR-IOV.
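As an alternative to parsing lspci output, sysfs can be scanned for the sriov_totalvfs file that the kernel creates for SR-IOV capable physical functions. The sketch below assumes a kernel built with PCI IOV support. list_sriov_pfs is a hypothetical helper, and its optional directory argument exists only so that it can be pointed at a test tree.

```shell
# Print "<PCI address> <maximum VF count>" for every SR-IOV capable
# physical function. The optional argument overrides the sysfs PCI
# device directory (an assumption made so the helper is easy to test).
list_sriov_pfs() (
    devdir=${1:-/sys/bus/pci/devices}
    for f in "$devdir"/*/sriov_totalvfs; do
        [ -r "$f" ] || continue      # glob did not match or file unreadable
        dev=${f%/sriov_totalvfs}
        printf '%s %s\n' "${dev##*/}" "$(cat "$f")"
    done
)
```

Calling list_sriov_pfs with no argument scans the live /sys/bus/pci/devices tree. No output means no device exposes the file.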
Run the following command to see how the IOMMU groups are set up on your host system:
$ find /sys/kernel/iommu_groups/ -type l
The command's output shows whether your NIC is set up appropriately with respect to PCIe Access Control Services (ACS). If the IOMMU groups are set up properly, the PCI device for each ACS-enabled NIC port should be in its own IOMMU group. If the PCI bridge is within the same IOMMU group as your NIC, it indicates that either your device does not support ACS or your device does not appropriately share this default capability.
If you do not see any output when running the previous command, then you likely need to update your host's kernel configuration.
For more details, see the blog post, "IOMMU Groups, inside and out".
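The raw find output is easier to read when it is grouped by IOMMU group number. The following sketch prints one line per device. list_iommu_groups is a hypothetical helper name, and the optional root argument is an assumption added for testing.

```shell
# Print "group <N>: <PCI address>" for every device in every IOMMU group.
# The optional argument overrides the sysfs IOMMU group directory.
list_iommu_groups() (
    root=${1:-/sys/kernel/iommu_groups}
    for d in "$root"/*/devices/*; do
        [ -e "$d" ] || continue          # no groups present
        group=${d%/devices/*}            # strip "/devices/<address>"
        printf 'group %s: %s\n' "${group##*/}" "${d##*/}"
    done
)
```

A NIC port that shares its group number with a PCI bridge or another device is not isolated.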
Depending on your host kernel configuration, you might have to rebuild the kernel. If the following conditions are true, you do not need to rebuild your kernel:
CONFIG_VFIO_IOMMU_TYPE1, CONFIG_VFIO, and CONFIG_VFIO_PCI are set in
the kernel configuration. Your kernel is built with VFIO support when
these options are set.
CONFIG_VFIO_NOIOMMU is disabled in the host kernel configuration.
See the following steps one through three if you need to rebuild the kernel.
The following steps, which are based on the Ubuntu 16.04 distribution, update the SR-IOV host system. If you use a different distribution, make appropriate adjustments to the commands.
Before building a new kernel, keep in mind:
Grab kernel sources:
$ sudo apt-get install linux-source-<linux-version>
$ sudo apt-get install linux-headers-<linux-version>
$ cd /usr/src/linux-source-<linux-version>/
$ sudo tar -xvf linux-source-<linux-version>.tar.bz2
$ cd linux-source-<linux-version>
$ sudo apt-get install libssl-dev
Examine and update the config file if necessary:
$ sudo cp /boot/config-4.8.0-36-generic .config
$ # verify the resulting .config does not have NOIOMMU set, i.e. `CONFIG_VFIO_NOIOMMU` is not set
$ grep -q "^CONFIG_VFIO_NOIOMMU" .config || echo ok
$ # verify `CONFIG_VFIO_IOMMU_TYPE1`, `CONFIG_VFIO=m` and `CONFIG_VFIO_PCI=m` are set as well
$ for opt in CONFIG_VFIO_IOMMU_TYPE1 CONFIG_VFIO CONFIG_VFIO_PCI
  do
      grep "^${opt}=" .config
  done
$ sudo make olddefconfig
You might want to modify the kernel Makefile to add a unique identifier
to the EXTRAVERSION variable before running make. Setting EXTRAVERSION
causes the uname -r command to indicate that a customized kernel is
installed and running.
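The EXTRAVERSION edit can be scripted instead of done by hand. The following is a sketch that assumes GNU sed and a top-level kernel Makefile containing a line that starts with "EXTRAVERSION =". set_extraversion is a hypothetical helper, and the -sriov tag is an arbitrary example.

```shell
# Replace the EXTRAVERSION value in a kernel Makefile.
# $1: path to the top-level kernel Makefile
# $2: suffix to append to the kernel version, for example -sriov
#     (must not contain "/" or "&", which are special to sed here)
set_extraversion() {
    sed -i "s/^EXTRAVERSION =.*/EXTRAVERSION = $2/" "$1"
}

# Example, run from the kernel source directory:
#   set_extraversion Makefile -sriov
```

After the edit, make kernelrelease prints the version string that uname -r reports once the new kernel is booted.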
Build and install the kernel:
$ make -j <number_of_cpus>
$ make modules
$ sudo make modules_install
$ sudo make install
Edit grub to enable intel-iommu:
Edit /etc/default/grub and add intel_iommu=on to the kernel command line:
$ sudo sed -i -e 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 intel_iommu=on"/g' /etc/default/grub
$ sudo update-grub
Reboot the system and verify:
The host system should now be ready. Reboot the system.
$ sudo reboot
To verify the kernel version and the kernel command line, take a look at
/proc/version and /proc/cmdline.
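The command line check can be automated with a small helper that looks for an exact parameter. This is a sketch: cmdline_has is a hypothetical name, and the optional file argument is an assumption that lets it run against a saved copy of the command line.

```shell
# Succeed if the kernel command line contains exactly the given parameter.
# $1: parameter to look for, for example intel_iommu=on
# $2: optional file to read instead of /proc/cmdline
cmdline_has() {
    # Put each parameter on its own line, then match whole lines literally.
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Example:
#   cmdline_has intel_iommu=on && echo "IOMMU requested on the command line"
```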
Verify Intel VT-d is initialized:
To check if Intel VT-d initialized correctly, look for the following
line in the dmesg output:
DMAR: Intel(R) Virtualization Technology for Directed I/O
Older kernels use a different prefix (e.g. PCI-DMA):
PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
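Both message variants can be matched with a single pattern. The following sketch reads kernel log text on standard input. vtd_initialized is a hypothetical helper name.

```shell
# Succeed if the kernel log text on stdin contains the VT-d
# initialization message, with either the DMAR or the PCI-DMA prefix.
vtd_initialized() {
    grep -Eq '(DMAR|PCI-DMA): Intel\(R\) Virtualization Technology for Directed I/O'
}

# Example (reading dmesg may require sudo on some systems):
#   sudo dmesg | vtd_initialized && echo "VT-d initialized"
```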
Add the vfio-pci module:
$ sudo modprobe vfio-pci
Add PCI quirk for SR-IOV NIC if necessary:
$ find /sys/kernel/iommu_groups/ -type l
The previous command verifies that your NIC appears in its own IOMMU group
and no other devices appear in the same group. In the rare case where your
PCI NIC does not appear in its own group, it is likely that the NIC does
not support ACS or you are running an old kernel. Depending on your NIC
and whether it enforces isolation, you might resolve this by adding a
pcie_acs_override= option to your kernel command line and rebooting.
See PCIE-ACS-override-option for
detailed information about this option.
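To check a single NIC port rather than reading the whole group listing, you can list the other devices that share its IOMMU group. This sketch relies on the iommu_group link that sysfs exposes under each PCI device. iommu_group_peers is a hypothetical helper, and its optional directory argument is an assumption added for testing.

```shell
# Print the PCI addresses of all other devices in the same IOMMU group
# as the given device. No output means the device is alone in its group.
# $1: PCI address, for example 0000:01:00.0
# $2: optional override for the sysfs PCI device directory
iommu_group_peers() (
    devdir=${2:-/sys/bus/pci/devices}
    for peer in "$devdir/$1"/iommu_group/devices/*; do
        [ -e "$peer" ] || continue       # device has no IOMMU group
        [ "${peer##*/}" = "$1" ] || printf '%s\n' "${peer##*/}"
    done
)

# Example:
#   iommu_group_peers 0000:01:00.0
```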
The steps in the prior sections need to be performed only once to prepare the SR-IOV host system. The following steps are needed on every system boot to set up a physical device's virtual functions.
The procedure includes loading a device driver, finding out how many virtual functions (VFs) you can create, and creating those virtual functions. Once you create VFs, you cannot increase or decrease their number without first setting it back to zero. For this reason, you are expected to set the number of VFs for a physical device just once per boot.
Add vfio-pci device driver:
$ sudo modprobe vfio-pci
vfio-pci is a driver used to reserve a VF PCI device.
Find the NICs of interest:
$ lspci | grep Ethernet
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-LM (rev 04)
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
The previous example finds the PCI details for the NICs in question.
In our case, 01:00.0 and 01:00.1 are the two ports on the X540-AT2 card
that we will use. You can use the lshw command to get further details on the
controller and verify that it supports SR-IOV.
Check how many VFs you can create:
$ cat /sys/bus/pci/devices/0000\:01\:00.0/sriov_totalvfs
63
$ cat /sys/bus/pci/devices/0000\:01\:00.1/sriov_totalvfs
63
The previous commands show how many VFs you can create. The sriov_totalvfs
file under sysfs for a PCI device specifies the total number of VFs that you
can create.
Create the VFs:
$ echo 1 | sudo tee /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
$ echo 1 | sudo tee /sys/bus/pci/devices/0000\:01\:00.1/sriov_numvfs
Create virtual functions by writing the desired count to sriov_numvfs. This
example creates one VF per physical device. Note that creating only one VF
defeats the purpose of SR-IOV and is done here only for simplicity.
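Because the VF count cannot be changed directly from one nonzero value to another, a script that sets it can reset the count to zero first and check the request against sriov_totalvfs. The following is a sketch under those assumptions. set_numvfs is a hypothetical helper, and it must run from a root shell because it writes to sysfs.

```shell
# Set the number of VFs on a physical function.
# $1: sysfs directory of the PF, for example /sys/bus/pci/devices/0000:01:00.0
# $2: number of VFs to create
set_numvfs() (
    dev=$1
    want=$2
    total=$(cat "$dev/sriov_totalvfs") || exit 1
    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs, device supports at most $total" >&2
        exit 1
    fi
    echo 0 > "$dev/sriov_numvfs"         # reset before choosing a new count
    echo "$want" > "$dev/sriov_numvfs"
)

# Example, from a root shell:
#   set_numvfs /sys/bus/pci/devices/0000:01:00.0 4
```

Resetting to zero removes any existing VFs, so run this only when no container is using them.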
Verify the VFs were added to the host:
$ sudo lspci | grep Ethernet | grep Virtual
02:10.0 Ethernet controller: Intel Corporation X540 Ethernet Controller Virtual Function (rev 01)
02:10.1 Ethernet controller: Intel Corporation X540 Ethernet Controller Virtual Function (rev 01)
Assign a MAC address to each VF:
$ sudo ip link set <pf> vf <vfidx> mac <fake MAC address>
Depending on the NIC being used, you might need to explicitly set the MAC
address for the VF device. Setting the MAC address guarantees that the
address is consistent on the host and when passed to the guest. Verify that a MAC
address is assigned to the VF using the command ip link show dev <vf>.
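When several VFs need addresses, it can help to derive each MAC from the PF and VF index instead of inventing them by hand. The scheme below is purely illustrative and not part of the plugin or the runtime. It uses the 02 first octet, which marks a locally administered unicast address, and vf_mac is a hypothetical helper name.

```shell
# Print a deterministic, locally administered MAC address for a VF.
# $1: PF index (0-255), $2: VF index (0-255); both are decimal numbers
vf_mac() {
    printf '02:00:00:00:%02x:%02x\n' "$1" "$2"
}

# Example, using the interface name from this document:
#   sudo ip link set enp1s0f0 vf 0 mac "$(vf_mac 0 0)"
```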
The following example launches a Kata Containers container using SR-IOV:
Build and start SR-IOV plugin:
To install the SR-IOV plugin, follow the SR-IOV plugin installation instructions.
Create the docker network:
$ sudo docker network create -d sriov --internal --opt pf_iface=enp1s0f0 --opt vlanid=100 --subnet=192.168.0.0/24 vfnet
E0505 09:35:40.550129 2541 plugin.go:297] Numvfs and Totalvfs are not same on the PF - Initialize numvfs to totalvfs
ee2e5a594f9e4d3796eda972f3b46e52342aea04cbae8e5eac9b2dd6ff37b067
The previous command creates the required SR-IOV docker network with the given
subnet, vlanid, and physical interface.
Start containers and test their connectivity:
$ sudo docker run --runtime=kata-runtime --net=vfnet --cap-add SYS_ADMIN --ip=192.168.0.10 -it alpine
The previous example starts a container making use of SR-IOV.
If two machines with SR-IOV enabled NICs are connected back-to-back and each
has a network with matching vlanid created, use the following two commands
to test the connectivity:
Machine 1:
sriov-1:~$ sudo docker run --runtime=kata-runtime --net=vfnet --cap-add SYS_ADMIN --ip=192.168.0.10 -it mcastelino/iperf bash -c "mount -t ramfs -o size=20M ramfs /tmp; iperf3 -s"
Machine 2:
sriov-2:~$ sudo docker run --runtime=kata-runtime --net=vfnet --cap-add SYS_ADMIN --ip=192.168.0.11 -it mcastelino/iperf bash -c "mount -t ramfs -o size=20M ramfs /tmp; iperf3 -c 192.168.0.10"