-
Task
-
Resolution: Unresolved
-
Medium
-
None
-
None
-
None
-
None
When running multiple VPPs in parallel, we're observing an issue with VFs not coming up with dpdk plugin:
With config
dev 0000:05:10.0
dev 0000:05:10.1
Only one interface is up though:
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
VirtualFunctionEthernet5/10/0 1 down 9000/0/0/0
local0 0 down 0/0/0/0
VPP logs shows that an error in dpdk plugin is associated with this failure state:
2020/12/02 10:44:04:317 notice dpdk i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
2020/12/02 10:44:04:317 notice dpdk i40evf_init_vf(): check_api version failed
2020/12/02 10:44:04:317 notice dpdk i40evf_dev_init(): Init vf failed
2020/12/02 10:44:04:317 notice dpdk EAL: Releasing pci mapped resource for 0000:05:10.1
2020/12/02 10:44:04:317 notice dpdk EAL: Calling pci_unmap_resource for 0000:05:10.1 at 0x2101014000
2020/12/02 10:44:04:317 notice dpdk EAL: Calling pci_unmap_resource for 0000:05:10.1 at 0x2101024000
2020/12/02 10:44:04:317 notice dpdk EAL: Requested device 0000:05:10.1 cannot be used
There are also associated error logs in dmesg:
[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 49 aq_err 12
[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 52 aq_err 12
[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 54 aq_err 12
0000:05:00.1 is the PF associated with 0000:05:10.1, VF 48 is 0000:05:10.0, VF 49 0000:05:10.1 and so on.
This suggests and issue with either NIC firmware or the i40e driver. More info in comment from Andrew below.
---------------------------------------------
OLD DESCRIPTION:
PAPI that was executed:
sw_interface_dump(name_filter_valid=False,name_filter='')
Show PCI right after VPP starts shows at 10:16:57.004:
show pci
Address Sock VID:PID Link Speed Driver Product Name Vital Product Data
0000:91:02.4 1 8086:154c unknown vfio-pci
0000:91:02.5 1 8086:154c unknown vfio-pci
vpp#
But then sw_interface_dump returns only partial interface information - 0000:91:02.4 is missing at 10:17:00.177 (3 seconds later):
[OrderedDict([('_0', 84), ('context', 2377), ('sw_if_index', 0), ('sup_sw_if_index', 0), ('l2_address', MACAddress(00:00:00:00:00:00)), ('flags', <vl_api_if_status_flags_t.0: 0>), ('type', <vl_api_if_type_t.IF_API_TYPE_HARDWARE: 0>), ('link_duplex', <vl_api_link_duplex_t.LINK_DUPLEX_API_UNKNOWN: 0>), ('link_speed', 0), ('link_mtu', 0), ('mtu', [0, 0, 0, 0]), ('sub_id', 0), ('sub_number_of_tags', 0), ('sub_outer_vlan_id', 0), ('sub_inner_vlan_id', 0), ('sub_if_flags', <vl_api_sub_if_flags_t.0: 0>), ('vtr_op', 0), ('vtr_push_dot1q', 0), ('vtr_tag1', 0), ('vtr_tag2', 0), ('outer_tag', 0), ('b_dmac', MACAddress(00:00:00:00:00:00)), ('b_smac', MACAddress(00:00:00:00:00:00)), ('b_vlanid', 0), ('i_sid', 0), ('interface_name', 'local0'), ('interface_dev_type', 'local'), ('tag', '')]),
OrderedDict([('_0', 84), ('context', 2377), ('sw_if_index', 1), ('sup_sw_if_index', 1), ('l2_address', MACAddress(ba:dc:0f:fe:02:05)), ('flags', <vl_api_if_status_flags_t.IF_STATUS_API_FLAG_LINK_UP|IF_STATUS_API_FLAG_ADMIN_UP: 3>), ('type', <vl_api_if_type_t.IF_API_TYPE_HARDWARE: 0>), ('link_duplex', <vl_api_link_duplex_t.LINK_DUPLEX_API_UNKNOWN: 0>), ('link_speed', 40000000), ('link_mtu', 9206), ('mtu', [9000, 0, 0, 0]), ('sub_id', 0), ('sub_number_of_tags', 0), ('sub_outer_vlan_id', 0), ('sub_inner_vlan_id', 0), ('sub_if_flags', <vl_api_sub_if_flags_t.0: 0>), ('vtr_op', 0), ('vtr_push_dot1q', 0), ('vtr_tag1', 0), ('vtr_tag2', 0), ('outer_tag', 0), ('b_dmac', MACAddress(00:00:00:00:00:00)), ('b_smac', MACAddress(00:00:00:00:00:00)), ('b_vlanid', 0), ('i_sid', 0), ('interface_name', 'VirtualFunctionEthernet91/2/5'), ('interface_dev_type', 'dpdk'), ('tag', '')])]
When VPP starts, the dpdk plugin complains that /dev/vfio/142 doesn't exist:
2020/11/10 12:36:03:864 notice dpdk EAL: Detected 64 lcore(s)
2020/11/10 12:36:03:864 notice dpdk EAL: Detected 2 NUMA nodes
2020/11/10 12:36:03:864 notice dpdk EAL: Selected IOVA mode 'VA'
2020/11/10 12:36:03:864 notice dpdk EAL: Probing VFIO support...
2020/11/10 12:36:03:864 notice dpdk EAL: VFIO support initialized
2020/11/10 12:36:03:864 notice dpdk EAL: Cannot open /dev/vfio/142: No such device
2020/11/10 12:36:03:864 notice dpdk EAL: Failed to open group 142
2020/11/10 12:36:03:864 notice dpdk EAL: Requested device 0000:91:02.0 cannot be used
2020/11/10 12:36:03:864 notice dpdk EAL: using IOMMU type 1 (Type 1)
2020/11/10 12:36:03:864 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:91:02.1 (socket 1)
2020/11/10 12:36:03:864 notice dpdk EAL: No legacy callbacks, legacy socket not created
When this error happens, all subsequent tests fail the same way.
Another way this can fail:
2020/11/09 16:05:39:251 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:91:02.5 (socket 1)
2020/11/09 16:05:39:251 notice dpdk i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
2020/11/09 16:05:39:251 notice dpdk i40evf_init_vf(): check_api version failed
2020/11/09 16:05:39:251 notice dpdk i40evf_dev_init(): Init vf failed
2020/11/09 16:05:39:251 notice dpdk EAL: Releasing pci mapped resource for 0000:91:02.5
2020/11/09 16:05:39:251 notice dpdk EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101014000
2020/11/09 16:05:39:251 notice dpdk EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101024000
2020/11/09 16:05:39:251 notice dpdk EAL: Requested device 0000:91:02.5 cannot be used
This failure only results in one failed test, suggesting some bug in DPDK.
Yet another way:
2020/11/25 13:59:21:696 notice dpdk EAL: Detected 64 lcore(s)
2020/11/25 13:59:21:696 notice dpdk EAL: Detected 2 NUMA nodes
2020/11/25 13:59:21:696 notice dpdk EAL: Selected IOVA mode 'VA'
2020/11/25 13:59:21:696 notice dpdk EAL: Probing VFIO support...
2020/11/25 13:59:21:696 notice dpdk EAL: VFIO support initialized
2020/11/25 13:59:21:696 notice dpdk EAL: using IOMMU type 1 (Type 1)
2020/11/25 13:59:21:696 notice dpdk EAL: Getting a vfio_dev_fd for 0000:91:02.0 failed
2020/11/25 13:59:21:696 notice dpdk EAL: Requested device 0000:91:02.0 cannot be used
2020/11/25 13:59:21:696 notice dpdk EAL: Cannot open /dev/vfio/143: Device or resource busy
2020/11/25 13:59:21:696 notice dpdk EAL: Failed to open group 143
2020/11/25 13:59:21:696 notice dpdk EAL: Requested device 0000:91:02.1 cannot be used
2020/11/25 13:59:21:696 notice dpdk EAL: Bus (pci) probe failed.
There is something wrong with /dev/vfio/<n>.
The issue happens sporadically, much more frequently on aach64 than x86_64. It only happens when there are multiple jobs running, suggesting this happens when multiple VPP instances simultaneously use the same PF (associated with the different VFs the different VPP instances are using). This would be a race condition and it makes sense that it happens more frequently on arm, since those servers are slower, providing a bigger window for overlap where the race condition could happen.
ADDITIONAL INFORMATION:
NIC:
Intel Ethernet Converged Network Adapter XL710-Q2
cat /sys/class/net/enp5s0f0/device/device
0x1583
cat /sys/class/net/enp5s0f0/device/vendor
0x8086
cat /sys/class/net/enp5s0f0/device/subsystem_device
0x0001
cat /sys/class/net/enp5s0f0/device/subsystem_vendor
0x8086
DRIVER/FIRMWARE VERSION:
driver: i40e
version: 2.1.14-k
firmware-version: 6.01 0x800035da 1.1747.0
expansion-rom-version:
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
The PF is using the i40e driver above.
VFs are using vfio-pci.
The environment where I've gathered logs from is Ubuntu18.04 with this kernel:
Linux s55-t36-sut1 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:23:38 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux
This also happens on x86 with this kernel (the environment is otherwise the same):
Linux s1-t11-sut1 4.15.0-23-generic #25-Ubuntu SMP Wed May 23 18:02:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux