Uploaded image for project: 'vpp'
  1. vpp
  2. VPP-1943

sw_interface_dump sometimes doesn't return info from all interfaces

XMLWordPrintable

    • Icon: Task Task
    • Resolution: Unresolved
    • Icon: Medium Medium
    • None
    • None
    • None
    • None

      When running multiple VPPs in parallel, we're observing an issue with VFs not coming up with dpdk plugin:
      With config
      dev 0000:05:10.0
      dev 0000:05:10.1

      Only one interface is up though:
      Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
      VirtualFunctionEthernet5/10/0 1 down 9000/0/0/0
      local0 0 down 0/0/0/0

      VPP logs shows that an error in dpdk plugin is associated with this failure state:
      2020/12/02 10:44:04:317 notice dpdk i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
      2020/12/02 10:44:04:317 notice dpdk i40evf_init_vf(): check_api version failed
      2020/12/02 10:44:04:317 notice dpdk i40evf_dev_init(): Init vf failed
      2020/12/02 10:44:04:317 notice dpdk EAL: Releasing pci mapped resource for 0000:05:10.1
      2020/12/02 10:44:04:317 notice dpdk EAL: Calling pci_unmap_resource for 0000:05:10.1 at 0x2101014000
      2020/12/02 10:44:04:317 notice dpdk EAL: Calling pci_unmap_resource for 0000:05:10.1 at 0x2101024000
      2020/12/02 10:44:04:317 notice dpdk EAL: Requested device 0000:05:10.1 cannot be used

      There are also associated error logs in dmesg:
      [Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 49 aq_err 12
      [Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 52 aq_err 12
      [Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 54 aq_err 12

      0000:05:00.1 is the PF associated with 0000:05:10.1, VF 48 is 0000:05:10.0, VF 49 0000:05:10.1 and so on.

      This suggests and issue with either NIC firmware or the i40e driver. More info in comment from Andrew below.

      ---------------------------------------------
      OLD DESCRIPTION:

      PAPI that was executed:
      sw_interface_dump(name_filter_valid=False,name_filter='')

      Show PCI right after VPP starts shows at 10:16:57.004:
      show pci
      Address Sock VID:PID Link Speed Driver Product Name Vital Product Data
      0000:91:02.4 1 8086:154c unknown vfio-pci
      0000:91:02.5 1 8086:154c unknown vfio-pci
      vpp#

      But then sw_interface_dump returns only partial interface information - 0000:91:02.4 is missing at 10:17:00.177 (3 seconds later):

      [OrderedDict([('_0', 84), ('context', 2377), ('sw_if_index', 0), ('sup_sw_if_index', 0), ('l2_address', MACAddress(00:00:00:00:00:00)), ('flags', <vl_api_if_status_flags_t.0: 0>), ('type', <vl_api_if_type_t.IF_API_TYPE_HARDWARE: 0>), ('link_duplex', <vl_api_link_duplex_t.LINK_DUPLEX_API_UNKNOWN: 0>), ('link_speed', 0), ('link_mtu', 0), ('mtu', [0, 0, 0, 0]), ('sub_id', 0), ('sub_number_of_tags', 0), ('sub_outer_vlan_id', 0), ('sub_inner_vlan_id', 0), ('sub_if_flags', <vl_api_sub_if_flags_t.0: 0>), ('vtr_op', 0), ('vtr_push_dot1q', 0), ('vtr_tag1', 0), ('vtr_tag2', 0), ('outer_tag', 0), ('b_dmac', MACAddress(00:00:00:00:00:00)), ('b_smac', MACAddress(00:00:00:00:00:00)), ('b_vlanid', 0), ('i_sid', 0), ('interface_name', 'local0'), ('interface_dev_type', 'local'), ('tag', '')]),

      OrderedDict([('_0', 84), ('context', 2377), ('sw_if_index', 1), ('sup_sw_if_index', 1), ('l2_address', MACAddress(ba:dc:0f:fe:02:05)), ('flags', <vl_api_if_status_flags_t.IF_STATUS_API_FLAG_LINK_UP|IF_STATUS_API_FLAG_ADMIN_UP: 3>), ('type', <vl_api_if_type_t.IF_API_TYPE_HARDWARE: 0>), ('link_duplex', <vl_api_link_duplex_t.LINK_DUPLEX_API_UNKNOWN: 0>), ('link_speed', 40000000), ('link_mtu', 9206), ('mtu', [9000, 0, 0, 0]), ('sub_id', 0), ('sub_number_of_tags', 0), ('sub_outer_vlan_id', 0), ('sub_inner_vlan_id', 0), ('sub_if_flags', <vl_api_sub_if_flags_t.0: 0>), ('vtr_op', 0), ('vtr_push_dot1q', 0), ('vtr_tag1', 0), ('vtr_tag2', 0), ('outer_tag', 0), ('b_dmac', MACAddress(00:00:00:00:00:00)), ('b_smac', MACAddress(00:00:00:00:00:00)), ('b_vlanid', 0), ('i_sid', 0), ('interface_name', 'VirtualFunctionEthernet91/2/5'), ('interface_dev_type', 'dpdk'), ('tag', '')])]

      When VPP starts, the dpdk plugin complains that /dev/vfio/142 doesn't exist:
      2020/11/10 12:36:03:864 notice dpdk EAL: Detected 64 lcore(s)
      2020/11/10 12:36:03:864 notice dpdk EAL: Detected 2 NUMA nodes
      2020/11/10 12:36:03:864 notice dpdk EAL: Selected IOVA mode 'VA'
      2020/11/10 12:36:03:864 notice dpdk EAL: Probing VFIO support...
      2020/11/10 12:36:03:864 notice dpdk EAL: VFIO support initialized
      2020/11/10 12:36:03:864 notice dpdk EAL: Cannot open /dev/vfio/142: No such device
      2020/11/10 12:36:03:864 notice dpdk EAL: Failed to open group 142
      2020/11/10 12:36:03:864 notice dpdk EAL: Requested device 0000:91:02.0 cannot be used
      2020/11/10 12:36:03:864 notice dpdk EAL: using IOMMU type 1 (Type 1)
      2020/11/10 12:36:03:864 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:91:02.1 (socket 1)
      2020/11/10 12:36:03:864 notice dpdk EAL: No legacy callbacks, legacy socket not created

      When this error happens, all subsequent tests fail the same way.

      Another way this can fail:
      2020/11/09 16:05:39:251 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:91:02.5 (socket 1)
      2020/11/09 16:05:39:251 notice dpdk i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
      2020/11/09 16:05:39:251 notice dpdk i40evf_init_vf(): check_api version failed
      2020/11/09 16:05:39:251 notice dpdk i40evf_dev_init(): Init vf failed
      2020/11/09 16:05:39:251 notice dpdk EAL: Releasing pci mapped resource for 0000:91:02.5
      2020/11/09 16:05:39:251 notice dpdk EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101014000
      2020/11/09 16:05:39:251 notice dpdk EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101024000
      2020/11/09 16:05:39:251 notice dpdk EAL: Requested device 0000:91:02.5 cannot be used

      This failure only results in one failed test, suggesting some bug in DPDK.

      Yet another way:
      2020/11/25 13:59:21:696 notice dpdk EAL: Detected 64 lcore(s)
      2020/11/25 13:59:21:696 notice dpdk EAL: Detected 2 NUMA nodes
      2020/11/25 13:59:21:696 notice dpdk EAL: Selected IOVA mode 'VA'
      2020/11/25 13:59:21:696 notice dpdk EAL: Probing VFIO support...
      2020/11/25 13:59:21:696 notice dpdk EAL: VFIO support initialized
      2020/11/25 13:59:21:696 notice dpdk EAL: using IOMMU type 1 (Type 1)
      2020/11/25 13:59:21:696 notice dpdk EAL: Getting a vfio_dev_fd for 0000:91:02.0 failed
      2020/11/25 13:59:21:696 notice dpdk EAL: Requested device 0000:91:02.0 cannot be used
      2020/11/25 13:59:21:696 notice dpdk EAL: Cannot open /dev/vfio/143: Device or resource busy
      2020/11/25 13:59:21:696 notice dpdk EAL: Failed to open group 143
      2020/11/25 13:59:21:696 notice dpdk EAL: Requested device 0000:91:02.1 cannot be used
      2020/11/25 13:59:21:696 notice dpdk EAL: Bus (pci) probe failed.

      There is something wrong with /dev/vfio/<n>.

      The issue happens sporadically, much more frequently on aach64 than x86_64. It only happens when there are multiple jobs running, suggesting this happens when multiple VPP instances simultaneously use the same PF (associated with the different VFs the different VPP instances are using). This would be a race condition and it makes sense that it happens more frequently on arm, since those servers are slower, providing a bigger window for overlap where the race condition could happen.

      ADDITIONAL INFORMATION:
      NIC:
      Intel Ethernet Converged Network Adapter XL710-Q2
      cat /sys/class/net/enp5s0f0/device/device
      0x1583
      cat /sys/class/net/enp5s0f0/device/vendor
      0x8086
      cat /sys/class/net/enp5s0f0/device/subsystem_device
      0x0001
      cat /sys/class/net/enp5s0f0/device/subsystem_vendor
      0x8086

      DRIVER/FIRMWARE VERSION:
      driver: i40e
      version: 2.1.14-k
      firmware-version: 6.01 0x800035da 1.1747.0
      expansion-rom-version:
      bus-info: 0000:05:00.0
      supports-statistics: yes
      supports-test: yes
      supports-eeprom-access: yes
      supports-register-dump: yes
      supports-priv-flags: yes

      The PF is using the i40e driver above.
      VFs are using vfio-pci.

      The environment where I've gathered logs from is Ubuntu18.04 with this kernel:
      Linux s55-t36-sut1 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:23:38 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux

      This also happens on x86 with this kernel (the environment is otherwise the same):
      Linux s1-t11-sut1 4.15.0-23-generic #25-Ubuntu SMP Wed May 23 18:02:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

            juraj.linkes Juraj Linkeš
            juraj.linkes Juraj Linkeš
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated: