vpp / VPP-1361

High failure rate of api call sw_interface_set_flags [admin-up|link-up]

    • Type: Bug
    • Resolution: Open
    • Priority: High
    • 18.07

      Summary:
      We observe a high failure rate (up to 50%) of the VPP interface link-up operation when initializing interfaces after VPP startup. This affects all CSIT physical testbeds (Haswell, Skylake) and all interface types (DPDK, AVF, vhost, memif), although it is most pronounced for DPDK and AVF interfaces because these are activated first.

      Description:
      Each CSIT test uses the following sequence to bring interfaces up:

      • First, bring both physical interfaces up via VAT:
          - sw_interface_set_flags sw_if_index <idx> admin-up.
          - We also tried sw_interface_set_flags sw_if_index <idx> admin-up link-up.
      • After setting all interfaces up, check via VAT whether the interfaces are actually up:
          - sw_interface_dump API calls in a FOR loop, running 30 times with a 1s wait in between.
      • In many tests sw_interface_dump keeps reporting interfaces as link_down (admin-up) despite a wait of up to 30s.
      • Happens for VPP physical DPDK interfaces as well as for vhost, memif, and AVF interfaces.
      • Happens both when the interface is activated with an MTU setting and without one (i.e. using the default).
      • This issue has only been observed recently; it was not present in the past.
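
      The check loop described above can be sketched roughly as follows. This is a minimal illustration, not the CSIT implementation: dump_interfaces is a hypothetical callable standing in for one sw_interface_dump call, assumed to return a mapping of sw_if_index to link state, and the 30 retries with a 1s interval are taken from the description.

```python
import time
from typing import Callable, Dict, Iterable


def wait_for_link_up(
    dump_interfaces: Callable[[], Dict[int, bool]],
    sw_if_indices: Iterable[int],
    retries: int = 30,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[int, bool]:
    """Poll until every listed interface reports link up, or retries run out.

    dump_interfaces is a hypothetical stand-in for one sw_interface_dump
    call; it is assumed to return {sw_if_index: link_is_up}.
    Returns the last observed link state for each requested interface.
    """
    wanted = list(sw_if_indices)
    state: Dict[int, bool] = {idx: False for idx in wanted}
    for attempt in range(retries):
        dumped = dump_interfaces()
        # An interface missing from the dump is treated as link down.
        state = {idx: bool(dumped.get(idx, False)) for idx in wanted}
        if all(state.values()):
            break
        # Do not sleep after the final attempt.
        if attempt < retries - 1:
            sleep(interval_s)
    return state
```

      Returning the per-interface state, rather than a single boolean, makes it possible to see whether one or both links stayed down when the loop times out.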

      More Details:

      • The sw_interface_dump check runs 30x (1s interval) in a FOR loop; increasing it to 60x (1s interval) had no effect.
      • Link-down is random: sometimes both interfaces are link-up, sometimes only one is link-up and the other link-down, and sometimes both links are down.
      • The issue is neither testbed-related nor cabling-related. We see it in ~1% of tests on 3Node-Haswell and 1% of tests on 2Node-Skylake, but on 3Node-Skylake 40-50%, and at times almost 90%, of tests show the symptoms.
      • Checking the interface state during a test (via the show interface CLI) revealed that the interfaces really are link-down, so the sw_interface_dump API is reporting the state correctly.
      • The issue affects all NICs, but due to test/NIC coverage it is spotted mostly on x520 and x710 (we also have xxv710 and xl710).
      • The issue is observed in the VPP master and stable/1807 branches.
      • Due to the sporadic nature of the symptoms, it is not clear when the issue was introduced; bisecting could take significant time.
      • The issue has been spotted sporadically on VIRL in the past, mainly on CentOS; we ignored it then because it was unclear whether it was CentOS-related.
      • The issue has also been observed on memif/vhost/AVF interfaces, but due to the lower probability on Haswell and the lower coverage of such tests vs. physical NICs, it is less frequent and harder to reproduce there. (We mostly catch it on physical NICs first because of the test design.)
      • We tried switching to dpdk1802, but this did not resolve the issue.
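
      To compare testbeds on the same footing, the per-test outcomes described above (all links up, only some up, all down) could be tallied with something like the sketch below. The function names and the input shape, one {sw_if_index: link_is_up} mapping per test, are assumptions made for illustration; they are not part of CSIT or VPP.

```python
from collections import Counter
from typing import Dict, Iterable


def classify_links(link_up: Dict[int, bool]) -> str:
    """Classify one test's final link states as all-up, partial, or all-down."""
    up = sum(1 for is_up in link_up.values() if is_up)
    if up == len(link_up):
        return "all-up"
    if up == 0:
        return "all-down"
    return "partial"


def failure_rate(tests: Iterable[Dict[int, bool]]) -> float:
    """Fraction of tests in which at least one interface stayed link-down."""
    outcomes = Counter(classify_links(t) for t in tests)
    total = sum(outcomes.values())
    if total == 0:
        return 0.0
    return 1.0 - outcomes["all-up"] / total
```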

      Thanks to Vratko we have captured core dumps (although VPP did not crash). Please see the attachment and the comments below. Be aware that the dumps are very large once extracted (~160GB).

      Sample logs from CSIT:
      https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-1807-3n-hsw/5/archives/log.html.gz#s1-s1-s1-s1-s24-t2-k2-k8-k1-k2-k1

      Same test, with show hardware and show interface output:
      https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-1807-3n-hsw/5/archives/log.html.gz#s1-s1-s1-s1-s24-t2-k3-k1-k4-k1-k1-k6-k1

      VAT command history:
      sw_interface_set_flags sw_if_index 2 admin-up
      hw_interface_set_mtu sw_if_index 2 mtu 9200
      sw_interface_set_flags sw_if_index 1 admin-up
      hw_interface_set_mtu sw_if_index 1 mtu 9200
      sw_interface_dump
      sw_interface_dump
      sw_interface_dump
      sw_interface_dump
      sw_interface_dump
      sw_interface_dump
      sw_interface_dump
      sw_interface_dump
      sw_interface_dump

      DUT1/DUT2 /etc/vpp/startup.conf:

      ip
      {
        heap-size 4G
      }
      statseg
      {
        size 4G
      }
      unix
      {
        cli-listen /run/vpp/cli.sock
        log /tmp/vpe.log
        nodaemon 
      }
      ip6
      {
        heap-size 4G
        hash-buckets 2000000
      }
      heapsize 4G
      plugins
      {
        plugin default
        {
          disable  
        }
        plugin acl_plugin.so
        {
          enable  
        }
        plugin dpdk_plugin.so
        {
          enable  
        }
      }
      cpu
      {
        corelist-workers 2,3
        main-core 1
      }
      dpdk
      {
        dev 0000:0a:00.0 
        dev 0000:0a:00.1 
        num-mbufs 16384
        uio-driver igb_uio
        log-level debug
        dev default
        {
          num-rx-queues 1
        }
        socket-mem 1024,1024
        no-tx-checksum-offload 
        no-multi-seg 
      }

            Assignee: Unassigned
            Reporter: Peter Mikus (pmikus)