csit / CSIT-1848

3n-alt: testpmd tests fail due to the DUT-DUT link taking long to come up


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Medium
    • Labels: close old issues

      rca: 
         test: testpmd, l3fwd
         priority: medium
         frequency: all
         testbed: 3n-alt, 3n-snr
         example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master-3n-alt/29/log.html.gz#s1-s1-s1-s1-t1

      https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2206-3n-alt/9/log.html.gz#s1-s1-s1-s1-t1  

       

      https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-3n-snr/6/log.html.gz#s1-s1-s1-s1-t1

       

      https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2302-3n-alt/1/log.html.gz#s1-s1-s1-s1-t1

       

      The Altra servers are two-socket and the topology is TG -> DUT1 -> DUT2 -> TG. Traffic flows in both directions, but nothing gets forwarded (with a slight caveat; put a pin in this). There's nothing special in the tests, just plain traffic forwarding. The NIC we're testing is the XL710-QDA2.

       

      The same tests are passing on all other testbeds - we have various two node (1 DUT, 1 TG) and three node (2 DUT, 1 TG) Intel and Arm testbeds and with various NICs (Intel 700 and 800 series and the Intel testbeds use some Mellanox NICs as well). We don't have quite the same combination of another three node topology with the same NIC though, so it looks like something with testpmd/l3fwd and xl710-QDA2 on Altra servers.
      The one other testbed that possibly exhibits the problem is the three-node Snowridge testbed.

       

      VPP performance tests are passing, but l3fwd and testpmd fail. This leads us to believe it's a software issue, but there could be something wrong with the hardware. I'll talk about testpmd from now on, but as far as we can tell, the behavior is the same for testpmd and l3fwd.

       

      Getting back to the caveat mentioned earlier, there seems to be something wrong with port shutdown. When running testpmd on a testbed that hasn't been used for a while, all ports seem to be up right away (we don't see any "Port 0|1: link state change event") and the setup works fine (forwarding works). After restarting testpmd (restarting it on one server is sufficient), the ports between DUT1 and DUT2 (but not between the DUTs and the TG) go down and are not usable in DPDK, VPP, or Linux (with the i40e kernel driver) for a while (measured in minutes, sometimes dozens of minutes; the duration is seemingly random). The ports eventually recover and can be used again, but there's nothing in syslog suggesting what happened.
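      Since the recovery time is seemingly random, a sketch like the following can help pin it down by timestamping the link state read from sysfs; transitions can then be logged while the port is stuck. The interface name in the usage comment is an example only, not from the testbed; substitute the actual DUT-DUT i40e netdev.

```shell
#!/usr/bin/env bash
# Hypothetical helper for tracking when the faulty DUT1<->DUT2 port comes
# back: read_operstate prints "<ISO timestamp> <state>" from a sysfs
# operstate file. The path is passed in so it can be pointed anywhere.

read_operstate() {
    # $1: path to an operstate file, e.g. /sys/class/net/<ifname>/operstate
    printf '%s %s\n' "$(date -Is)" "$(cat "$1")"
}
```

Polling once a second and logging only transitions (the interface name is an example) could then look like:

```
last=""
while :; do
    s="$(read_operstate /sys/class/net/enP4p4s0f0/operstate)"
    [ "${s#* }" != "$last" ] && echo "$s" && last="${s#* }"
    sleep 1
done
```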

       

      What seems to be happening is that testpmd puts the ports into some faulty state. This only happens on the DUT1 -> DUT2 link, though (the ports between the two testpmds), not on the TG -> DUT1 link (the TG port is left alone).

       

      Some more info:

      We've come across the issue with this configuration:

      OS: Ubuntu 20.04 with kernel 5.4.0-65-generic.

      Old NIC firmware, never upgraded: 6.01 0x800035da 1.1747.0.

      Driver versions: i40e 2.17.15 and iavf 4.3.19.

       

      As well as with this configuration:

      OS: Ubuntu 22.04 with kernel 5.15.0-46-generic.

      Updated firmware: 8.30 0x8000a4ae 1.2926.0.

      Drivers: i40e 2.19.3 and iavf 4.5.3.

       

      Unsafe noiommu mode is disabled:

      cat /sys/module/vfio/parameters/enable_unsafe_noiommu_mode

      N
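      For repeated checks across DUTs, the same verification can be wrapped in a small helper (a sketch; only the default sysfs path comes from the output above, the function name is ours):

```shell
#!/usr/bin/env bash
# Sketch of the noiommu check above as a reusable helper (the function name
# is hypothetical). It reads the vfio module parameter and reports whether
# unsafe noiommu mode is enabled ("Y") or disabled ("N"). The path argument
# defaults to the standard location but can be overridden.
noiommu_status() {
    local param="${1:-/sys/module/vfio/parameters/enable_unsafe_noiommu_mode}"
    if [ "$(cat "$param")" = "N" ]; then
        echo "unsafe noiommu mode: disabled"
    else
        echo "unsafe noiommu mode: enabled"
    fi
}
```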

       

      We used DPDK 22.07 in manual testing and built it on the DUTs, using a generic build:

      meson -Dexamples=l3fwd -Dc_args=-DRTE_LIBRTE_I40E_16BYTE_RX_DESC=y -Dplatform=generic build

       

      We're running testpmd with this command:

      sudo build/app/dpdk-testpmd -v -l 1,2 -a 0004:04:00.1 -a 0004:04:00.0 --in-memory -- -i --forward-mode=io --burst=64 --txq=1 --rxq=1 --tx-offloads=0x0 --numa --auto-start --total-num-mbufs=32768 --nb-ports=2 --portmask=0x3 --max-pkt-len=1518 --mbuf-size=16384 --nb-cores=1

       

      And l3fwd (with different MACs on the other server):

      sudo /tmp/openvpp-testing/dpdk/build/examples/dpdk-l3fwd -v -l 1,2 -a 0004:04:00.0 -a 0004:04:00.1 --in-memory -- --parse-ptype --eth-dest="0,40:a6:b7:85:e7:79" --eth-dest="1,3c:fd:fe:c3:e7:a1" --config="(0, 0, 2),(1, 0, 2)" -P -L -p 0x3

       

      We tried adding logs with --log-level=pmd,debug and --no-lsc-interrupt, but that didn't reveal anything helpful, as far as we can tell; please have a look at the attached log. The faulty port is port 0 (it starts out as down, then we waited around 25 minutes for it to go up, and then we shut down testpmd).

            juraj.linkes Juraj Linkeš
            vluc Viliam Luc