csit / CSIT-1791

Performance regression in RDMA tests


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Medium

      There are two symptoms. The first is a mostly negative comparison (both MRR [0] and PDR [1]) for the VPP 21.01-release build when re-tested with CSIT 2106 code. Except for the memif and vhost tests, the tests show regressions of 10% to 20% or more.
      The second symptom is performance capped at ~38 Mpps bidirectional. This is visible in trending, and also in the 4c comparison tables [2].
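To make the two numbers above concrete, here is a minimal sketch of the arithmetic behind a comparison percentage and behind the per-direction share of the cap. The function names and the sample rates are hypothetical and are not taken from [0]-[2]. The per-direction split assumes symmetric bidirectional traffic.

```python
def relative_change_percent(old_mpps: float, new_mpps: float) -> float:
    """Return the change from old to new as a percentage of old.

    A negative result means the new measurement is slower (a regression).
    """
    if old_mpps <= 0:
        raise ValueError("old_mpps must be positive")
    return (new_mpps - old_mpps) * 100.0 / old_mpps


def per_direction_mpps(bidirectional_mpps: float) -> float:
    """Split a bidirectional rate evenly, assuming symmetric traffic."""
    return bidirectional_mpps / 2.0


# Hypothetical rates, for illustration only.
old_rate, new_rate = 20.0, 17.0
print(f"change: {relative_change_percent(old_rate, new_rate):+.1f}%")
print(f"cap per direction: {per_direction_mpps(38.0):.1f} Mpps")
```

With these made-up inputs the change lands inside the 10% to 20% band described above, and a ~38 Mpps bidirectional cap corresponds to about 19 Mpps in each direction.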

      The fact that this also affects the 21.01-release VPP build means this is not solely a VPP bug.
      Looking through the daily job history, the last run without capped performance is [3] (the last run before the migration to Ubuntu 20.04), and the first run with it is [4] (after the migration and some additional fixes; the 3 runs in between failed).
      So this is probably an issue with the new versions (according to [5], the MLX5 firmware was upgraded from 16.25.1020 to 16.29.1016 and mlx5_core from 4.6-1.0.1 to 5.2-1.0.4), or with CSIT changes (possibly interacting with VPP behavior already present in 21.01-release).

      For the second symptom (performance cap), TRex reports all packets as sent, with zero queue_full, so TRex is probably not the guilty party here. The capped tests now show [6] nonzero "no free tx slots" counters, and "show run" reports fewer Vectors in the *-output node than in the *-tx node (another sign of loss on TX), although the values do not fully explain the packet loss (comparing directions) as seen from TRex. Still, it looks like there is TX throttling on the DUT, so maybe there is also TX throttling on TRex (but not reported, due to the different driver).
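The statement that the counters "do not fully explain the packet loss" can be expressed as a simple per-direction accounting check. The sketch below is only an illustration: the class, its field names, and the numbers are hypothetical, and the counter values would have to be read manually from the TRex results and the DUT logs such as [6].

```python
from dataclasses import dataclass


@dataclass
class DirectionCounters:
    """Counters for one traffic direction (all field names are hypothetical)."""
    tg_sent: int       # packets the traffic generator reports as sent
    tg_received: int   # packets the traffic generator received back
    dut_tx_drops: int  # DUT "no free tx slots" count for this direction


def unexplained_loss(c: DirectionCounters) -> int:
    """Loss seen by the traffic generator that DUT TX drops do not cover."""
    return (c.tg_sent - c.tg_received) - c.dut_tx_drops


# Made-up numbers, for illustration only.
example = DirectionCounters(tg_sent=1_000_000, tg_received=900_000,
                            dut_tx_drops=60_000)
print(unexplained_loss(example))
```

A clearly positive result for a direction would mean that some packets are lost somewhere other than the DUT TX queue, which would be consistent with the suspicion of unreported throttling on the TRex side, though it would not prove it.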

      For the first symptom (performance decrease on uncapped tests), "no free tx slots" does not occur, and a comparison of "show run" before [7] and after [8] is not clear enough for me to say which component is really slower. Perhaps it is rdma-input.
      The two symptoms may be two different issues that just appeared at the same time (due to the migration to Ubuntu 20.04) and affect the same tests.

      [0] https://docs.fd.io/csit/rls2106/report/_static/vpp/performance-changes-2n-clx-cx556a-2t1c-mrr.txt
      [1] https://docs.fd.io/csit/rls2106/report/_static/vpp/performance-changes-2n-clx-cx556a-2t1c-pdr.txt
      [2] https://docs.fd.io/csit/rls2106/report/_static/vpp/performance-changes-2n-clx-cx556a-8t4c-mrr.txt
      [3] https://jenkins.fd.io/job/csit-vpp-perf-mrr-daily-master-2n-clx/643/
      [4] https://jenkins.fd.io/job/csit-vpp-perf-mrr-daily-master-2n-clx/647/
      [5] https://gerrit.fd.io/r/c/csit/+/33218/1/docs/lab/testbeds_sm_clx_hw_bios_cfg.md
      [6] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/681/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k9-k6-k13-k1-k1-k1-k1
      [7] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/643/archives/log.html.gz#s1-s1-s1-s2-s2-t1-k2-k9-k6-k9-k1-k1-k1-k1-k14-k1-k1-k1-k1
      [8] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/647/archives/log.html.gz#s1-s1-s1-s2-s2-t1-k2-k9-k6-k9-k1-k1-k1-k1-k14-k1-k1-k1-k1

            Assignee: Unassigned
            Reporter: Vratko Polak (vrpolak)