VPP-1947

Simpler processing is less efficient with more workers


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Medium

      This is not a new issue, but I probably have not opened a Jira ticket for this yet.

      It seems that simpler processing (tests with higher throughput when 1 core is used) starts showing performance degradation as workers are added, eventually reaching lower throughput than other tests with similar but less simple processing when multiple cores are used.

      Example graphs from the rls2009 report: [0].
      As far as I can tell, this affects all NICs (as long as performance is not capped by hardware limits), all drivers (RDMA, DPDK, AVF) and all architectures (they only differ in the number of cores where the regression starts).

      I have prepared a small test run [1], on Haswell, as other testbeds are busy. Haswell is a 3-node testbed (no hyperthreading), so outputs are from 2 VPP boxes, and traffic is not entirely symmetric on them (one direction carries fewer packets, as some were already lost on the other box). I will collect more tests later.
      L2patch is faster than l2bdbasemaclrn with 2 cores, but slower with 4 cores. This affects both MRR and NDR/PDR results. Both tests use the same traffic profile; only the VPP configuration differs, as sketched below.
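
      For context, here is a minimal sketch of how the two forwarding modes are typically configured via vppctl (interface names and the bridge-domain id are placeholders, not the exact CSIT setup):

          # l2patch: packets received on one interface are patched straight to the other
          vppctl test l2patch rx TenGigabitEthernet0/8/0 tx TenGigabitEthernet0/9/0
          vppctl test l2patch rx TenGigabitEthernet0/9/0 tx TenGigabitEthernet0/8/0

          # l2bdbasemaclrn: the same interfaces in a bridge domain with MAC learning enabled
          vppctl create bridge-domain 1 learn 1
          vppctl set interface l2 bridge TenGigabitEthernet0/8/0 1
          vppctl set interface l2 bridge TenGigabitEthernet0/9/0 1

      The point is that l2patch does strictly less work per packet (no MAC lookup or learning), which is exactly the kind of "simpler processing" that shows the regression.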

      Looking at "show run" [2], there is some mismatch between *-output and *-tx nodes, but not big enough to explain the performance. Number of processed packets per cycle is low, so ideally no loss should happen.
      Looking at statistics [3] after measurements, I see rx_dropped_packets large enough to explain the performance. But I see no good reason why RX buffers should get full with this small number of cycles per packets.
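
      For reference, a minimal sketch of the stock vppctl commands that produce this kind of data around a trial (plain VPP CLI, not the exact CSIT telemetry sequence):

          # reset per-node and per-interface counters before the trial
          vppctl clear runtime
          vppctl clear interfaces
          # ... run the traffic trial ...
          # per-node vectors/call and clocks/packet, including the *-output and *-tx nodes
          vppctl show runtime
          # per-interface counters, where drops such as rx_dropped_packets typically show up
          vppctl show hardware-interfaces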

      The only explanation I see is that more frequent polling somehow causes the packet loss (as if reading from the RX queue takes a lock, preventing the NIC from adding packets to it). Is that possible? If yes, are there any recommended workarounds? If not, is there a bug in VPP to fix?
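
      In case it helps the investigation, here is a sketch of the startup.conf knobs that are usually tried first when RX rings appear to overflow under polling (a DPDK example; the values are placeholders, not a recommendation):

          dpdk {
            dev default {
              # larger descriptor rings give the NIC more slack when a poll is delayed
              num-rx-desc 2048
              num-tx-desc 2048
              # more RX queues per interface spread the load across workers
              num-rx-queues 2
            }
          }

      If such knobs do not change the picture, that would support the theory that the loss comes from the polling pattern itself rather than from ring sizing.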

      [0] https://docs.fd.io/csit/rls2009/report/vpp_performance_tests/throughput_speedup_multi_core/l2-2n-clx-xxv710.html
      [1] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-3n-hsw/1161/archives/log.html.gz#s1-s1-s1-s1
      [2] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-3n-hsw/1161/archives/log.html.gz#s1-s1-s1-s1-s3-t2-k2-k9-k6-k7-k1-k1-k1-k1-k12-k1-k1-k1-k1
      [3] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-3n-hsw/1161/archives/log.html.gz#s1-s1-s1-s1-s3-t2-k2-k9-k6-k10-k1-k1-k1-k1

            Assignee: Unassigned
            Reporter: Vratko Polak (vrpolak)