CSIT-1955

2n-spr: nginx regression around 2024-04-06 seems infra related


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Medium
    • rls2406

      Trending: [0]. Throughput and bandwidth go down, latency goes up. Last good run: [1]. First bad run: [2]. The issue appeared over a weekend, possibly interference from some weekly job. CSIT code was identical (the Monday run happens before we create a new oper branch). VPP had 4 more changes, none of which looks suspicious to me. I was not able to reach the old performance even using rls2402 CSIT with 24.02-release VPP. No test other than nginx seems affected.
      Testbeds other than 2n-spr are not affected; both NICs on 2n-spr are affected. RPS tests are also affected, but their results are noisier, so the point of regression is less clear there.
      There are two 2n-spr testbeds; both are present in trending, and both got affected at roughly the same time.

      We were not making any planned changes to the testbeds. There is one visible infra-related change, though. All old runs (I checked back to #196) have the same "show pci" output [7]:

      Address Sock VID:PID Link Speed Driver Product Name Vital Product Data
      0000:17:00.0 0 8086:1593 unknown ice Intel(R) Ethernet Network Adapte V1: 0x 49 6e 74 65 6c 28 52 29 ...
      PN: K92046-010
      SN: 507C6F484644
      V2: 0x 35 31 32 32
      RV: 0x 15
      0000:17:00.1 0 8086:1593 unknown ice Intel(R) Ethernet Network Adapte V1: 0x 49 6e 74 65 6c 28 52 29 ...
      PN: K92046-010
      SN: 507C6F484644
      V2: 0x 35 31 32 32
      RV: 0x 15
      0000:17:00.2 0 8086:1593 unknown vfio-pci Intel(R) Ethernet Network Adapte V1: 0x 49 6e 74 65 6c 28 52 29 ...
      PN: K92046-010
      SN: 507C6F484644
      V2: 0x 35 31 32 32
      RV: 0x 15
      0000:17:00.3 0 8086:1593 unknown vfio-pci Intel(R) Ethernet Network Adapte V1: 0x 49 6e 74 65 6c 28 52 29 ...
      PN: K92046-010
      SN: 507C6F484644
      V2: 0x 35 31 32 32
      RV: 0x 15
      0000:17:01.0 0 8086:1889 unknown <NONE>
      0000:17:09.0 0 8086:1889 unknown <NONE>
      0000:2a:00.0 0 8086:1592 unknown vfio-pci Intel(R) Ethernet Network Adapte V1: 0x 49 6e 74 65 6c 28 52 29 ...
      PN: K87758-009
      SN: 40A6B79EE998
      V2: 0x 32 37 32 32
      RV: 0x 58
      0000:2c:00.0 0 8086:1592 unknown vfio-pci Intel(R) Ethernet Network Adapte V1: 0x 49 6e 74 65 6c 28 52 29 ...
      PN: K87758-009
      SN: 40A6B79EE99C
      V2: 0x 32 37 32 32
      RV: 0x 4d
      0000:3d:00.0 0 15b3:1021 unknown mlx5_core NVIDIA ConnectX-7 Ethernet adapt PN: MCX713106AS-VEAT
      EC: A6
      V2: 0x 4d 43 58 37 31 33 31 30 ...
      SN: MT2244XZ027D
      V3: 0x 39 38 66 38 34 30 61 32 ...
      VA: 0x 4d 4c 58 3a 4d 4e 3d 4d ...
      V0: 0x 50 43 49 65 47 65 6e 35 ...
      VU: 0x 4d 54 32 32 34 34 58 5a ...
      RV: 0x 02 00
      0000:3d:00.1 0 15b3:1021 unknown mlx5_core NVIDIA ConnectX-7 Ethernet adapt PN: MCX713106AS-VEAT
      EC: A6
      V2: 0x 4d 43 58 37 31 33 31 30 ...
      SN: MT2244XZ027D
      V3: 0x 39 38 66 38 34 30 61 32 ...
      VA: 0x 4d 4c 58 3a 4d 4e 3d 4d ...
      V0: 0x 50 43 49 65 47 65 6e 35 ...
      VU: 0x 4d 54 32 32 34 34 58 5a ...
      RV: 0x 01 00
      0000:50:00.0 0 8086:1563 unknown ixgbe
      0000:50:00.1 0 8086:1563 unknown ixgbe

      When the regression happened, a new output appeared, but since then both the old and the new output have been seen (on both testbeds), so the real cause is persistent even if the "show pci" difference is not. I am not sure whether the differences are caused by some other job or by reboots.
      The new output [6]:

      Address Sock VID:PID Link Speed Driver Product Name Vital Product Data
      0000:17:00.0 0 8086:1593 unknown vfio-pci Intel(R) Ethernet Network Adapte V1: 0x 49 6e 74 65 6c 28 52 29 ...
      PN: K92046-010
      SN: 507C6F484570
      V2: 0x 35 31 32 32
      RV: 0x 17
      0000:17:00.1 0 8086:1593 unknown vfio-pci Intel(R) Ethernet Network Adapte V1: 0x 49 6e 74 65 6c 28 52 29 ...
      PN: K92046-010
      SN: 507C6F484570
      V2: 0x 35 31 32 32
      RV: 0x 17
      0000:17:00.2 0 8086:1593 unknown vfio-pci Intel(R) Ethernet Network Adapte V1: 0x 49 6e 74 65 6c 28 52 29 ...
      PN: K92046-010
      SN: 507C6F484570
      V2: 0x 35 31 32 32
      RV: 0x 17
      0000:17:00.3 0 8086:1593 unknown vfio-pci Intel(R) Ethernet Network Adapte V1: 0x 49 6e 74 65 6c 28 52 29 ...
      PN: K92046-010
      SN: 507C6F484570
      V2: 0x 35 31 32 32
      RV: 0x 17
      0000:2a:00.0 0 8086:1592 unknown vfio-pci Intel(R) Ethernet Network Adapte V1: 0x 49 6e 74 65 6c 28 52 29 ...
      PN: K87758-009
      SN: 40A6B79EE270
      V2: 0x 32 37 32 32
      RV: 0x 69
      0000:2c:00.0 0 8086:1592 unknown vfio-pci Intel(R) Ethernet Network Adapte V1: 0x 49 6e 74 65 6c 28 52 29 ...
      PN: K87758-009
      SN: 40A6B79EE274
      V2: 0x 32 37 32 32
      RV: 0x 65
      0000:3d:00.0 0 15b3:1021 unknown mlx5_core NVIDIA ConnectX-7 Ethernet adapt PN: MCX713106AS-VEAT
      EC: A6
      V2: 0x 4d 43 58 37 31 33 31 30 ...
      SN: MT2244XZ026T
      V3: 0x 33 65 63 62 63 37 31 35 ...
      VA: 0x 4d 4c 58 3a 4d 4e 3d 4d ...
      V0: 0x 50 43 49 65 47 65 6e 35 ...
      VU: 0x 4d 54 32 32 34 34 58 5a ...
      RV: 0x b8 00
      0000:3d:00.1 0 15b3:1021 unknown mlx5_core NVIDIA ConnectX-7 Ethernet adapt PN: MCX713106AS-VEAT
      EC: A6
      V2: 0x 4d 43 58 37 31 33 31 30 ...
      SN: MT2244XZ026T
      V3: 0x 33 65 63 62 63 37 31 35 ...
      VA: 0x 4d 4c 58 3a 4d 4e 3d 4d ...
      V0: 0x 50 43 49 65 47 65 6e 35 ...
      VU: 0x 4d 54 32 32 34 34 58 5a ...
      RV: 0x b7 00
      0000:50:00.0 0 8086:1563 unknown ixgbe
      0000:50:00.1 0 8086:1563 unknown ixgbe

      Note that the old output is the less wrong one: all four 0000:17:00.* Intel-E810XXV ports should be bound to the ice driver. The difference is perhaps only in whether an e810xxv test has been executed on the testbed since the last reboot. But that should not matter, as the regression is observed on Intel-E810CQ (0000:2a:00.0 and 0000:2c:00.0), and for those the output is identical (and wrong; ice should be bound).
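
      For reference, the driver column can be extracted mechanically from the two listings above. A minimal Python sketch (my own, not CSIT code; it assumes the plain-text layout shown above, where the driver is the fifth whitespace-separated field of each device line):

      import re

      # The first line of each device entry starts with a PCI address;
      # continuation lines (PN:, SN:, ...) do not.
      PCI_LINE = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]\s")

      def parse_bindings(show_pci_text):
          """Map PCI address -> driver from VPP "show pci" plain text."""
          bindings = {}
          for line in show_pci_text.splitlines():
              line = line.strip()
              if not PCI_LINE.match(line):
                  continue
              fields = line.split()
              # Columns: Address, Sock, VID:PID, Link Speed, Driver, ...
              bindings[fields[0]] = fields[4] if len(fields) > 4 else "<NONE>"
          return bindings

      def diff_bindings(old, new):
          """Print devices whose driver binding differs between two outputs."""
          for addr in sorted(old.keys() | new.keys()):
              if old.get(addr) != new.get(addr):
                  print(f"{addr}: {old.get(addr)} -> {new.get(addr)}")

      On the two listings above, this reports the ice -> vfio-pci change on 0000:17:00.0 and 0000:17:00.1 (and the 8086:1889 devices that appear only in the old listing).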

      VPP (non-hoststack) periodic jobs are probably unaffected only because their first testcase uses the AVF driver. So even though the "show pci" output is wrong [3] in the global suite setup, by the first test it is already good [4], probably because the binding got fixed in the local suite setup [5].

      Still, that does not explain why there was a regression. But maybe fixing the binding in the dpdk_plugin test suites will also fix the nginx performance?
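
      For illustration, the rebinding itself is just the standard Linux sysfs mechanism. A generic sketch of that mechanism (not the actual CSIT keyword; needs root):

      from pathlib import Path

      def rebind(pci_addr, new_driver):
          """Bind a PCI device to new_driver via the standard sysfs interface."""
          device = Path("/sys/bus/pci/devices") / pci_addr
          if (device / "driver").exists():
              # Release the device from its current driver (e.g. vfio-pci).
              (device / "driver" / "unbind").write_text(pci_addr)
          # driver_override makes the target driver accept this device.
          (device / "driver_override").write_text(new_driver)
          (Path("/sys/bus/pci/drivers") / new_driver / "bind").write_text(pci_addr)
          # Clear the override so later rebinds are not constrained.
          (device / "driver_override").write_text("\n")

      # The affected Intel-E810CQ ports from above:
      for addr in ("0000:2a:00.0", "0000:2c:00.0"):
          rebind(addr, "ice")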

      [0] https://csit.fd.io/trending/#eNqNkEEOgjAQRU9TN2aStkpw40LlHqS0IxARxnY06ulFYxxcmLhpF6-_7-cnHiKWCbu1yrYq3yqbt2E81GIzH68LEdgeEkUwWtdoyeDKaH-CQOEAzZA4sfMH0BUYD8gNtLRkTw0zQRcoYje4AH3d9lcwpTVlBp7S02B3T0M485dOCDU3Ib9LSMBFdJL4dJMHjGki-6ushPfRHTG1d5QfdCXYjzsKMf7byjea0PcAeaGyYtYP8fgaPyserQByKQ
      [1] https://jenkins.fd.io/job/csit-vpp-perf-hoststack-daily-master-2n-spr/209/
      [2] https://jenkins.fd.io/job/csit-vpp-perf-hoststack-daily-master-2n-spr/210/
      [3] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/227/log.html.gz#s1-s1-s1-k1-k6
      [4] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/227/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k5-k1-k1-k1-k1
      [5] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/227/log.html.gz#s1-s1-s1-s1-s1-k1-k5-k3-k1-k1-k1-k1-k1-k1-k1-k1-k1-k1-k1
      [6] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-2n-spr/210/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k7-k1-k1-k1-k1
      [7] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-2n-spr/209/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k7-k1-k1-k1-k1

            Assignee: Vratko Polak (vrpolak)
            Reporter: Vratko Polak (vrpolak)