Bug
Resolution: Unresolved
High
This is a mix of symptoms; if it becomes likely that they have different causes, I will split this ticket.
One symptom (timeout waiting for privilege escalation prompt, causing unreservation failure) was seen before, but I do not see a separate ticket for it.
Some symptoms only cause test failures; others cause the testbed to get stuck in the reserved state.
The timeline for this weekend is below; here are my initial observations.
This issue seems to be different from CSIT-1848 and CSIT-1897, and is also unlikely to have been triggered by Gerrit 37590.
Very suspicious is the time of the first unexpected symptom (broken PAPI pipe): less than a minute after midnight, which is when the weekly jobs start.
My current hypothesis is that the first failure was caused by the aarch64 executor getting overloaded (3n-tsh and 2n-tx2 iterative jobs were not running then), which led to various timeouts, leaving the DUTs in a state that our Ansible cleanup was not taught to repair.
Here is the timeline:
iter-6: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-alt/6/console-timestamp.log.gz
reserved: 2023-02-10 19:52:26
unreserved: 2023-02-10 22:11:25
notes: last normal run before issues started
daily-164: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/164/console-timestamp.log.gz
reserved: 2023-02-10 22:12:02
unreservation failed, run ended 2023-02-11 06:50:10
notes:
+ VPP irreparably broke, first symptom at 2023-02-11 00:00:23
+ Unreservation failure reason: timeout for privilege escalation prompt on 10.30.51.73
(testbed stuck in reserved state until manual intervention 1)
iter-8: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-alt/8/console-timestamp.log.gz
reserved: 2023-02-11 19:54:45
unreserved: 2023-02-11 21:03:34
note: all failed due to: Failed to unbind PCI device 0004:04:00.1 on 10.30.51.73
iter-9: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-alt/9/console-timestamp.log.gz
reserved: 2023-02-11 21:03:59
unreserved: 2023-02-11 22:12:26
note: Failed to unbind PCI device 0004:04:00.1 on 10.30.51.73
weekly-45: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master-3n-alt/45/console-timestamp.log.gz
reserved: 2023-02-11 22:14:15
unreservation failed, run ended 2023-02-11 23:29:47
notes:
+ no dpdk app started on 10.30.51.73
+ unreservation failed due to prompt on 10.30.51.73
(testbed stuck in reserved state until manual intervention 2)
iter-7: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-alt/7/console-timestamp.log.gz
reserved: 2023-02-13 05:27:51
unreservation failed, run ended 2023-02-13 06:02:09
notes:
+ unreservation failed due to prompt on 10.30.51.73
+ VPP failed to start in the top suite setup, some PCI related errors in log:
++ vlib_pci_get_device_info: invalid PCI config for `/sys/bus/pci/devices/0004:04:00.1/config'
++ Unsupported PCI device 0x8086:0x0435 found at PCI address 0005:03:00.0
(testbed stuck in reserved state until manual intervention 3)
verify-152: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-verify-master-3n-alt/152/console-timestamp.log.gz
reserved: 2023-02-13 10:42:11
unreserved: 2023-02-13 21:16:28
notes: all works except CSIT-1848 (only visible for 4c tests of l3fwd)
iter-11: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-alt/11/console-timestamp.log.gz
notes:
+ Failed even before reservation: build timeout on apt-get update (for packagecloud download).
+ Timeout duration 2023-02-13 17:55:50 - 23:55:50 (started during verify-152).
iter-10: https://jenkins.fd.io/view/csit/job/csit-vpp-perf-report-iterative-2302-3n-alt/10/consoleFull
reserved: 2023-02-13 21:17:35
notes: still running, a few CSIT-1897 failures (expected due to verify-152), then all normal
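To make the timeline easier to compare at a glance, here is a minimal Python sketch that computes how long each run held the testbed and flags the runs where unreservation failed. The timestamps are copied from the timeline above. The names `RUNS`, `summarize`, and `seconds_after_midnight` are hypothetical helpers made up for this illustration and are not part of CSIT. For runs where unreservation failed, the "run ended" time is used as the end time, so the real time in the reserved state (until manual intervention) was longer.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (job, reserved, unreserved-or-run-ended, cleanly unreserved?)
# Values copied from the timeline; iter-10 and iter-11 are omitted because
# they have no end time or no reservation at all.
RUNS = [
    ("iter-6", "2023-02-10 19:52:26", "2023-02-10 22:11:25", True),
    ("daily-164", "2023-02-10 22:12:02", "2023-02-11 06:50:10", False),
    ("iter-8", "2023-02-11 19:54:45", "2023-02-11 21:03:34", True),
    ("iter-9", "2023-02-11 21:03:59", "2023-02-11 22:12:26", True),
    ("weekly-45", "2023-02-11 22:14:15", "2023-02-11 23:29:47", False),
    ("iter-7", "2023-02-13 05:27:51", "2023-02-13 06:02:09", False),
    ("verify-152", "2023-02-13 10:42:11", "2023-02-13 21:16:28", True),
]


def summarize(runs):
    """Return (job, duration, cleanly_unreserved) for each run."""
    rows = []
    for job, start, end, clean in runs:
        duration = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        rows.append((job, duration, clean))
    return rows


def seconds_after_midnight(timestamp):
    """Seconds elapsed since 00:00:00 on the day of the timestamp."""
    t = datetime.strptime(timestamp, FMT)
    return t.hour * 3600 + t.minute * 60 + t.second


if __name__ == "__main__":
    for job, duration, clean in summarize(RUNS):
        status = "unreserved" if clean else "STUCK (unreservation failed)"
        print(f"{job:<11} {duration}  {status}")
    # First symptom in daily-164, relative to the midnight start of weekly jobs.
    print("first symptom:", seconds_after_midnight("2023-02-11 00:00:23"), "s after midnight")
```

Running the script prints one line per run, which makes the three stuck reservations (daily-164, weekly-45, iter-7) easy to spot next to the runs that unreserved cleanly.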