Type: Bug
Resolution: Done
Priority: Medium
Fix Version/s: 19.08, 20.01
Affects Version/s: None
Component/s: VPP API Infra
Labels: None
Environment: CSIT perf tests

Seen in CSIT when using PAPI over a unix domain socket (forwarded via ssh from a remote machine). The symptoms are:
• The transport times out waiting for *_details (or control_ping_reply).
• Not sure whether sockclnt_delete_reply is received.
• The subsequent reconnect attempt (after 1s) is refused.
• Not sure whether VPP is not responding, or whether the SSH forwarding for the unix domain socket breaks.
• The VPP process does not crash.
• Not sure whether anything is logged; by the time the tests look at the log, it is already full after VPP is restarted for the next test.
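
To help tell "VPP is not responding" apart from "the forwarded socket is gone", a small probe along the following lines could be run against the socket path. This is only a sketch: the function name `probe_unix_socket` and its parameters are hypothetical, it is not part of PAPI or CSIT, and the default 1s delay merely mirrors the reconnect delay mentioned above.

```python
import socket
import time


def probe_unix_socket(path, attempts=2, delay=1.0, timeout=2.0):
    """Try to connect to a unix domain socket; return one outcome per attempt."""
    outcomes = []
    for attempt in range(attempts):
        if attempt:
            time.sleep(delay)  # pause before each reconnect attempt
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.connect(path)
            outcomes.append("connected")
        except ConnectionRefusedError:
            # The socket file exists, but nothing is listening on it.
            outcomes.append("refused")
        except FileNotFoundError:
            # The socket file itself is missing.
            outcomes.append("missing")
        except socket.timeout:
            outcomes.append("timeout")
        except OSError as exc:
            outcomes.append(f"error: {exc}")
        finally:
            sock.close()
    return outcomes
```

Running the probe both on the test driver (against the forwarded path) and on the VPP host (against the original socket) might show which side stops accepting connections.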

This is what the failure looks like in the Robot log.html:
https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-device-master-ubuntu1804-1n-tx2-vpp-verify-hourly/88/archives/log.html.gz#s1-s1-s1-s2-s3-t1-k2-k8-k1-k4

Happens more frequently on ARM platform (or in labs outside LFN).
This issue is affecting the rls1908 report testing somewhat, but the failure rate is quite low (roughly 97% of tests pass even with multiple dump calls).
When the test passes, the dump call is quite quick (0.4s on ARM).

The issue happens mostly in performance tests. I have seen only one occurrence in devicetest (see the link above), and only on ARM. I was not able to reproduce it within the "make test" framework.

I thought this issue had already been reported by Juraj Linkes, but now I see his e-mail was unicast (to two peers). Here is a copy of the relevant part of his e-mail:

I've been able to identify these Python API calls that fail:
• memif_dump
• sw_interface_dump
• sw_interface_rx_placement_dump
• sw_interface_vhost_user_dump

This is probably a problem with the whole class of dump APIs.
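
For readers unfamiliar with why the whole class of dump calls would share one failure mode: going by the symptoms above, a dump is read as a run of *_details messages closed by a control_ping_reply, so a single lost or delayed message stalls the whole call. The sketch below only illustrates that pattern. It is not PAPI's code; the `collect_dump` name and the queue of `(name, payload)` tuples standing in for the transport are assumptions.

```python
import queue
import time


def collect_dump(inbox, timeout=1.0):
    """Gather *_details messages from `inbox` until control_ping_reply arrives."""
    deadline = time.monotonic() + timeout
    details = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(
                f"no control_ping_reply after {len(details)} details"
            )
        try:
            name, payload = inbox.get(timeout=remaining)
        except queue.Empty:
            continue  # the loop re-checks the deadline and raises
        if name == "control_ping_reply":
            return details  # the ping reply marks the end of the dump
        details.append((name, payload))
```

With this shape, a timeout looks the same whether the details never arrived or only the final reply went missing, which fits the uncertainty described in the symptoms.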

Here's some more info:
• There are hundreds of dump calls and only a few of them actually fail - around 20
• There's one retry for each API call. When a dump API call is successful, it seems to always work on the first try (I haven't looked at all calls, but I sampled around 20). I haven't seen a situation where the first try failed, but the second try was successful
• Vratko mentioned to me that he has seen this happen on x86, but not as frequently
• This has been happening since the switch to PapiSocketExecutor, Vratko is the author, so adding him to comment
• There are other failures related to PapiSocketExecutor on x86 which we haven't seen (yet)
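
To check the "first try works or both tries fail" observation across all calls rather than a sample of 20, the retry could record which attempts failed. A minimal sketch follows; `call_with_retry` is a hypothetical helper, not the PapiSocketExecutor implementation, and `call` stands for any zero-argument function that performs one API call.

```python
def call_with_retry(call, retries=1):
    """Run call(); on an exception, retry up to `retries` more times.

    Returns (result, errors), where errors holds the exceptions raised by
    the failed attempts that preceded the successful one.
    """
    errors = []
    for _ in range(retries + 1):
        try:
            return call(), errors
        except Exception as exc:
            errors.append(exc)
    raise RuntimeError(f"all {retries + 1} attempts failed: {errors!r}")
```

A non-empty `errors` list on a successful return would be a case where the second try succeeded, which has not been observed so far.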

This seems like a problem in PapiSocketExecutor or a problem with the VPP Python APIs; nothing ARM-specific. Definitely some sort of race condition is happening.

Blocks: CSIT-1547 Use socket PAPI also for scale tests (Done)

Assignee: Vratko Polak
Reporter: Vratko Polak
Votes: 0
Watchers: 2
Created: 23/Aug/19 5:24 PM
Updated: 18/Oct/19 5:59 PM
Resolved: 18/Oct/19 5:59 PM