  1. vpp
  2. VPP-1753

Occasional socket read failure on some dumps


    • Type: Bug
    • Resolution: Done
    • Priority: Medium
    • 19.08, 20.01
    • None
    • VPP API Infra
    • None
    • CSIT perf tests

      Seen in CSIT when using PAPI over a unix domain socket (forwarded via SSH from a remote machine). The symptoms are:
      The transport times out waiting for *_details (or control_ping_reply).
      It is unclear whether sockclnt_delete_reply is received.
      The subsequent reconnect attempt (after 1s) is refused.
      It is unclear whether VPP stops responding or the SSH forwarding of the unix domain socket breaks.
      The VPP process does not crash.
      It is unclear whether anything is logged: by the time the tests look at the log, it is already full after VPP is restarted for the next test.
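      One way to narrow down the "refused reconnect" symptom is to probe the local end of the forwarded socket directly. The following is a minimal sketch using only the Python standard library; the function name and the returned labels are made up for illustration and are not part of CSIT or PAPI. Note that a successful connect only shows that something is listening on the local end of the forward, not that VPP itself is answering.

```python
import errno
import socket


def probe_unix_socket(path, timeout=1.0):
    """Try to connect to a unix domain socket and classify the outcome.

    Hypothetical helper: the labels returned here are illustrative only.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
    except FileNotFoundError:
        return "missing"  # no socket file at this path
    except ConnectionRefusedError:
        return "refused"  # socket file exists, but nothing is listening
    except socket.timeout:
        return "timeout"  # connect did not complete in time
    except OSError as err:
        return "error: " + errno.errorcode.get(err.errno, str(err.errno))
    else:
        return "connected"
    finally:
        sock.close()
```

      Running such a probe right after a failed dump, both on the test driver (against the forwarded socket) and on the VPP host (against the original socket), could help tell a broken forward from an unresponsive VPP.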

      This is what the failure looks like in the Robot log.html:

      Happens more frequently on the ARM platform (or in labs outside LFN).
      This issue affects the rls1908 report testing somewhat, but the failure rate is quite low (say 97% of tests pass, even with multiple dump calls).
      When the test passes, the dump call is quite quick (0.4s on ARM).

      The issue happens mostly in performance tests. I have seen only one occurrence in devicetest (see the link above), and only on ARM. I was not able to reproduce it within the "make test" framework.

      I thought this issue had already been reported by Juraj Linkes, but now I see his e-mail was unicast (to two peers). A copy of the relevant part of his e-mail:

      I've been able to identify these Python api calls that fail:
      • memif_dump
      • sw_interface_dump
      • sw_interface_rx_placement_dump
      • sw_interface_vhost_user_dump

      This is probably a problem with the whole class of dump apis.
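      For context on why the whole class is affected in the same way: a dump request is typically followed by a control_ping, and the client keeps collecting *_details messages until the control_ping_reply arrives. The sketch below models only that collection loop with a deadline. The queue.Queue stands in for the socket transport, and the function name and message tuples are hypothetical, so this is an illustration of the pattern rather than the actual PAPI code.

```python
import queue
import time


def collect_dump(replies, timeout=2.0):
    """Collect *_details payloads until control_ping_reply or a timeout.

    `replies` is a queue.Queue of (message_name, payload) tuples that
    stands in for the socket transport (an assumption for this sketch).
    """
    deadline = time.monotonic() + timeout
    details = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"timed out after {len(details)} details")
        try:
            name, payload = replies.get(timeout=remaining)
        except queue.Empty:
            raise TimeoutError(
                f"timed out after {len(details)} details"
            ) from None
        if name == "control_ping_reply":
            return details  # the ping reply marks the end of the dump
        details.append(payload)
```

      In this model a single lost or delayed message, whether a *_details or the final control_ping_reply, produces the same timeout for any dump call, which would match failures being spread across unrelated dump APIs.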

      Here's some more info:
      • There are hundreds of dump calls and only a few of them actually fail (around 20)
      • There's one retry for each API call. When a dump API call is successful, it seems to always work on the first try (I haven't looked at all calls, but I sampled around 20). I haven't seen a situation where the first try failed but the second try was successful
      • Vratko mentioned to me that he has seen this happen on x86, but not as frequently
      • This has been happening since the switch to PapiSocketExecutor; Vratko is the author, so I am adding him to comment
      • There are other failures related to PapiSocketExecutor on x86 which we haven't seen (yet)
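      The retry observation above could be checked systematically by recording which attempt each call succeeds on. The following is a minimal sketch of such a wrapper; the function name, the `log` list and the choice to catch OSError (which covers socket errors and timeouts in the standard library) are assumptions for illustration and do not reflect how PapiSocketExecutor is actually written.

```python
def call_with_retry(call, retries=1, log=None):
    """Run call(); on OSError retry up to `retries` more times.

    Returns (result, attempt), with attempt counted from 1.
    Failed attempts are appended to `log` as (attempt, error_repr).
    Hypothetical helper, not part of CSIT.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return call(), attempt
        except OSError as err:  # assumption: transport errors are OSError
            if log is not None:
                log.append((attempt, repr(err)))
            if attempt > retries:
                raise  # out of retries, propagate the last error
```

      If the second attempt never succeeds across many runs, that would suggest the connection state is already broken after the first failure, rather than a single message being dropped.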

      This seems like a problem in PapiSocketExecutor or a problem with the VPP Python APIs; nothing ARM-specific. Definitely some sort of race condition is happening.

            vrpolak Vratko Polak