vpp / VPP-526

VPP Crash in vhost_user_if_input

Type: Bug
Priority: Medium
Resolution: Open
      Summary:

VPP crashes in vhost_user_if_input. This happens when using multiple threads and queues and rebooting the system while sending traffic.

      The traceback is as follows:

      (gdb) bt
      #0 0x00007ffff70671d0 in vhost_user_if_input (vm=0x7fffb6d64648, vum=0x7ffff747ee20 <vhost_user_main>, vui=0x7fffb6ec75cc, node=0x7fffb6e06cb8) at /scratch/myciscoatt/src/vpp/build-data/../vnet/vnet/devices/virtio/vhost-user.c:1150
      #1 0x00007ffff7067df5 in vhost_user_input (vm=0x7fffb6d64648, node=0x7fffb6e06cb8, f=0x0) at /scratch/myciscoatt/src/vpp/build-data/../vnet/vnet/devices/virtio/vhost-user.c:1361
      #2 0x00007ffff74d4fff in dispatch_node (vm=0x7fffb6d64648, node=0x7fffb6e06cb8, type=VLIB_NODE_TYPE_INPUT, dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x0, last_time_stamp=116904157643429) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/main.c:996
      #3 0x00007ffff751a2e1 in vlib_worker_thread_internal (vm=0x7fffb6d64648) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/threads.c:1389
      #4 0x00007ffff751a5c3 in vlib_worker_thread_fn (arg=0x7fffb5427c70) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/threads.c:1455
      #5 0x00007ffff62b0314 in clib_calljmp () at /scratch/myciscoatt/src/vpp/build-data/../vppinfra/vppinfra/longjmp.S:110
      #6 0x00007fff768bcc00 in ?? ()
      #7 0x00007ffff7515a44 in vlib_worker_thread_bootstrap_fn (arg=0x7fffb5427c70) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/threads.c:516
      Backtrace stopped: previous frame inner to this frame (corrupt stack?)
      (gdb) p/x txvq
      $10 = 0x7fffb6ec7954
      (gdb) p/x *txvq
$11 = {qsz = 0x100, last_avail_idx = 0x0, last_used_idx = 0x0, desc = 0x7fd8a4630000, avail = 0x7fd8a4631000, used = 0x7fd8a4632000, log_guest_addr = 0x424632000, callfd = 0x56, kickfd = 0x5e, errfd = 0x0, enabled = 0x1, log_used = 0x0, callfd_idx = 0x11, n_since_last_int = 0x0, int_deadline = 0x52d6}

The pointers look good when examined in gdb, so this points to a race condition. The race may be that the shared memory is briefly unavailable while the guest system is rebooting.

      Dave Barach looked at this and here is his summary:

      Guys,

      John D. asked me to take a look at a multiple-worker, multiple-queue vhost_user crash scenario. After some fiddling, I found a scenario that’s 100% reproducible. With vpp provisioned by the ML2 plugin [or whatever calls itself “test_papi”], ssh into the compute vm and type “sudo /sbin/reboot”.

      This scenario causes a mild vhost_user shared-memory earthquake with traffic flowing.

      One of the worker threads will receive SIGSEGV, right here:

      /* vhost_user_if_input, at or near line 1142 */

      u32 next_desc =
      txvq->avail->ring[(txvq->last_avail_idx + 1) & qsz_mask];

      By the time one can look at the memory reference in gdb, the memory is accessible. My guess: qemu briefly changes protections on the vhost_user shared-memory segment, yadda yadda yadda.

      This scenario never causes an issue when running single-queue, single-core.

An API trace - see below - indicates that vpp receives no notification of any kind. There isn't a hell of a lot that the vhost_user driver can do to protect itself.

Time for someone to stare at the qemu code, I guess...

      HTH… Dave

api trace custom-dump /tmp/twoboot
        SCRIPT: memclnt_create name test_papi
        SCRIPT: sw_interface_dump all
        SCRIPT: control_ping
        SCRIPT: sw_interface_dump all
        SCRIPT: control_ping
        SCRIPT: sw_interface_set_flags sw_if_index 1 admin-up link-up
        SCRIPT: bridge_domain_add_del bd_id 5678 flood 1 uu-flood 1 forward 1 learn 1 arp-term 0
        SCRIPT: sw_interface_set_l2_bridge sw_if_index 1 bd_id 5678 shg 0 enable
        SCRIPT: tap_connect tapname vppef940067-0b mac fa:16:3e:6e:22:41
        SCRIPT: sw_interface_set_flags sw_if_index 4 admin-up link-up
        SCRIPT: sw_interface_set_l2_bridge sw_if_index 4 bd_id 5678 shg 0 enable
        SCRIPT: sw_interface_dump all
        SCRIPT: control_ping
        SCRIPT: sw_interface_dump all
        SCRIPT: control_ping
        SCRIPT: sw_interface_set_flags sw_if_index 3 admin-up link-up
        SCRIPT: bridge_domain_add_del bd_id 5679 flood 1 uu-flood 1 forward 1 learn 1 arp-term 0
        SCRIPT: sw_interface_set_l2_bridge sw_if_index 3 bd_id 5679 shg 0 enable
        SCRIPT: create_vhost_user_if socket /tmp/52970d78-dad3-4887-b4bf-df90d3e13602
        SCRIPT: sw_interface_set_flags sw_if_index 5 admin-up link-up
        SCRIPT: sw_interface_set_l2_bridge sw_if_index 5 bd_id 5679 shg 0 enable
        SCRIPT: create_vhost_user_if socket /tmp/92473e06-ea98-4b4f-80df-c9bb702c3885
        SCRIPT: sw_interface_set_flags sw_if_index 6 admin-up link-up
        SCRIPT: sw_interface_set_l2_bridge sw_if_index 6 bd_id 5678 shg 0 enable
        SCRIPT: sw_interface_dump all
        SCRIPT: control_ping
        SCRIPT: sw_interface_dump all
        SCRIPT: control_ping
        SCRIPT: sw_interface_set_flags sw_if_index 2 admin-up link-up
        SCRIPT: bridge_domain_add_del bd_id 5680 flood 1 uu-flood 1 forward 1 learn 1 arp-term 0
        SCRIPT: sw_interface_set_l2_bridge sw_if_index 2 bd_id 5680 shg 0 enable
        SCRIPT: create_vhost_user_if socket /tmp/e2261ff9-4953-4368-a8c9-8005ccf0e896
        SCRIPT: sw_interface_set_flags sw_if_index 7 admin-up link-up
        SCRIPT: sw_interface_set_l2_bridge sw_if_index 7 bd_id 5680 shg 0 enable
        SCRIPT: create_vhost_user_if socket /tmp/b5d9c5f0-0494-4bd0-bb28-437f5261fad5
        SCRIPT: sw_interface_set_flags sw_if_index 8 admin-up link-up
        SCRIPT: sw_interface_set_l2_bridge sw_if_index 8 bd_id 5679 shg 0 enable
        SCRIPT: tap_connect tapname vppb7464b44-11 mac fa:16:3e:66:31:79
        SCRIPT: sw_interface_set_flags sw_if_index 9 admin-up link-up
        SCRIPT: sw_interface_set_l2_bridge sw_if_index 9 bd_id 5680 shg 0 enable
        SCRIPT: tap_connect tapname vppab16509a-c5 mac fa:16:3e:c2:9f:ac
        SCRIPT: sw_interface_set_flags sw_if_index 10 admin-up link-up
        SCRIPT: sw_interface_set_l2_bridge sw_if_index 10 bd_id 5679 shg 0 enable
        SCRIPT: create_vhost_user_if socket /tmp/783d34a8-3e72-4434-97cf-80c7e199e66c
        SCRIPT: sw_interface_set_flags sw_if_index 11 admin-up link-up
        SCRIPT: sw_interface_set_l2_bridge sw_if_index 11 bd_id 5678 shg 0 enable
        SCRIPT: create_vhost_user_if socket /tmp/67a02881-e241-4ae4-abb4-dfa03e951772
        SCRIPT: sw_interface_set_flags sw_if_index 12 admin-up link-up
        SCRIPT: sw_interface_set_l2_bridge sw_if_index 12 bd_id 5680 shg 0 enable
        SCRIPT: memclnt_create name vpp_api_test # connect vpp_api_test prior to rebooting vm, as described
        SCRIPT: sw_interface_dump name_filter Ether
        SCRIPT: sw_interface_dump name_filter lo
        SCRIPT: sw_interface_dump name_filter pg
        SCRIPT: sw_interface_dump name_filter vxlan_gpe
        SCRIPT: sw_interface_dump name_filter vxlan
        SCRIPT: sw_interface_dump name_filter host
        SCRIPT: sw_interface_dump name_filter l2tpv3_tunnel
        SCRIPT: sw_interface_dump name_filter gre
        SCRIPT: sw_interface_dump name_filter lisp_gpe
        SCRIPT: sw_interface_dump name_filter ipsec
        SCRIPT: control_ping
        SCRIPT: get_first_msg_id lb_16c904aa
        SCRIPT: get_first_msg_id snat_aa4c5cd5
        SCRIPT: get_first_msg_id pot_e4aba035
        SCRIPT: get_first_msg_id ioam_trace_a2e66598
        SCRIPT: get_first_msg_id ioam_export_eb694f98
        SCRIPT: get_first_msg_id flowperpkt_789ffa7b
        SCRIPT: cli_request
        vl_api_memclnt_delete_t:
        index: 269
        handle: 0x305e16c0
        REBOOT THE VM RIGHT HERE
        Absolutely nothing to indicate that anything happened
        SCRIPT: memclnt_create name vpp_api_test # connect vpp_api_test again
        SCRIPT: sw_interface_dump name_filter Ether
        SCRIPT: sw_interface_dump name_filter lo
        SCRIPT: sw_interface_dump name_filter pg
        SCRIPT: sw_interface_dump name_filter vxlan_gpe
        SCRIPT: sw_interface_dump name_filter vxlan
        SCRIPT: sw_interface_dump name_filter host
        SCRIPT: sw_interface_dump name_filter l2tpv3_tunnel
        SCRIPT: sw_interface_dump name_filter gre
        SCRIPT: sw_interface_dump name_filter lisp_gpe
        SCRIPT: sw_interface_dump name_filter ipsec
        SCRIPT: control_ping
        SCRIPT: get_first_msg_id lb_16c904aa
        SCRIPT: get_first_msg_id snat_aa4c5cd5
        SCRIPT: get_first_msg_id pot_e4aba035
        SCRIPT: get_first_msg_id ioam_trace_a2e66598
        SCRIPT: get_first_msg_id ioam_export_eb694f98
        SCRIPT: get_first_msg_id flowperpkt_789ffa7b
        SCRIPT: cli_request
        vl_api_memclnt_delete_t:
        index: 269
        handle: 0x305e16c0
        DBGvpp#

Assignee: Unassigned
Reporter: John DeNisco (jdenisco)