- Bug
- Resolution: Open
- Medium
Summary:
VPP crashes in vhost_user_if_input. This happens when using multiple worker threads and multiple queues and rebooting the guest VM while traffic is flowing.
The backtrace is as follows:
(gdb) bt
#0 0x00007ffff70671d0 in vhost_user_if_input (vm=0x7fffb6d64648, vum=0x7ffff747ee20 <vhost_user_main>, vui=0x7fffb6ec75cc, node=0x7fffb6e06cb8) at /scratch/myciscoatt/src/vpp/build-data/../vnet/vnet/devices/virtio/vhost-user.c:1150
#1 0x00007ffff7067df5 in vhost_user_input (vm=0x7fffb6d64648, node=0x7fffb6e06cb8, f=0x0) at /scratch/myciscoatt/src/vpp/build-data/../vnet/vnet/devices/virtio/vhost-user.c:1361
#2 0x00007ffff74d4fff in dispatch_node (vm=0x7fffb6d64648, node=0x7fffb6e06cb8, type=VLIB_NODE_TYPE_INPUT, dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x0, last_time_stamp=116904157643429) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/main.c:996
#3 0x00007ffff751a2e1 in vlib_worker_thread_internal (vm=0x7fffb6d64648) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/threads.c:1389
#4 0x00007ffff751a5c3 in vlib_worker_thread_fn (arg=0x7fffb5427c70) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/threads.c:1455
#5 0x00007ffff62b0314 in clib_calljmp () at /scratch/myciscoatt/src/vpp/build-data/../vppinfra/vppinfra/longjmp.S:110
#6 0x00007fff768bcc00 in ?? ()
#7 0x00007ffff7515a44 in vlib_worker_thread_bootstrap_fn (arg=0x7fffb5427c70) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/threads.c:516
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) p/x txvq
$10 = 0x7fffb6ec7954
(gdb) p/x *txvq
$11 =
The pointers look good when examined in gdb, so this points to a race condition: the shared memory backing the vhost rings may be briefly unavailable while the guest is rebooting.
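The awkward part is that the mapping looks valid again by the time gdb inspects it. As a purely hypothetical diagnostic (not something the report proposes, and not VPP code), one could probe whether the page backing the pointer is actually mapped at the moment of access, for example with mincore(2), which fails with ENOMEM for an unmapped range:

/* Hypothetical diagnostic sketch, not VPP code: check whether the page
   backing a vhost shared-memory pointer is currently mapped in this
   process before dereferencing it. */
#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static int
page_is_mapped (const void *addr)
{
  unsigned char vec;
  long page_size = sysconf (_SC_PAGESIZE);
  void *page = (void *) ((uintptr_t) addr & ~((uintptr_t) page_size - 1));

  if (mincore (page, (size_t) page_size, &vec) == 0)
    return 1;                        /* range is mapped */
  return (errno == ENOMEM) ? 0 : -1; /* 0 = unmapped, -1 = other error */
}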
Dave Barach looked at this and here is his summary:
Guys,
John D. asked me to take a look at a multiple-worker, multiple-queue vhost_user crash scenario. After some fiddling, I found a scenario that’s 100% reproducible. With vpp provisioned by the ML2 plugin [or whatever calls itself “test_papi”], ssh into the compute vm and type “sudo /sbin/reboot”.
This scenario causes a mild vhost_user shared-memory earthquake with traffic flowing.
One of the worker threads will receive SIGSEGV, right here:
/* vhost_user_if_input, at or near line 1142 */
u32 next_desc =
txvq->avail->ring[(txvq->last_avail_idx + 1) & qsz_mask];
By the time one can look at the memory reference in gdb, the memory is accessible. My guess: qemu briefly changes protections on the vhost_user shared-memory segment, yadda yadda yadda.
This scenario never causes an issue when running single-queue, single-core.
An API trace - see below - indicates that vpp receives no notification of any kind. There isn't a hell of a lot that the vhost_user driver can do to protect itself.
Time for someone to stare at the qemu code, I guess...
HTH… Dave
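For reference, here is a minimal, self-contained sketch of the access at the crash site, using the standard virtio available-ring layout (simplified stand-in types, not the actual VPP structures). The ring lives in guest memory that qemu shares with vpp, so if qemu unmaps or re-protects that segment during the reboot, this read is where the SIGSEGV lands:

#include <stdint.h>

/* Standard virtio "available" ring layout; this memory belongs to the
   guest-shared region mapped from qemu, not to vpp's own heap. */
struct vring_avail
{
  uint16_t flags;
  uint16_t idx;
  uint16_t ring[];   /* queue_size entries */
};

/* Sketch of the read at or near vhost-user.c:1142.  qsz_mask is
   queue_size - 1 (queue sizes are powers of two), so the index wraps
   within the ring.  Dereferencing avail->ring[] faults if the shared
   mapping has gone away underneath the worker thread. */
static inline uint16_t
next_avail_desc (struct vring_avail *avail,
                 uint16_t last_avail_idx, uint16_t qsz_mask)
{
  return avail->ring[(last_avail_idx + 1) & qsz_mask];
}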
Output of "api trace custom-dump /tmp/twoboot":
SCRIPT: memclnt_create name test_papi
SCRIPT: sw_interface_dump all
SCRIPT: control_ping
SCRIPT: sw_interface_dump all
SCRIPT: control_ping
SCRIPT: sw_interface_set_flags sw_if_index 1 admin-up link-up
SCRIPT: bridge_domain_add_del bd_id 5678 flood 1 uu-flood 1 forward 1 learn 1 arp-term 0
SCRIPT: sw_interface_set_l2_bridge sw_if_index 1 bd_id 5678 shg 0 enable
SCRIPT: tap_connect tapname vppef940067-0b mac fa:16:3e:6e:22:41
SCRIPT: sw_interface_set_flags sw_if_index 4 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 4 bd_id 5678 shg 0 enable
SCRIPT: sw_interface_dump all
SCRIPT: control_ping
SCRIPT: sw_interface_dump all
SCRIPT: control_ping
SCRIPT: sw_interface_set_flags sw_if_index 3 admin-up link-up
SCRIPT: bridge_domain_add_del bd_id 5679 flood 1 uu-flood 1 forward 1 learn 1 arp-term 0
SCRIPT: sw_interface_set_l2_bridge sw_if_index 3 bd_id 5679 shg 0 enable
SCRIPT: create_vhost_user_if socket /tmp/52970d78-dad3-4887-b4bf-df90d3e13602
SCRIPT: sw_interface_set_flags sw_if_index 5 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 5 bd_id 5679 shg 0 enable
SCRIPT: create_vhost_user_if socket /tmp/92473e06-ea98-4b4f-80df-c9bb702c3885
SCRIPT: sw_interface_set_flags sw_if_index 6 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 6 bd_id 5678 shg 0 enable
SCRIPT: sw_interface_dump all
SCRIPT: control_ping
SCRIPT: sw_interface_dump all
SCRIPT: control_ping
SCRIPT: sw_interface_set_flags sw_if_index 2 admin-up link-up
SCRIPT: bridge_domain_add_del bd_id 5680 flood 1 uu-flood 1 forward 1 learn 1 arp-term 0
SCRIPT: sw_interface_set_l2_bridge sw_if_index 2 bd_id 5680 shg 0 enable
SCRIPT: create_vhost_user_if socket /tmp/e2261ff9-4953-4368-a8c9-8005ccf0e896
SCRIPT: sw_interface_set_flags sw_if_index 7 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 7 bd_id 5680 shg 0 enable
SCRIPT: create_vhost_user_if socket /tmp/b5d9c5f0-0494-4bd0-bb28-437f5261fad5
SCRIPT: sw_interface_set_flags sw_if_index 8 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 8 bd_id 5679 shg 0 enable
SCRIPT: tap_connect tapname vppb7464b44-11 mac fa:16:3e:66:31:79
SCRIPT: sw_interface_set_flags sw_if_index 9 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 9 bd_id 5680 shg 0 enable
SCRIPT: tap_connect tapname vppab16509a-c5 mac fa:16:3e:c2:9f:ac
SCRIPT: sw_interface_set_flags sw_if_index 10 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 10 bd_id 5679 shg 0 enable
SCRIPT: create_vhost_user_if socket /tmp/783d34a8-3e72-4434-97cf-80c7e199e66c
SCRIPT: sw_interface_set_flags sw_if_index 11 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 11 bd_id 5678 shg 0 enable
SCRIPT: create_vhost_user_if socket /tmp/67a02881-e241-4ae4-abb4-dfa03e951772
SCRIPT: sw_interface_set_flags sw_if_index 12 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 12 bd_id 5680 shg 0 enable
SCRIPT: memclnt_create name vpp_api_test # connect vpp_api_test prior to rebooting vm, as described
SCRIPT: sw_interface_dump name_filter Ether
SCRIPT: sw_interface_dump name_filter lo
SCRIPT: sw_interface_dump name_filter pg
SCRIPT: sw_interface_dump name_filter vxlan_gpe
SCRIPT: sw_interface_dump name_filter vxlan
SCRIPT: sw_interface_dump name_filter host
SCRIPT: sw_interface_dump name_filter l2tpv3_tunnel
SCRIPT: sw_interface_dump name_filter gre
SCRIPT: sw_interface_dump name_filter lisp_gpe
SCRIPT: sw_interface_dump name_filter ipsec
SCRIPT: control_ping
SCRIPT: get_first_msg_id lb_16c904aa
SCRIPT: get_first_msg_id snat_aa4c5cd5
SCRIPT: get_first_msg_id pot_e4aba035
SCRIPT: get_first_msg_id ioam_trace_a2e66598
SCRIPT: get_first_msg_id ioam_export_eb694f98
SCRIPT: get_first_msg_id flowperpkt_789ffa7b
SCRIPT: cli_request
vl_api_memclnt_delete_t:
index: 269
handle: 0x305e16c0
REBOOT THE VM RIGHT HERE
Absolutely nothing to indicate that anything happened
SCRIPT: memclnt_create name vpp_api_test # connect vpp_api_test again
SCRIPT: sw_interface_dump name_filter Ether
SCRIPT: sw_interface_dump name_filter lo
SCRIPT: sw_interface_dump name_filter pg
SCRIPT: sw_interface_dump name_filter vxlan_gpe
SCRIPT: sw_interface_dump name_filter vxlan
SCRIPT: sw_interface_dump name_filter host
SCRIPT: sw_interface_dump name_filter l2tpv3_tunnel
SCRIPT: sw_interface_dump name_filter gre
SCRIPT: sw_interface_dump name_filter lisp_gpe
SCRIPT: sw_interface_dump name_filter ipsec
SCRIPT: control_ping
SCRIPT: get_first_msg_id lb_16c904aa
SCRIPT: get_first_msg_id snat_aa4c5cd5
SCRIPT: get_first_msg_id pot_e4aba035
SCRIPT: get_first_msg_id ioam_trace_a2e66598
SCRIPT: get_first_msg_id ioam_export_eb694f98
SCRIPT: get_first_msg_id flowperpkt_789ffa7b
SCRIPT: cli_request
vl_api_memclnt_delete_t:
index: 269
handle: 0x305e16c0
DBGvpp#