Bug
Resolution: Open
Medium
Creating a VM on a hypervisor node (MHV) causes VPP to become unresponsive after a period of time. This was tested against 17.01-rc2~3_g2d7e163~b15.x86_64 in an OpenStack environment using the Neutron ML2 plugin for VPP (https://github.com/openstack/networking-vpp) and virtio for the NIC.
Reproduction steps:
1. Install VPP from the 17.01 stable repo, along with the networking-vpp agent (an install sketch follows the package listing below).
-bash-4.2# rpm -qa|grep vpp
vpp-devel-17.01-rc2~3_g2d7e163~b15.x86_64
vpp-17.01-rc2~3_g2d7e163~b15.x86_64
networking-vpp-0.0.1.dev121-2.noarch
vpp-plugins-17.01-rc2~3_g2d7e163~b15.x86_64
vpp-python-api-17.01-rc2~3_g2d7e163~b15.x86_64
vpp-lib-17.01-rc2~3_g2d7e163~b15.x86_64
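For anyone trying to reproduce, a minimal install sketch, assuming CentOS 7 with yum. The repo file name and baseurl below are placeholders and should be taken from the fd.io documentation for the 17.01 stable branch; the networking-vpp agent can equally be installed from the Git repo linked above.
# Placeholder repo definition; fill in the real 17.01 stable baseurl from the fd.io docs.
cat > /etc/yum.repos.d/fdio-1701.repo <<'EOF'
[fdio-1701]
name=fd.io 17.01 stable
baseurl=<fd.io 17.01 stable RPM repo URL>
enabled=1
gpgcheck=0
EOF
# Packages matching the rpm -qa listing above.
yum install -y vpp vpp-lib vpp-plugins vpp-python-api vpp-devel
# networking-vpp ML2 agent; pip is one option, the noarch RPM above is another.
pip install networking-vpp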
2. Create a network and a VM on that network.
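For reference, the network and VM can be created with standard OpenStack CLI commands along these lines. This is a sketch only; the network name, subnet range, image, and flavor are placeholders, not the exact values used in this report.
openstack network create net-vpp-test
openstack subnet create --network net-vpp-test --subnet-range 10.0.0.0/24 subnet-vpp-test
openstack server create --image <image> --flavor <flavor> \
    --nic net-id=$(openstack network show -f value -c id net-vpp-test) vm-vpp-test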
The namespace/port created on the MCP node:
-bash-4.2# vppctl show bridge-domain 4 detail
ID Index Learning U-Forwrd UU-Flood Flooding ARP-Term BVI-Intf
4 1 on on on on off N/A
Interface Index SHG BVI TxFlood VLAN-Tag-Rewrite
BondEthernet0.2066 4 0 - * pop-1
tap-0 5 0 - * none
The vhost-user interface created on the MHV node:
-bash-4.2# vppctl show vhost
Virtio vhost-user interfaces
Global:
coalesce frames 32 time 1e-3
Interface: VirtualEthernet0/0/0 (ifindex 4)
virtio_net_hdr_sz 12
features mask (0xffffffffffffffff):
features (0x50008000):
VIRTIO_NET_F_MRG_RXBUF (15)
VIRTIO_F_INDIRECT_DESC (28)
VHOST_USER_F_PROTOCOL_FEATURES (30)
protocol features (0x3)
VHOST_USER_PROTOCOL_F_MQ (0)
VHOST_USER_PROTOCOL_F_LOG_SHMFD (1)
socket filename /tmp/7853dd37-6796-476d-bb8b-ae2267acb59e type client errno "Success"
rx placement:
thread 0 on vring 1
tx placement: lock-free
thread 0 on vring 0
Memory regions (total 2)
region fd guest_phys_addr memory_size userspace_addr mmap_offset mmap_addr
====== ===== ================== ================== ================== ================== ==================
0 21 0x0000000000000000 0x00000000000a0000 0x00002aaaaac00000 0x0000000000000000 0x00002aaaaac00000
1 22 0x00000000000c0000 0x000000001ff40000 0x00002aaaaacc0000 0x00000000000c0000 0x00002aaaaaec0000
Virtqueue 0 (TX)
qsz 256 last_avail_idx 81 last_used_idx 81
avail.flags 0 avail.idx 256 used.flags 1 used.idx 81
kickfd 23 callfd 24 errfd -1
Virtqueue 1 (RX)
qsz 256 last_avail_idx 118 last_used_idx 118
avail.flags 1 avail.idx 118 used.flags 1 used.idx 118
kickfd 19 callfd 25 errfd -1
3. After a while (5-10 minutes), VPP stops responding on the MHV node (all VPP-related commands get no response). Syslog from around that time:
2017-01-10T17:39:10.406206+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25986: error : qemuMonitorIORead:586 : Unable to read from monitor: Connection reset by peer
2017-01-10T17:39:10.408508+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25986: error : qemuProcessReportLogError:1810 : internal error: qemu unexpectedly closed the monitor: QEMU waiting for connection on: disconnected:unix:/tmp/fdef6234-82ae-4fc5-94da-944ba5122b9e,server
2017-01-10T17:39:10.602566+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25991: error : qemuProcessReportLogError:1810 : internal error: process exited while connecting to monitor: QEMU waiting for connection on: disconnected:unix:/tmp/fdef6234-82ae-4fc5-94da-944ba5122b9e,server
2017-01-10T17:39:49.027951+00:00 mhv2.paslab015001.mc.metacloud.in vnet[168332]: unix_signal_handler:118: received signal SIGCONT, PC 0x7f56e7785590
2017-01-10T17:39:49.027979+00:00 mhv2.paslab015001.mc.metacloud.in vnet[168332]: received SIGTERM, exiting...
2017-01-10T17:39:49.139973+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Probing VFIO support...
2017-01-10T17:40:00.407293+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
2017-01-10T17:40:00.407587+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Initializing pmd_bond for eth_bond0
2017-01-10T17:40:00.407881+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Create bonded device eth_bond0 on port 0 in mode 1 on socket 0.
2017-01-10T17:40:00.408138+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: PCI device 0000:00:14.0 on NUMA socket -1
2017-01-10T17:40:00.408147+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: probe driver: 1af4:1000 net_virtio
2017-01-10T17:40:00.412447+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: PCI device 0000:00:15.0 on NUMA socket -1
2017-01-10T17:40:00.412457+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: probe driver: 1af4:1000 net_virtio
Further investigation shows the VPP process stuck in a FUTEX_WAIT:
-bash-4.2# strace -p 696966
Process 696966 attached
futex(0x305e7f6c, FUTEX_WAIT, 245, NULL
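To get more detail than strace, a backtrace of all threads can be captured with gdb while the process is hung. This is a sketch: the PID is the one from this node, and the matching vpp debug symbols are assumed to be installed so the frames resolve.
# Dump backtraces of every thread of the hung VPP process to a file.
gdb -batch -ex 'set pagination off' -ex 'thread apply all bt' -p 696966 \
    > /tmp/vpp-hang-backtrace.txt 2>&1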
This also happens on nodes running VPP that are not hosting VMs but are tapping into a Linux namespace for DHCP assignment ("MCP nodes"). On those nodes it is much more intermittent and has been much harder to reproduce; a simple responsiveness check such as the one sketched below can help catch it.
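Because the MCP-node variant is intermittent, a watchdog like the following (our own sketch, not part of networking-vpp; the log path is arbitrary) can be left running to timestamp when vppctl stops answering:
#!/bin/bash
# Hypothetical watchdog: record a timestamp whenever vppctl stops answering
# within 5 seconds, then keep polling every 30 seconds.
while true; do
    if ! timeout 5 vppctl show version > /dev/null 2>&1; then
        echo "$(date -Is) vppctl unresponsive on $(hostname)" >> /var/log/vpp-hang.log
    fi
    sleep 30
done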