vpp / VPP-593

VPP becomes intermittently unresponsive on nodes hosting VMs


    • Type: Bug
    • Resolution: Open
    • Priority: Medium

      Creating a VM on a hypervisor node (MHV) causes VPP to become unresponsive after a period of time. This was tested against 17.01-rc2~3_g2d7e163~b15.x86_64 in an OpenStack environment, using the networking-vpp Neutron ML2 plugin (https://github.com/openstack/networking-vpp) and virtio for the NIC.

      Reproduction steps:

      1. Install VPP from the 17.01 stable repo, along with the networking-vpp agent:

      -bash-4.2# rpm -qa|grep vpp
      vpp-devel-17.01-rc2~3_g2d7e163~b15.x86_64
      vpp-17.01-rc2~3_g2d7e163~b15.x86_64
      networking-vpp-0.0.1.dev121-2.noarch
      vpp-plugins-17.01-rc2~3_g2d7e163~b15.x86_64
      vpp-python-api-17.01-rc2~3_g2d7e163~b15.x86_64
      vpp-lib-17.01-rc2~3_g2d7e163~b15.x86_64
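
      For reference, the repo setup behind the package list above would look roughly like the following. This is a sketch: the repo file name and baseurl are assumptions based on fd.io's Nexus layout for the stable/1701 branch and may differ in your environment.

      ```shell
      # Hypothetical repo definition for the fd.io stable/1701 packages; the
      # baseurl is an assumption and may not match your mirror.
      cat > /etc/yum.repos.d/fdio-1701.repo <<'EOF'
      [fdio-1701]
      name=fd.io stable/1701 branch
      baseurl=https://nexus.fd.io/content/repositories/fd.io.stable.1701.centos7/
      enabled=1
      gpgcheck=0
      EOF

      # Then install the packages shown in the rpm listing above:
      # yum install -y vpp vpp-plugins vpp-python-api vpp-lib networking-vpp
      ```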

      2. Create a network and a VM on that network.

      The namespace/port created on the MCP node:
      -bash-4.2# vppctl show bridge-domain 4 detail
      ID Index Learning U-Forwrd UU-Flood Flooding ARP-Term BVI-Intf
      4 1 on on on on off N/A

      Interface Index SHG BVI TxFlood VLAN-Tag-Rewrite
      BondEthernet0.2066 4 0 - * pop-1
      tap-0 5 0 - * none

      The vhost interface created on the MHV node:
      -bash-4.2# vppctl show vhost
      Virtio vhost-user interfaces
      Global:
      coalesce frames 32 time 1e-3
      Interface: VirtualEthernet0/0/0 (ifindex 4)
      virtio_net_hdr_sz 12
      features mask (0xffffffffffffffff):
      features (0x50008000):
      VIRTIO_NET_F_MRG_RXBUF (15)
      VIRTIO_F_INDIRECT_DESC (28)
      VHOST_USER_F_PROTOCOL_FEATURES (30)
      protocol features (0x3)
      VHOST_USER_PROTOCOL_F_MQ (0)
      VHOST_USER_PROTOCOL_F_LOG_SHMFD (1)

      socket filename /tmp/7853dd37-6796-476d-bb8b-ae2267acb59e type client errno "Success"

      rx placement:
      thread 0 on vring 1
      tx placement: lock-free
      thread 0 on vring 0

      Memory regions (total 2)
      region fd guest_phys_addr memory_size userspace_addr mmap_offset mmap_addr
      ====== ===== ================== ================== ================== ================== ==================
      0 21 0x0000000000000000 0x00000000000a0000 0x00002aaaaac00000 0x0000000000000000 0x00002aaaaac00000
      1 22 0x00000000000c0000 0x000000001ff40000 0x00002aaaaacc0000 0x00000000000c0000 0x00002aaaaaec0000

      Virtqueue 0 (TX)
      qsz 256 last_avail_idx 81 last_used_idx 81
      avail.flags 0 avail.idx 256 used.flags 1 used.idx 81
      kickfd 23 callfd 24 errfd -1

      Virtqueue 1 (RX)
      qsz 256 last_avail_idx 118 last_used_idx 118
      avail.flags 1 avail.idx 118 used.flags 1 used.idx 118
      kickfd 19 callfd 25 errfd -1

      3. After a while (5-10 minutes), VPP stops responding on the MHV node (all VPP-related commands hang):

      2017-01-10T17:39:10.406206+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25986: error : qemuMonitorIORead:586 : Unable to read from monitor: Connection reset by peer
      2017-01-10T17:39:10.408508+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25986: error : qemuProcessReportLogError:1810 : internal error: qemu unexpectedly closed the monitor: QEMU waiting for connection on: disconnected:unix:/tmp/fdef6234-82ae-4fc5-94da-944ba5122b9e,server
      2017-01-10T17:39:10.602566+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25991: error : qemuProcessReportLogError:1810 : internal error: process exited while connecting to monitor: QEMU waiting for connection on: disconnected:unix:/tmp/fdef6234-82ae-4fc5-94da-944ba5122b9e,server
      2017-01-10T17:39:49.027951+00:00 mhv2.paslab015001.mc.metacloud.in vnet[168332]: unix_signal_handler:118: received signal SIGCONT, PC 0x7f56e7785590
      2017-01-10T17:39:49.027979+00:00 mhv2.paslab015001.mc.metacloud.in vnet[168332]: received SIGTERM, exiting...
      2017-01-10T17:39:49.139973+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Probing VFIO support...
      2017-01-10T17:40:00.407293+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
      2017-01-10T17:40:00.407587+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Initializing pmd_bond for eth_bond0
      2017-01-10T17:40:00.407881+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Create bonded device eth_bond0 on port 0 in mode 1 on socket 0.
      2017-01-10T17:40:00.408138+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: PCI device 0000:00:14.0 on NUMA socket -1
      2017-01-10T17:40:00.408147+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: probe driver: 1af4:1000 net_virtio
      2017-01-10T17:40:00.412447+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: PCI device 0000:00:15.0 on NUMA socket -1
      2017-01-10T17:40:00.412457+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: probe driver: 1af4:1000 net_virtio

      Further investigation shows the process stuck in a FUTEX_WAIT:

      -bash-4.2# strace -p 696966
      Process 696966 attached
      futex(0x305e7f6c, FUTEX_WAIT, 245, NULL
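
      The strace result can be corroborated without attaching a debugger: on Linux, /proc exposes the syscall each thread is currently blocked in. A small sketch (the `thread_syscalls` helper is hypothetical, not part of VPP tooling; the PID is the one from the strace above, so substitute the live vpp PID):

      ```shell
      # Print "<tid>: <syscall number>" for every thread of the given PID,
      # read from /proc/<pid>/task/<tid>/syscall. On x86_64, futex is
      # syscall 202, so hung threads should report 202 here.
      thread_syscalls() {
        local pid=$1 t
        for t in /proc/"$pid"/task/*; do
          echo "${t##*/}: $(awk '{print $1}' "$t"/syscall 2>/dev/null)"
        done
      }

      thread_syscalls 696966
      ```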

      This also happens on nodes running VPP that are not hosting VMs but are tapping into a Linux namespace for DHCP assignment ("MCP nodes"). On those nodes it occurs far more intermittently and has been much more difficult to reproduce.
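
      Because the hang takes 5-10 minutes to appear, a timeout-based probe is a convenient way to catch the transition. A minimal watchdog sketch, assuming `vppctl show version` as the probe (the `check_responsive` helper is hypothetical; any cheap vppctl command would do):

      ```shell
      # Run a probe command under a time limit and report whether it returned.
      # Note: a probe that exits non-zero is also reported as unresponsive,
      # which is acceptable for a coarse hang detector.
      check_responsive() {
        local limit=$1; shift   # $1 = time limit in seconds, rest = probe
        if timeout "$limit" "$@" >/dev/null 2>&1; then
          echo responsive
        else
          echo unresponsive
        fi
      }

      # Example (hypothetical): poll every 30s after creating the VM
      # while sleep 30; do date; check_responsive 5 vppctl show version; done
      ```

      `timeout` exits with status 124 when the probe exceeds the limit, which is what distinguishes the FUTEX_WAIT hang from a slow but live vppctl.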

            Assignee: Ole Trøan (otroan)
            Reporter: Kevin Bringard (kevinbringard)
