VPP-1734

Worker handoff Queue congestion


    • Type: Bug
    • Resolution: Unresolved
    • Priority: High
    • Component: vlib

      Observed during performance testing with VPP 19.04.
      Worker handoff was enabled with 8 mixed RX/worker cores.
      Each worker handoff node routes traffic, based on source/destination IP address, to the ethernet-input node either on the local core or on one of the other worker cores.
      After some time under load, one or more worker cores stopped processing traffic and the worker handoff nodes began reporting congestion. This state is unrecoverable. I believe there is an issue with how the worker handoff queue is dequeued.

      In src/vlib/buffer_node, vlib_buffer_enqueue_to_thread:

      If the queue is not congested, the check_frame_queues flag of the vlib_main associated with the target thread's handoff queue is set to 1:

      vlib_mains[next_thread_index]->check_frame_queues = 1;

      In src/vlib/main.c, vlib_main_or_worker_loop:

      The queue is read as follows:

      if (!is_main)
        {
          vlib_worker_thread_barrier_check ();
          if (PREDICT_FALSE (vm->check_frame_queues +
                             frame_queue_check_counter))
            {
              u32 processed = 0;

              if (vm->check_frame_queues)
                {
                  frame_queue_check_counter = 100;
                  vm->check_frame_queues = 0;
                }

              vec_foreach (fqm, tm->frame_queue_mains)
                processed += vlib_frame_queue_dequeue (vm, fqm);

              /* No handoff queue work found? */
              if (processed)
                frame_queue_check_counter = 100;
              else
                frame_queue_check_counter--;
            }
          if (PREDICT_FALSE (vm->worker_thread_main_loop_callback != 0))
            ((void (*)(vlib_main_t *)) vm->worker_thread_main_loop_callback)
              (vm);
        }

      After 100 consecutive unsuccessful attempts to dequeue a frame, the worker backs off and stops polling the handoff queues until check_frame_queues is set to 1 again by the enqueue function in src/vlib/buffer_node.
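      The back-off described above can be sketched as a small standalone model. This is only an illustration of the logic quoted from vlib_main_or_worker_loop: model_worker_t, model_poll_once and model_polls_until_backoff are hypothetical names, not VPP types or functions, and the dequeue result is passed in as a parameter rather than read from a real queue.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two variables the report describes;
   this is not the real vlib_main_t. */
typedef struct
{
  uint32_t check_frame_queues;        /* set to 1 by the enqueue side */
  uint32_t frame_queue_check_counter; /* polls left before backing off */
} model_worker_t;

/* One pass through the polling decision. 'processed' is what the
   dequeue would return if the queues were polled on this pass.
   Returns 1 if the queues were polled, 0 if the worker skipped them. */
static int
model_poll_once (model_worker_t *w, uint32_t processed)
{
  /* Both zero: the worker has backed off and does not look at the queues. */
  if (w->check_frame_queues + w->frame_queue_check_counter == 0)
    return 0;

  if (w->check_frame_queues)
    {
      w->frame_queue_check_counter = 100;
      w->check_frame_queues = 0;
    }

  if (processed)
    w->frame_queue_check_counter = 100;
  else
    w->frame_queue_check_counter--;
  return 1;
}

/* Count consecutive empty polls until the worker stops polling. */
static uint32_t
model_polls_until_backoff (model_worker_t *w)
{
  uint32_t polls = 0;
  while (model_poll_once (w, 0))
    polls++;
  return polls;
}
```

      In this model, once both values reach 0 the worker never calls the dequeue again on its own, regardless of what is sitting in the queue; only a write of 1 to check_frame_queues restarts polling.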

      It is, however, possible for the queue to become congested while vm->check_frame_queues is 0 and 100 consecutive dequeue attempts have been unsuccessful. In this scenario the queue is congested with valid frames, but vm->check_frame_queues will never again be set to 1, because the enqueue function sets it only when the queue is not congested.

      I suspect this is because vlib_frame_queue_dequeue performs the following check and abandons its scan if elt->valid is false. If the valid flag of the element at the head of the queue is not set quickly, the queue can build up to a congested state while the dequeue function reads nothing from it.

      if (!elt->valid)
        {
          fq->head_hint = fq->head;
          return processed;
        }
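      To illustrate why an unset valid flag at the head matters, here is a simplified model of an in-order dequeue that stops at the first element that is not yet valid. It is a sketch under assumptions: model_fq_t, model_elt_t, MODEL_NELTS and model_dequeue are hypothetical, and the head/tail indexing is simplified compared with the real vlib_frame_queue_t.

```c
#include <stdint.h>

#define MODEL_NELTS 8 /* hypothetical ring size, must be a power of two */

typedef struct
{
  int valid; /* set by the producer once the element is fully written */
} model_elt_t;

typedef struct
{
  uint64_t head; /* next slot the consumer reads */
  uint64_t tail; /* next slot a producer claims */
  model_elt_t elts[MODEL_NELTS];
} model_fq_t;

/* Consume elements in order starting at head. Stop at the first element
   whose valid flag is not set, even if later slots are already valid. */
static uint32_t
model_dequeue (model_fq_t *fq)
{
  uint32_t processed = 0;

  while (fq->head != fq->tail)
    {
      model_elt_t *elt = &fq->elts[fq->head & (MODEL_NELTS - 1)];

      if (!elt->valid)
        return processed; /* abandon the scan, as in the quoted check */

      elt->valid = 0; /* hand the slot back to producers */
      fq->head++;
      processed++;
    }
  return processed;
}
```

      With every slot claimed and only the head element still unpublished, model_dequeue returns 0 on each call, so each poll counts as an unsuccessful attempt even though the ring is full.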

      A workaround was to add a line to vlib_buffer_enqueue_to_thread that sets check_frame_queues to 1 again when the queue is congested. This prompted dequeuing to resume.
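      The idea behind the workaround can be sketched as follows. This is not the actual patch: model_ring_t, model_vm_t, model_ring_congested and model_enqueue_to_thread are hypothetical, congestion is simplified to "ring full" (an assumption; the real check may use a different threshold), and the wake_on_congestion parameter exists only to compare the two behaviours side by side.

```c
#include <stdint.h>

/* Hypothetical, simplified handoff ring: only the indices matter here. */
typedef struct
{
  uint64_t head;  /* consumer position */
  uint64_t tail;  /* producer position */
  uint32_t nelts; /* ring capacity */
} model_ring_t;

/* Hypothetical stand-in for the target worker's vlib_main. */
typedef struct
{
  uint32_t check_frame_queues;
} model_vm_t;

/* Assumption: "congested" is modelled as "ring completely full". */
static int
model_ring_congested (const model_ring_t *r)
{
  return r->tail - r->head >= r->nelts;
}

/* Returns 1 if a slot was claimed, 0 if the frame was refused because
   the ring is congested. With wake_on_congestion set, the target worker
   is asked to poll again even though nothing was enqueued. */
static int
model_enqueue_to_thread (model_ring_t *r, model_vm_t *target_vm,
                         int wake_on_congestion)
{
  if (model_ring_congested (r))
    {
      if (wake_on_congestion)
        target_vm->check_frame_queues = 1; /* the workaround */
      return 0;
    }

  r->tail++;
  target_vm->check_frame_queues = 1; /* behaviour described in the report */
  return 1;
}
```

      Without the extra write, a congested ring and a backed-off worker leave check_frame_queues at 0 indefinitely in this model; with it, the next refused enqueue is enough to restart polling on the target worker.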

       

      Test Setup

      Steps to reproduce

      Difficult to reproduce without sufficient load; occurrence is unpredictable.

      Configure VPP to hand off traffic between 8 RX/worker cores:

      # set interface handoff TwentyFiveGigabitEthernet86/0/0 workers 0 1 2 3 4 5 6 7

            Assignee: Unassigned
            Reporter: Hongjun Ni