Type: Bug
Resolution: Unresolved
Priority: High
Observed during performance testing with VPP 19.04.
Worker handoff was enabled with 8 mixed RX/worker cores.
Each worker handoff node routes traffic, based on src/dst IP address, to the ethernet-input node on either the local core or one of the other worker cores.
After some time under load, one or more worker cores stopped processing traffic and the worker handoff nodes began reporting congestion. This state is unrecoverable. I believe there is an issue with how the worker handoff queue is dequeued.
In src/vlib/buffer_node, vlib_buffer_enqueue_to_thread:
If the queue is not congested, the check_frame_queues flag of the vlib_main associated with the target thread's handoff queue is set to 1:
    vlib_mains[next_thread_index]->check_frame_queues = 1;
In src/vlib/main.c, vlib_main_or_worker_loop:
The queue is read as follows:
    if (!is_main)
      {
        vlib_worker_thread_barrier_check ();
        if (PREDICT_FALSE (vm->check_frame_queues +
                           frame_queue_check_counter))
          {
            u32 processed = 0;

            if (vm->check_frame_queues)
              {
                frame_queue_check_counter = 100;
                vm->check_frame_queues = 0;
              }

            vec_foreach (fqm, tm->frame_queue_mains)
              processed += vlib_frame_queue_dequeue (vm, fqm);

            /* No handoff queue work found? */
            if (processed)
              frame_queue_check_counter = 100;
            else
              frame_queue_check_counter--;
          }
        if (PREDICT_FALSE (vm->worker_thread_main_loop_callback != 0))
          ((void (*)(vlib_main_t *)) vm->worker_thread_main_loop_callback)
            (vm);
      }
After 100 consecutive unsuccessful attempts to dequeue a frame, the loop backs off and stops polling until check_frame_queues is set to 1 again by the enqueue function in src/vlib/buffer_node.
It is, however, possible that the queue becomes congested, vm->check_frame_queues is cleared to 0, and 100 unsuccessful dequeue attempts follow. In this scenario the queue is congested with valid frames, but vm->check_frame_queues will never again be set to 1, because the enqueue side only sets it when the queue is not congested.
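To make the back-off easier to reason about, here is a minimal standalone sketch of the polling decision in the worker loop. It is a model for illustration only, not VPP code: poll_state_t, poll_once and polls_until_backoff are hypothetical names, and the "available" argument stands in for whatever vlib_frame_queue_dequeue would have returned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two values the worker loop consults. */
typedef struct
{
  uint32_t check_frame_queues;        /* set to 1 by the enqueue side */
  uint32_t frame_queue_check_counter; /* polls left before backing off */
} poll_state_t;

/* One loop iteration. 'available' models the number of frames a dequeue
   would return if it ran. Returns 1 if the queues were polled, else 0. */
static int
poll_once (poll_state_t *s, uint32_t available)
{
  assert (s != NULL);

  /* Same gate as the loop above: flag and counter both zero means the
     handoff queues are not looked at during this iteration. */
  if (s->check_frame_queues + s->frame_queue_check_counter == 0)
    return 0;

  if (s->check_frame_queues)
    {
      s->frame_queue_check_counter = 100;
      s->check_frame_queues = 0;
    }

  if (available)
    s->frame_queue_check_counter = 100;
  else
    s->frame_queue_check_counter--;

  return 1;
}

/* Count consecutive polls that find nothing before the loop backs off. */
static uint32_t
polls_until_backoff (poll_state_t *s)
{
  uint32_t n = 0;

  while (poll_once (s, 0))
    n++;
  return n;
}
```

Starting from check_frame_queues = 1, polls_until_backoff counts 100 polls. After that, poll_once returns 0 even when called with a non-zero "available", so in this model only the enqueue side setting the flag can restart polling.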
I suspect this is because vlib_frame_queue_dequeue abandons its scan if elt->valid is false. If the valid flag of the element at the head of the queue is not set quickly enough, the queue can build up to a congested state while the dequeue function reads nothing from it. The check is:
    if (!elt->valid)
      {
        fq->head_hint = fq->head;
        return processed;
      }
A workaround was to add a line to vlib_buffer_enqueue_to_thread that sets check_frame_queues to 1 even when the queue is congested. This prompted dequeuing to resume.
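The suspected sequence and the effect of the workaround can be sketched with a small standalone model. This is not VPP code and only reflects my reading of the behaviour described in this report: the ring size, the congestion threshold, and all names (model_t, model_enqueue, model_dequeue, model_worker_iteration, run_scenario) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define Q_SIZE 8       /* hypothetical ring size */
#define Q_HIGH_WATER 6 /* hypothetical congestion threshold */

typedef struct
{
  uint8_t valid[Q_SIZE]; /* per-slot "contents published" flag */
  uint32_t head, tail;   /* consumer and producer positions */
  uint32_t check_frame_queues;
  uint32_t check_counter;
} model_t;

/* Producer: claim a slot and publish it unless 'delay_publish' is set.
   Returns the slot index, or -1 if the frame is refused for congestion. */
static int
model_enqueue (model_t *m, int delay_publish, int workaround)
{
  uint32_t slot;

  if (m->tail - m->head >= Q_HIGH_WATER)
    {
      if (workaround)
        m->check_frame_queues = 1; /* models the added workaround line */
      return -1;
    }
  slot = m->tail++ % Q_SIZE;
  m->valid[slot] = (uint8_t) !delay_publish;
  m->check_frame_queues = 1;
  return (int) slot;
}

/* Consumer scan: gives up at the first slot that is not yet valid. */
static uint32_t
model_dequeue (model_t *m)
{
  uint32_t processed = 0;

  while (m->head != m->tail && m->valid[m->head % Q_SIZE])
    {
      m->valid[m->head++ % Q_SIZE] = 0;
      processed++;
    }
  return processed;
}

/* One worker loop iteration with the back-off counter. */
static uint32_t
model_worker_iteration (model_t *m)
{
  uint32_t processed;

  if (m->check_frame_queues + m->check_counter == 0)
    return 0; /* backed off: queue is not polled */
  if (m->check_frame_queues)
    {
      m->check_counter = 100;
      m->check_frame_queues = 0;
    }
  processed = model_dequeue (m);
  if (processed)
    m->check_counter = 100;
  else
    m->check_counter--;
  return processed;
}

/* Returns the number of frames drained over the whole scenario. */
static uint32_t
run_scenario (int workaround)
{
  model_t m;
  uint32_t i, drained = 0;
  int slow;

  memset (&m, 0, sizeof (m));
  slow = model_enqueue (&m, 1, workaround); /* head claimed, not yet valid */
  assert (slow >= 0);
  while (model_enqueue (&m, 0, workaround) >= 0)
    ; /* other frames pile up behind it until the queue is congested */
  for (i = 0; i < 100; i++) /* worker sees an invalid head 100 times */
    drained += model_worker_iteration (&m);
  m.valid[slow] = 1;                 /* head is finally published */
  model_enqueue (&m, 0, workaround); /* next enqueue finds congestion */
  for (i = 0; i < 10; i++)
    drained += model_worker_iteration (&m);
  return drained;
}
```

In this model, run_scenario (0) drains nothing because the worker has backed off and no enqueue sets the flag again, while run_scenario (1) drains the six queued frames once the congested enqueue sets check_frame_queues.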
Test Setup
Steps to reproduce
Difficult to reproduce without sufficient load; the failure is unpredictable.
Configure VPP to hand off traffic between 8 RX/worker cores:
# set interface handoff TwentyFiveGigabitEthernet86/0/0 workers 0 1 2 3 4 5 6 7