Type: Bug
Resolution: Unresolved
Priority: High
Fix Version/s: None
Affects Version/s: None
Component/s: vlib
Labels: None

Observed during performance testing with VPP 19.04.
Worker handoff was enabled with 8 mixed RX/worker cores.
Each worker handoff node routes traffic, based on source/destination IP address, to the ethernet-input node either on the local core or on one of the other worker cores.
After some time under load, one or more worker cores stopped processing traffic and the worker handoff nodes began reporting congestion. This state is unrecoverable. I believe there is an issue with how the worker handoff queue is dequeued.

In src/vlib/buffer_node:vlib_buffer_enqueue_to_thread

If the queue is not congested, the check_frame_queues flag of the vlib_main associated with the target thread's handoff queue is set to 1:

vlib_mains[next_thread_index]->check_frame_queues = 1;

In src/vlib/main.c vlib_main_or_worker_loop

The queue is read as follows:

if (!is_main)
  {
    vlib_worker_thread_barrier_check ();
    if (PREDICT_FALSE (vm->check_frame_queues +
                       frame_queue_check_counter))
      {
        u32 processed = 0;

        if (vm->check_frame_queues)
          {
            frame_queue_check_counter = 100;
            vm->check_frame_queues = 0;
          }

        vec_foreach (fqm, tm->frame_queue_mains)
          processed += vlib_frame_queue_dequeue (vm, fqm);

        /* No handoff queue work found? */
        if (processed)
          frame_queue_check_counter = 100;
        else
          frame_queue_check_counter--;
      }
    if (PREDICT_FALSE (vm->worker_thread_main_loop_callback != 0))
      ((void (*)(vlib_main_t *)) vm->worker_thread_main_loop_callback)
        (vm);
  }

After 100 consecutive unsuccessful attempts to dequeue a frame, a back-off mechanism stops polling the queue until check_frame_queues is set to 1 again by the enqueue function in src/vlib/buffer_node.
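The back-off described above can be sketched in isolation. This is a simplified stand-alone model, not VPP code: toy_gate_t, toy_gate_step and toy_polls_after_one_hint are hypothetical names, and the only values taken from the report are the two fields and the reload value of 100.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RELOAD 100 /* counter reload value quoted in the report */

/* Hypothetical stand-in for the two fields the worker loop consults. */
typedef struct {
  int check_frame_queues;        /* hint written by the enqueue side */
  int frame_queue_check_counter; /* polls left before the worker backs off */
} toy_gate_t;

/* One pass through the gate. 'processed' is the number of frames a
 * dequeue would have returned had it run. Returns 1 if the queues were
 * polled on this pass, 0 if the poll was skipped. */
int toy_gate_step(toy_gate_t *g, unsigned processed)
{
  assert(g != NULL);
  if (!(g->check_frame_queues + g->frame_queue_check_counter))
    return 0; /* backed off: nothing is polled */
  if (g->check_frame_queues) {
    g->frame_queue_check_counter = TOY_RELOAD;
    g->check_frame_queues = 0;
  }
  if (processed)
    g->frame_queue_check_counter = TOY_RELOAD;
  else
    g->frame_queue_check_counter--;
  return 1;
}

/* Count how many polls happen over 'iterations' empty passes that
 * follow a single hint from the enqueue side. */
int toy_polls_after_one_hint(int iterations)
{
  toy_gate_t g = { 1, 0 };
  int polls = 0;
  for (int i = 0; i < iterations; i++)
    polls += toy_gate_step(&g, 0);
  return polls;
}
```

In this model a single hint buys exactly 100 empty polls; every later pass skips the queues until the flag is written again.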

It is, however, possible for the queue to become congested while vm->check_frame_queues is 0 and 100 consecutive dequeue attempts have failed. In this scenario the queue is congested with valid frames, but vm->check_frame_queues will never be set to 1 again, because the enqueue function only sets it when the queue is not congested.

I suspect this happens because vlib_frame_queue_dequeue performs the following check and abandons its scan if elt->valid is false. If the valid flag of the element at the head of the queue is not set quickly enough, the queue can build up to a congested state while the dequeue function reads nothing from it.

if (!elt->valid)
  {
    fq->head_hint = fq->head;
    return processed;
  }
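The suspected sequence can be reproduced in a single-threaded toy model. This is a deliberately exaggerated sketch under stated assumptions, not the VPP implementation: the ring has 4 slots, producers claim a slot and publish it in two separate steps, and every toy_* name is hypothetical. It only illustrates how a congested ring full of valid frames can coexist with a worker that has stopped polling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_NELTS 4     /* hypothetical ring size, far smaller than a real queue */
#define TOY_BACKOFF 100 /* counter reload value quoted in the report */

typedef struct {
  bool valid[TOY_NELTS];  /* per-slot "frame is ready" flag */
  uint32_t head;          /* next slot the worker reads */
  uint32_t tail;          /* next slot a producer claims */
  int check_frame_queues; /* hint written by the enqueue side */
  int counter;            /* polls left before the worker backs off */
} toy_worker_t;

void toy_init(toy_worker_t *w) { memset(w, 0, sizeof(*w)); }

/* Producer step 1: claim a slot. The hint is written only when the ring
 * is not congested. Returns the slot index, or -1 when congested. */
int toy_claim(toy_worker_t *w)
{
  if (w->tail - w->head >= TOY_NELTS)
    return -1; /* congested: the hint is NOT written */
  w->check_frame_queues = 1;
  return (int)(w->tail++ % TOY_NELTS);
}

/* Producer step 2: mark a previously claimed slot as ready. */
void toy_publish(toy_worker_t *w, int slot)
{
  assert(slot >= 0 && slot < TOY_NELTS);
  w->valid[slot] = true;
}

/* Dequeue: abandon the scan at the first slot that is not yet valid. */
uint32_t toy_dequeue(toy_worker_t *w)
{
  uint32_t processed = 0;
  while (w->head != w->tail) {
    uint32_t slot = w->head % TOY_NELTS;
    if (!w->valid[slot])
      return processed;
    w->valid[slot] = false;
    w->head++;
    processed++;
  }
  return processed;
}

/* One worker loop iteration, following the gate logic in the report. */
uint32_t toy_worker_iteration(toy_worker_t *w)
{
  uint32_t processed = 0;
  if (w->check_frame_queues + w->counter) {
    if (w->check_frame_queues) {
      w->counter = TOY_BACKOFF;
      w->check_frame_queues = 0;
    }
    processed = toy_dequeue(w);
    if (processed)
      w->counter = TOY_BACKOFF;
    else
      w->counter--;
  }
  return processed;
}

/* Returns the number of frames drained after the ring fills; 0 means
 * the worker is stuck even though every slot holds a valid frame. */
uint32_t toy_stall_scenario(void)
{
  toy_worker_t w;
  int slots[TOY_NELTS];
  uint32_t drained = 0;

  toy_init(&w);
  for (int i = 0; i < TOY_NELTS; i++)   /* fill the ring, nothing valid yet */
    slots[i] = toy_claim(&w);
  for (int i = 0; i < TOY_BACKOFF; i++) /* worker uses up its 100 polls */
    toy_worker_iteration(&w);
  for (int i = 0; i < TOY_NELTS; i++)   /* frames become valid too late */
    toy_publish(&w, slots[i]);
  (void) toy_claim(&w);                 /* congested, so no hint is written */
  for (int i = 0; i < 1000; i++)
    drained += toy_worker_iteration(&w);
  return drained;
}
```

In a real system the window is far narrower, since the head element only has to stay invalid long enough for the counter to run out while the ring fills behind it, which would be consistent with the problem appearing only under sustained load.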

A workaround was to add a line to vlib_buffer_enqueue_to_thread that sets check_frame_queues to 1 again when the queue is congested. This prompted dequeuing to resume.
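The shape of that workaround can be sketched as follows. This is a hypothetical stand-alone model rather than the actual patch: toy_vm_t and toy_enqueue_hint are made-up names, the kick_when_congested switch exists only to compare the two behaviours, and treating a congested enqueue as "buffer not enqueued" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the target thread's vlib_main. */
typedef struct {
  int check_frame_queues;
} toy_vm_t;

/* Model of the hint handling on the enqueue side.
 * Returns true if the buffer would be enqueued, false if the queue is
 * congested (assumed here to mean the buffer is not enqueued). */
bool toy_enqueue_hint(toy_vm_t *target, bool congested,
                      bool kick_when_congested)
{
  assert(target != NULL);
  if (congested) {
    /* Workaround: wake the consumer even though nothing was enqueued,
     * so a backed-off worker resumes polling a full queue. */
    if (kick_when_congested)
      target->check_frame_queues = 1;
    return false;
  }
  target->check_frame_queues = 1; /* original behaviour */
  return true;
}
```

With kick_when_congested set to false the function follows the behaviour described in the report, where a congested queue leaves the flag untouched; with it set to true every enqueue attempt re-arms the worker's poll.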

Test Setup

Steps to reproduce

Difficult to reproduce without sufficient load; occurrence is unpredictable.

Configure VPP to hand off traffic between 8 RX/worker cores:

# set interface handoff TwentyFiveGigabitEthernet86/0/0 workers 0 1 2 3 4 5 6 7

Assignee: Unassigned
Reporter: Hongjun Ni
Votes: 0
Watchers: 1
Created: 12/Aug/19 7:14 AM
Updated: 22/Oct/19 7:36 AM
Resolved: 12/Aug/19 7:14 AM
