A VPP debug image is running with multiple worker threads and two ports, each receiving an IP packet stream that is IP4-forwarded to the other port. When an IP route is added, an assert fires and VPP crashes:
/usr/bin/vpp[2446]: /scratch/loj/vpp-vts251/build-data/../src/vlib/main.c:293 (vlib_next_frame_change_ownership)
assertion `vec_len (node->next_nodes) == node_runtime->n_next_nodes' fails
The call trace shows that this function is reached from the ip4_lookup path:
#5 0x00007ffff78f50fd in vlib_next_frame_change_ownership (vm=0x7fffb611fa38, node_runtime=0x7fffb6c6e354, next_index=0)
at /scratch/loj/vpp-vts251/build-data/../src/vlib/main.c:293
#6 0x00007ffff78f54b8 in vlib_get_next_frame_internal (vm=0x7fffb611fa38, node=0x7fffb6c6e354, next_index=0,
allocate_new_next_frame=0) at /scratch/loj/vpp-vts251/build-data/../src/vlib/main.c:373
#7 0x00007ffff6f6d5b8 in ip4_lookup_inline (vm=0x7fffb611fa38, node=0x7fffb6c6e354, frame=0x7fffb60b9a00,
lookup_for_responses_to_locally_received_packets=0) at /scratch/loj/vpp-vts251/build-data/../src/vnet/ip/ip4_forward.c:87
#8 0x00007ffff6f6ef23 in ip4_lookup (vm=0x7fffb611fa38, node=0x7fffb6c6e354, frame=0x7fffb60b9a00)
at /scratch/loj/vpp-vts251/build-data/../src/vnet/ip/ip4_forward.c:471
#9 0x00007ffff78f6fad in dispatch_node (vm=0x7fffb611fa38, node=0x7fffb6c6e354, type=VLIB_NODE_TYPE_INTERNAL,
dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7fffb60b9a00, last_time_stamp=177124969750773)
In gdb, the two values compared by the assert differ by one: node_runtime->n_next_nodes is 10, while the length of node->next_nodes is 11:
(gdb) frame 5
#5 0x00007ffff78f50fd in vlib_next_frame_change_ownership (vm=0x7fffb611fa38, node_runtime=0x7fffb6c6e354, next_index=0)
at /scratch/loj/vpp-vts251/build-data/../src/vlib/main.c:293
293 /scratch/loj/vpp-vts251/build-data/../src/vlib/main.c: No such file or directory.
(gdb) p node_runtime->n_next_nodes
$1 = 10
(gdb) p vl(node->next_nodes)
$2 = 11
The node here is the ip4-lookup node:
(gdb) p *node
$1 = {function = 0x7ffff6f6eee7 <ip4_lookup>, name = 0x7fffb5b246d8 "ip4-lookup", name_elog_string = 4393, stats_total = {calls = 6,
vectors = 1146, clocks = 421890, suspends = 0, max_clock = 59737, max_clock_n = 191}, stats_last_clear = {calls = 0,
vectors = 0, clocks = 0, suspends = 0, max_clock = 0, max_clock_n = 0}, type = VLIB_NODE_TYPE_INTERNAL, index = 262,
runtime_index = 225, runtime_data = 0x0, flags = 0, state = 0 '\000', runtime_data_bytes = 0 '\000', n_errors = 0,
scalar_size = 0, vector_size = 4, error_heap_handle = 0, error_heap_index = 0, error_strings = 0x0, next_node_names = 0x0,
next_nodes = 0x7fffb5b24f24, sibling_of = 0x0, sibling_bitmap = 0x7fffb5881e78, n_vectors_by_next_node = 0x7fffb5c1444c,
next_slot_by_node = 0x7fffb583e128, prev_node_bitmap = 0x7fffb5c328e8, owner_node_index = 249, owner_next_index = 2,
format_buffer = 0x0, unformat_buffer = 0x0, format_trace = 0x7ffff6f71a52 <format_ip4_lookup_trace>, validate_frame = 0x0,
state_string = 0x0}
(gdb) p *node_runtime
$2 = {cacheline0 = 0x7fffb6c77954 "\347\356\366\366\377\177", function = 0x7ffff6f6eee7 <ip4_lookup>, errors = 0x7fffb5c1443c,
clocks_since_last_overflow = 0, max_clock = 0, max_clock_n = 0, calls_since_last_overflow = 0, vectors_since_last_overflow = 0,
next_frame_index = 764, node_index = 262, input_main_loops_per_call = 0, main_loop_count_last_dispatch = 0,
main_loop_vector_stats = {0, 0}, flags = 0, state = 0, n_next_nodes = 10, cached_next_index = 0, thread_index = 1,
runtime_data = 0x7fffb6c7799a ""}
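The off-by-one seen in gdb is consistent with a worker thread's node runtime still holding a next-node count that was taken before the main thread added a next node to ip4-lookup. That reading is an assumption based on the values above, not something the traces prove. The toy model below sketches the situation in plain C; every `toy_*` name is hypothetical and none of it is VPP code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for the graph node (hypothetical, not vlib_node_t). */
typedef struct {
    uint32_t *next_nodes; /* indices of next nodes */
    uint32_t n_next;      /* plays the role of vec_len (node->next_nodes) */
} toy_node_t;

/* Toy stand-in for a per-thread runtime copy (hypothetical, not vlib_node_runtime_t). */
typedef struct {
    uint32_t n_next_nodes; /* count cached by the worker thread */
} toy_node_runtime_t;

/* "Main thread" side: append a next node. Returns the new slot, or -1 on allocation failure. */
int toy_node_add_next(toy_node_t *n, uint32_t next_node_index)
{
    uint32_t *p = realloc(n->next_nodes, (n->n_next + 1) * sizeof *p);
    if (p == NULL)
        return -1;
    p[n->n_next] = next_node_index;
    n->next_nodes = p;
    return (int) n->n_next++;
}

/* Refresh the worker's cached count; in this model this is the step
   that only happens while the workers are held at a barrier. */
void toy_runtime_sync(toy_node_runtime_t *rt, const toy_node_t *n)
{
    rt->n_next_nodes = n->n_next;
}

/* Non-aborting form of the invariant. */
int toy_runtime_is_consistent(const toy_node_t *n, const toy_node_runtime_t *rt)
{
    return n->n_next == rt->n_next_nodes;
}

/* Aborting form, mirroring the shape of the failing assertion. */
void toy_runtime_check(const toy_node_t *n, const toy_node_runtime_t *rt)
{
    assert(toy_runtime_is_consistent(n, rt));
}
```

In this model, adding ten next nodes and syncing leaves both counts at 10. An eleventh `toy_node_add_next` without a following `toy_runtime_sync` leaves the node at 11 and the runtime at 10, the same pair of values printed by gdb, and `toy_runtime_check` would abort until the runtime is synced again.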
The test setup uses the following configuration:
set int state TenGigabitEthernetf/0/0 up
set int ip addr TenGigabitEthernetf/0/0 7.0.0.2/24
set ip arp TenGigabitEthernetf/0/0 7.0.0.1 88:1d:fc:c3:d6:c3 static
ip route add 39.0.0.0/24 via 7.0.0.1
set int state TenGigabitEthernet81/0/1 up
set int ip addr TenGigabitEthernet81/0/1 38.0.0.2/24
set ip arp TenGigabitEthernet81/0/1 38.0.0.254 e4:c7:22:55:31:e4 static
After VPP starts, with active traffic arriving on TenGigabitEthernetf/0/0 (source IP 39.0.0.x, destination IP 38.0.0.254), applying the above configuration one command at a time on the VPP CLI triggers the assert on the fourth command, "ip route ...".
A temporary workaround is to mark the "ip route ..." command as not thread safe, so that barrier sync is used for this command, as follows:
diff --git a/src/vnet/ip/lookup.c b/src/vnet/ip/lookup.c
index 6547cad..edc9516 100755
--- a/src/vnet/ip/lookup.c
+++ b/src/vnet/ip/lookup.c
@@ -749,7 +749,7 @@ VLIB_CLI_COMMAND (ip_route_command, static) = {
.path = "ip route",
.short_help = "ip route [add|del] [count <n>] <dst-ip-addr>/<width> [table <table-id>] [via <next-hop-ip
.function = vnet_ip_route_cmd,
- .is_mp_safe = 1,
+// .is_mp_safe = 1,
};
/* INDENT-ON */