The following sequence of steps causes VPP to crash:
- Configure SNAT
- Ping from VM to outside so that there is dynamic NAT session
- Set static 1:1 mapping for the VM’s fixed IP addr
- Check for dynamic NAT sessions using nat44_user_dump followed by nat44_user_session_dump
VPP crashes when nat44_user_session_dump is executed.
Here is the backtrace:
(gdb) bt
#0 0x00007fffb286594e in snat_get_worker_in2out_cb (ip0=0x7fffb5c60d70, rx_fib_index0=1) at /home/heat-admin/vpp_stable_1801/build-data/../src/plugins/nat/nat.c:2758
#1 0x00007fffb287a8e7 in vl_api_nat44_user_session_dump_t_handler (mp=0x300b7924) at /home/heat-admin/vpp_stable_1801/build-data/../src/plugins/nat/nat_api.c:1166
#2 0x00007ffff7b9abd0 in vl_msg_api_handler_with_vm_node (am=0x7ffff7dda460 <api_main>, the_msg=0x300b7924, vm=0x7ffff7b8c480 <vlib_global_main>, node=0x7fffb5c58000)
at /home/heat-admin/vpp_stable_1801/build-data/../src/vlibapi/api_shared.c:508
#3 0x00007ffff7bb6d83 in memclnt_process (vm=0x7ffff7b8c480 <vlib_global_main>, node=0x7fffb5c58000, f=0x0)
at /home/heat-admin/vpp_stable_1801/build-data/../src/vlibmemory/memory_vlib.c:970
#4 0x00007ffff78e728e in vlib_process_bootstrap (_a=140736240516176) at /home/heat-admin/vpp_stable_1801/build-data/../src/vlib/main.c:1231
#5 0x00007ffff5f506b0 in clib_calljmp () at /home/heat-admin/vpp_stable_1801/build-data/../src/vppinfra/longjmp.S:110
#6 0x00007fffb59f7c20 in ?? ()
#7 0x00007ffff78e73c3 in vlib_process_startup (vm=0x7fffb5c58000, p=0x1b, f=0x7ffff78e7455 <vlib_process_resume+111>)
at /home/heat-admin/vpp_stable_1801/build-data/../src/vlib/main.c:1253
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)
The crash happens because snat_get_worker_in2out_cb is trying to access the variable sm->worker which is NULL:
(gdb) c
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x00007fffb286594e in snat_get_worker_in2out_cb (ip0=0x7fffb5b01d70, rx_fib_index0=0) at /home/heat-admin/vpp_stable_1801/build-data/../src/plugins/nat/nat.c:2758
2758 if (PREDICT_TRUE (is_pow2 (_vec_len (sm->workers))))
(gdb)
(gdb) p snat_main
.
.
icmp_match_out2in_cb = 0x7fffb28a24ca <icmp_match_out2in_slow>, num_workers = 1, first_worker_index = 1, next_worker = 0, workers = 0x0,
Here’s the culprit code in nat.c:
2714 if (PREDICT_TRUE (is_pow2 (_vec_len (sm->workers))))
2715 next_worker_index += sm->workers[hash & (_vec_len (sm->workers) - 1)];
2716 else
2717 next_worker_index += sm->workers[hash % _vec_len (sm->workers)];
I set breakpoints and watched sm->worker from the time snat_init() initializes it to NULL and found that it stays NULL across the various function calls. Simply enclosing the above chunk of code within a NULL check prevents the crash:
[root@overcloud-controller-0 bin]# git diff ../../../../src/plugins/nat/nat.c
diff --git a/src/plugins/nat/nat.c b/src/plugins/nat/nat.c
index acd1671..cbb261e 100644
— a/src/plugins/nat/nat.c
+++ b/src/plugins/nat/nat.c
@@ -2741,10 +2741,13 @@ snat_get_worker_in2out_cb (ip4_header_t * ip0, u32 rx_fib_index0)
hash = ip0->src_address.as_u32 + (ip0->src_address.as_u32 >> 8) +
(ip0->src_address.as_u32 >> 16) + (ip0->src_address.as_u32 >>24);
if (PREDICT_TRUE (is_pow2 (_vec_len (sm>workers))))
next_worker_index += sm>workers[hash & (_vec_len (sm->workers) - 1)];
- else
next_worker_index += sm>workers[hash % _vec_len (sm->workers)];
+ if (sm->workers)
+
{ + if (PREDICT_TRUE (is_pow2 (_vec_len (sm->workers)))) + next_worker_index += sm->workers[hash & (_vec_len (sm->workers) - 1)]; + else + next_worker_index += sm->workers[hash % _vec_len (sm->workers)]; + }
return next_worker_index;
}
But I am not too sure if this is the only change required. Maybe the variable ought not to be NULL? Maybe something else is the matter.