Random crash on latest 23.09.1
-
Do you have the full crash report?
What VM version are the pfSense VMs using?
Steve
-
@stephenw10 yes, I have multiple "dumps" I guess. Can I share them on the forum? Is there no private data in them?
23.09.1-RELEASE
Running on VMware ESXi, 7.0.3, 22348816 with Latest VM Version : ESXi 7.0 U2 and later (VM version 19)
The VMs have 4 vCPUs and 8 GB RAM
VMXNet 3 driver for network
-
You can upload it here: https://nc.netgate.com/nextcloud/s/wpxLk4dAJBJBrsR
-
@stephenw10 done, I uploaded 3 files
-
Ok that's 3 identical backtraces:
db:1:pfs> bt
Tracing pid 0 tid 100011 td 0xfffffe00093aee40
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00085ae180
vpanic() at vpanic+0x163/frame 0xfffffe00085ae2b0
panic() at panic+0x43/frame 0xfffffe00085ae310
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00085ae370
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00085ae3d0
calltrap() at calltrap+0x8/frame 0xfffffe00085ae3d0
--- trap 0xc, rip = 0xffffffff80af8550, rsp = 0xfffffe00085ae4a0, rbp = 0xfffffe00085ae4a0 ---
vmxnet3_isc_txd_credits_update() at vmxnet3_isc_txd_credits_update+0x20/frame 0xfffffe00085ae4a0
iflib_fast_intr_rxtx() at iflib_fast_intr_rxtx+0xf7/frame 0xfffffe00085ae500
intr_event_handle() at intr_event_handle+0x126/frame 0xfffffe00085ae570
intr_execute_handlers() at intr_execute_handlers+0x49/frame 0xfffffe00085ae5a0
Xapic_isr2() at Xapic_isr2+0xdc/frame 0xfffffe00085ae5a0
--- interrupt, rip = 0xffffffff80af85d2, rsp = 0xfffffe00085ae670, rbp = 0xfffffe00085ae670 ---
vmxnet3_isc_txd_credits_update() at vmxnet3_isc_txd_credits_update+0xa2/frame 0xfffffe00085ae670
iflib_completed_tx_reclaim() at iflib_completed_tx_reclaim+0x55/frame 0xfffffe00085ae6e0
iflib_txq_drain() at iflib_txq_drain+0x6b/frame 0xfffffe00085ae760
drain_ring_lockless() at drain_ring_lockless+0x5e/frame 0xfffffe00085ae7b0
ifmp_ring_enqueue() at ifmp_ring_enqueue+0x265/frame 0xfffffe00085ae7f0
iflib_if_transmit() at iflib_if_transmit+0x243/frame 0xfffffe00085ae860
ether_output_frame() at ether_output_frame+0xa3/frame 0xfffffe00085ae890
ether_output() at ether_output+0x673/frame 0xfffffe00085ae920
ip_output_send() at ip_output_send+0xdc/frame 0xfffffe00085ae960
ip_output() at ip_output+0x1284/frame 0xfffffe00085aea60
ip_forward() at ip_forward+0x3c2/frame 0xfffffe00085aeb10
ip_input() at ip_input+0x6e9/frame 0xfffffe00085aeb70
netisr_dispatch_src() at netisr_dispatch_src+0x22c/frame 0xfffffe00085aebc0
ether_demux() at ether_demux+0x149/frame 0xfffffe00085aebf0
ether_nh_input() at ether_nh_input+0x36e/frame 0xfffffe00085aec50
netisr_dispatch_src() at netisr_dispatch_src+0xaf/frame 0xfffffe00085aeca0
ether_input() at ether_input+0x69/frame 0xfffffe00085aed00
iflib_rxeof() at iflib_rxeof+0xc46/frame 0xfffffe00085aee00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe00085aee40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x14e/frame 0xfffffe00085aeec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc2/frame 0xfffffe00085aeef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00085aef30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00085aef30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
And it's the same crash as this thread:
https://forum.netgate.com/topic/184597/pfsense-reboot-randomly-on-vmware/
It looks similar to this FreeBSD bug but that is already fixed: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239118
I don't see it in the logs but are you running WireGuard?
You might try setting the tunables shown here for the descriptor values:
https://docs.netgate.com/pfsense/en/latest/hardware/tune.html#vmware-vmx-4-interfaces
I'll check if there have been any updates there....
Steve
-
Ok several devs are looking at this and it looks like there is a suspect.
How much SWAP space do you have on there? We may need to enable a full crash dump to confirm this.
-
Hello @stephenw10
First of all I must thank you for your help :)
We are not using WireGuard, only IPsec site-to-site (9 tunnels).
We have OpenVPN too, but the servers are disabled.
SWAP is 1024 MB.
For the tunables, do I just have to add this:
hw.pci.honor_msi_blacklist="0"
dev.vmx.<id>.iflib.override_ntxds="0,4096"
dev.vmx.<id>.iflib.override_nrxds="0,2048,0"
for all my interfaces?
-
Yes for each vmx NIC.
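As a rough sketch of what "for each vmx NIC" expands to: the helper below (gen_vmx_tunables is a made-up name, and the interface ids 0, 1 and 2 are only examples) prints one pair of descriptor lines per id, using the values quoted above. It assumes the output is pasted into the loader tunables file, typically /boot/loader.conf.local on pfSense; check the linked docs for your version.

```shell
#!/bin/sh
# Hypothetical helper: print the loader tunables for a list of vmx ids.
# The values are the ones quoted in this thread; the ids are examples only.
gen_vmx_tunables() {
    # Global tunable, needed once.
    echo 'hw.pci.honor_msi_blacklist="0"'
    # One TX and one RX descriptor override per vmx interface id.
    for id in "$@"; do
        printf 'dev.vmx.%s.iflib.override_ntxds="0,4096"\n' "$id"
        printf 'dev.vmx.%s.iflib.override_nrxds="0,2048,0"\n' "$id"
    done
}

# Example: a VM with vmx0, vmx1 and vmx2.
gen_vmx_tunables 0 1 2
```

The script only prints the lines, so you can review them before adding anything to the file. A reboot is needed for loader tunables to take effect.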
The issue we are looking at appears to occur when descriptors are exhausted, so if you set those values it should at least take much longer to hit.
-
Thanks @stephenw10, I made the changes. I just have to reboot my "primary" now (I'm waiting for the end of working hours; the CARP switchover always causes a small interruption on the IPsec tunnels)
-
reboot done, we'll see ;-)
-
I had my first crash during a vMotion of the secondary pfSense:
Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80fb1c0a
stack pointer = 0x0:0xfffffe000859f7d0
frame pointer = 0x0:0xfffffe000859f920
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (if_io_tqg_3)
rdi: 0000000000000000 rsi: fffff800b4b9c07a rdx: 0000000000000000
rcx: 0000000005966257 r8: 00000000a1990c31 r9: 0000000023e34fa7
rax: 0000000000000002 rbx: fffff800b4b9c000 rbp: fffffe000859f920
r10: 0000000000003354 r11: fffff800b4b9c000 r12: fffffe000859f980
r13: 0000000000000000 r14: 0000000000000000 r15: fffff8000cce2608
trap number = 12
panic: page fault
cpuid = 3
time = 1707492630
KDB: enter: panic
I vMotioned the primary and there was no crash...
-
We need to see the backtrace to know more there.
-
@stephenw10 I uploaded the files to your Nextcloud link.
-
Backtrace:
db:1:pfs> bt
Tracing pid 0 tid 100014 td 0xfffffe000932a740
kdb_enter() at kdb_enter+0x32/frame 0xfffffe000859f4b0
vpanic() at vpanic+0x163/frame 0xfffffe000859f5e0
panic() at panic+0x43/frame 0xfffffe000859f640
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe000859f6a0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe000859f700
calltrap() at calltrap+0x8/frame 0xfffffe000859f700
--- trap 0xc, rip = 0xffffffff80fb1c0a, rsp = 0xfffffe000859f7d0, rbp = 0xfffffe000859f920 ---
pf_test_state_tcp() at pf_test_state_tcp+0x125a/frame 0xfffffe000859f920
pf_test() at pf_test+0x1353/frame 0xfffffe000859fac0
pf_check_in() at pf_check_in+0x27/frame 0xfffffe000859fae0
pfil_mbuf_in() at pfil_mbuf_in+0x38/frame 0xfffffe000859fb10
ip_input() at ip_input+0x3ae/frame 0xfffffe000859fb70
netisr_dispatch_src() at netisr_dispatch_src+0x22c/frame 0xfffffe000859fbc0
ether_demux() at ether_demux+0x149/frame 0xfffffe000859fbf0
ether_nh_input() at ether_nh_input+0x36e/frame 0xfffffe000859fc50
netisr_dispatch_src() at netisr_dispatch_src+0xaf/frame 0xfffffe000859fca0
ether_input() at ether_input+0x69/frame 0xfffffe000859fd00
iflib_rxeof() at iflib_rxeof+0xc46/frame 0xfffffe000859fe00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe000859fe40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x14e/frame 0xfffffe000859fec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc2/frame 0xfffffe000859fef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe000859ff30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000859ff30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
So not the same issue.
Seems similar to a few other bugs but not identical.
The message buffer shows it failing back and forth between the nodes a few times. Was that expected?
-
@stephenw10 said in Random crash on latest 23.09.1:
That last backtrace decodes to /var/jenkins/workspace/pfSense-Plus-snapshots-23_09_1-main/sources/FreeBSD-src-plus-RELENG_23_09_1/sys/netpfil/pf/pf.c:5743, which is in pf_test_state_tcp(), where it applies NAT. It likely means that the state has a NULL key (pf_kstate->key[]).
It's not clear to me how that'd happen. Speculatively, perhaps there's a race on state insertion, or there's something wrong in the pfsync state transfer. A full core dump might be helpful here, if this can be reproduced.
-
@kprovost how can I get a full core dump?
-
You can just set the ddb file to dump rather than textdump, but you need enough SWAP space to dump to, and 1 GB probably isn't enough.
So you can reinstall with more swap space or add SWAP somehow. For example: https://forum.netgate.com/post/1127502
-
@stephenw10 I added a second disk to the VM and now have 12 GB of SWAP.
My config was:
#script kdb.enter.default=textdump set; capture on; run pfs ; capture off; textdump dump; reset
Replaced by:
script kdb.enter.default=bt ; show registers ; dump ; reset
I rebooted too.
-
Great. You can check that it's working as expected by forcing a panic and seeing if the kernel core dump is created.
Running:
sysctl debug.kdb.panic=1
will panic the system immediately and should create the core dump.
Steve
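After the box comes back up, something like the sketch below can confirm a dump was actually saved. It assumes dumps land in /var/crash as vmcore.N, which is the FreeBSD savecore default; list_vmcores is just an illustrative name, and the directory can be passed as an argument if yours differs.

```shell
#!/bin/sh
# Hypothetical helper: list kernel core dumps in a crash directory.
# Assumption: savecore writes vmcore.N files under /var/crash.
list_vmcores() {
    dir="${1:-/var/crash}"
    found=1    # shell-style status: non-zero until a dump is seen
    for f in "$dir"/vmcore.*; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        echo "$f"
        found=0
    done
    return "$found"
}

list_vmcores || echo "no kernel core dump found"
```

If it prints nothing but the fallback message, check that the swap device is large enough and that the ddb script really ends in "dump" rather than "textdump dump".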
-
@stephenw10 I tested your command on my "backup" pfSense and it worked; I got a 1 GB vmcore.0 file.
So we just have to wait now...