pfsense reboot randomly on vmware

stephenw10

Do you have a full crash report from that?

security_sharezone

stephenw10

Backtrace:

db:0:kdb.enter.default>  bt
Tracing pid 11 tid 100006 td 0xfffffe00c5876e40
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00c57a6880
vpanic() at vpanic+0x163/frame 0xfffffe00c57a69b0
panic() at panic+0x43/frame 0xfffffe00c57a6a10
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00c57a6a70
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00c57a6ad0
calltrap() at calltrap+0x8/frame 0xfffffe00c57a6ad0
--- trap 0xc, rip = 0xffffffff80af1d90, rsp = 0xfffffe00c57a6ba0, rbp = 0xfffffe00c57a6ba0 ---
vmxnet3_isc_txd_credits_update() at vmxnet3_isc_txd_credits_update+0x20/frame 0xfffffe00c57a6ba0
iflib_fast_intr_rxtx() at iflib_fast_intr_rxtx+0xf7/frame 0xfffffe00c57a6c00
intr_event_handle() at intr_event_handle+0x126/frame 0xfffffe00c57a6c70
intr_execute_handlers() at intr_execute_handlers+0x49/frame 0xfffffe00c57a6ca0
Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c57a6ca0
--- interrupt, rip = 0xffffffff81255c76, rsp = 0xfffffe00c57a6d70, rbp = 0xfffffe00c57a6d70 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6/frame 0xfffffe00c57a6d70
acpi_cpu_idle() at acpi_cpu_idle+0x2fe/frame 0xfffffe00c57a6db0
cpu_idle_acpi() at cpu_idle_acpi+0x46/frame 0xfffffe00c57a6dd0
cpu_idle() at cpu_idle+0x9d/frame 0xfffffe00c57a6df0
sched_idletd() at sched_idletd+0x576/frame 0xfffffe00c57a6ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00c57a6f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00c57a6f30

Panic 1:


Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address	= 0xfffffe00c5e00008
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80af1d90
stack pointer	        = 0x28:0xfffffe00c57a6ba0
frame pointer	        = 0x28:0xfffffe00c57a6ba0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= resume, IOPL = 0
current process		= 11 (idle: cpu3)
rdi: fffff80005ebf800 rsi: 0000000000000240 rdx: 0000000000000000
rcx: 0000000000000000  r8: 0000000000002000  r9: fffffe001e98f000
rax: fffffe00c5dfe000 rbx: fffff80005db6800 rbp: fffffe00c57a6ba0
r10: fffffe001e98fa30 r11: 0000000000000001 r12: 0000000000000003
r13: 0000000000001ec0 r14: fffffe00c59bb000 r15: 0000000000000000
trap number		= 12
panic: page fault
cpuid = 3
time = 1701779519
KDB: enter: panic

Panic 2:

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address	= 0xfffffe00c5e00008
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80af1d90
stack pointer	        = 0x28:0xfffffe0120f61c90
frame pointer	        = 0x28:0xfffffe0120f61c90
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= resume, IOPL = 0
current process		= 0 (wg_tqg_3)
rdi: fffff80005e80800 rsi: 0000000000000240 rdx: 0000000000000000
rcx: 0000000000000000  r8: 0000000000002000  r9: fffffe0120f62000
rax: fffffe00c5dfe000 rbx: fffff80005db6800 rbp: fffffe0120f61c90
r10: 00000000000001f4 r11: 00000000800b1470 r12: 0000000000000003
r13: 0000000000001ec0 r14: fffffe00c59bb000 r15: 0000000000000000
trap number		= 12
panic: page fault
cpuid = 3
time = 1701788100
KDB: enter: panic
Uptime: 2h11m12s

It's the same as this thread: https://forum.netgate.com/topic/182898/crash-report-14-0-current-freebsd-14-0-current-1-releng_2_7_0-n255866-686c8d3c1f0/
Though I'm still not convinced the cause there was bad ram. Unless maybe if you are also using a Dell server as host.

Steve
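One way to check that the two panics above really are the same fault is to compare a few fields mechanically. The following is only a minimal sketch: the helper names `parse_panic` and `same_fault` are made up for illustration, and the sample text is trimmed from the two panic reports in this thread.

```python
# Sketch: compare key fields from two FreeBSD "Fatal trap 12" reports.
# parse_panic() and same_fault() are hypothetical helpers, not part of pfSense.

PANIC_1 = """\
fault virtual address = 0xfffffe00c5e00008
instruction pointer = 0x20:0xffffffff80af1d90
current process = 11 (idle: cpu3)
time = 1701779519
"""

PANIC_2 = """\
fault virtual address = 0xfffffe00c5e00008
instruction pointer = 0x20:0xffffffff80af1d90
current process = 0 (wg_tqg_3)
time = 1701788100
"""


def parse_panic(text):
    """Turn 'key = value' lines of a panic report into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            # Collapse tabs and repeated spaces in the key.
            fields[" ".join(key.split())] = value.strip()
    return fields


def same_fault(a, b):
    """True if both panics faulted at the same address and instruction."""
    keys = ("fault virtual address", "instruction pointer")
    return all(a.get(k) == b.get(k) for k in keys)


p1 = parse_panic(PANIC_1)
p2 = parse_panic(PANIC_2)
gap = int(p2["time"]) - int(p1["time"])  # seconds between the two panics

print("same fault:", same_fault(p1, p2))
print("processes:", p1["current process"], "|", p2["current process"])
print("seconds between panics:", gap)
```

Here both reports fault at the same address and instruction pointer even though the running process differs (idle thread in one, a WireGuard task queue in the other), which matches the reading that the interrupted process is incidental.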

security_sharezone

@stephenw10 The hypervisor is VMware 8.0 running on a PowerEdge R630. DBE and RAM both show OK in iDRAC.
The physical NIC is a Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet, and the virtual interfaces on the pfSense VM are vmxnet with open-vm-tools installed.

stephenw10

The only other thing it looks like is this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239118

But that is old and we have that patch anyway: https://github.com/pfsense/FreeBSD-src/commit/4166913371c9be822cfa34419c397f711de83b49

security_sharezone

@stephenw10 said in pfsense reboot randomly on vmware:

But that is old and we have that patch anyway

how do i apply the patch? can you help me?

security_sharezone

@security_sharezone because my system does not have a sys directory.
I only have dev, and vmware is not present in it.

bmeeks

@security_sharezone said in pfsense reboot randomly on vmware:

@stephenw10 said in pfsense reboot randomly on vmware:

But that is old and we have that patch anyway

how do i apply the patch? can you help me?

He means the patch is already included in the current pfSense operating system code. It does not need to be applied. That bug is old enough that the fix for it was included in the latest pfSense kernel builds.

And binary patches like that at the OS level are not something you can easily apply anyway.

stephenw10

That's not a patch you can apply. It's applied to the source and we already have it in. I only pointed that out because that's the only other thing that's close to that backtrace.

security_sharezone

@bmeeks said in pfsense reboot randomly on vmware:

He means the patch is already included in the current pfSense operating system code. It does not need to be applied. That bug is old enough that the fix for it was included in the latest pfSense kernel builds.

And binary patches like that at the OS level are not something you can easily apply anyway.

Unfortunately it keeps rebooting. Last time it lasted 55 minutes. Could it be that I'm averaging 10mb over the WireGuard VPN and it's the VPN that's crashing?

stephenw10

Possibly. One of the panics shows it's in wg at the time.

Do you have any other crash reports? Does it always show the same backtrace?
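To compare backtraces across several crash reports, it can help to pull out just the function that was executing when the trap hit. This is a rough sketch, assuming the `bt` output format shown earlier in this thread; `faulting_function` is a hypothetical helper name.

```python
# Sketch: find the frame that took the trap in a FreeBSD kdb backtrace.
# The frame listed right after the "--- trap ..." marker is the one that faulted.

BACKTRACE = """\
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00c57a6ad0
calltrap() at calltrap+0x8/frame 0xfffffe00c57a6ad0
--- trap 0xc, rip = 0xffffffff80af1d90, rsp = 0xfffffe00c57a6ba0, rbp = 0xfffffe00c57a6ba0 ---
vmxnet3_isc_txd_credits_update() at vmxnet3_isc_txd_credits_update+0x20/frame 0xfffffe00c57a6ba0
iflib_fast_intr_rxtx() at iflib_fast_intr_rxtx+0xf7/frame 0xfffffe00c57a6c00
"""


def faulting_function(backtrace):
    """Return the function name of the first frame after the trap marker."""
    lines = backtrace.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("--- trap"):
            for frame in lines[i + 1:]:
                if "() at " in frame:
                    return frame.split("()", 1)[0].strip()
    return None  # no trap marker found


print(faulting_function(BACKTRACE))
```

Running this over each saved report and comparing the results gives a quick answer to whether every crash lands in the same driver function.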

security_sharezone

@stephenw10 I have just reset and deleted all the files and logs. I'm waiting for another crash.

security_sharezone

@stephenw10 The crash dump is attached: crash_dump.txt

I've been waiting to collect some logs and understand the problem, but I can't get to the bottom of it. I'd analyse the WireGuard VPN, but I really can't make heads or tails of it :-). Can you think of any ideas?

stephenw10

Only one of those 4 crashes seems to be related to wireguard. Are you able to test just disabling WireGuard?

Another good test here would be swapping the NIC type in the hypervisor to something other than vmxnet3. That would require quite a few changes though.

security_sharezone

@stephenw10 said in pfsense reboot randomly on vmware:

Only one of those 4 crashes seems to be related to wireguard. Are you able to test just disabling WireGuard?

Another good test here would be swapping the NIC type in the hypervisor to something other than vmxnet3. That would require quite a few changes though.

I disabled all the Veeam backup jobs that used WireGuard. Changing vmxnet3 to e1000 would also mean a drop in performance.

stephenw10

Did you actually disable WireGuard before those crashes? One of them is in the wg process shortly before it crashes.

It would still be a good test to switch to e1000.

security_sharezone

@stephenw10 said in pfsense reboot randomly on vmware:

Did you actually disable WireGuard before those crashes? One of them is in the wg process shortly before it crashes.

It would still be a good test to switch to e1000.

No, disabled now. I'm monitoring the WireGuard tunnels without traffic. Switching to e1000 is a big job, so I'll make a VM clone and then convert everything.

stephenw10

Yes, it's a significant task. But it would prove the issue is in the vmxnet3 driver. Or disprove it.
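Part of the work on the pfSense side is that the interface names change with the driver. As a rough sketch only, the following rewrites interface assignments in a copy of config.xml. It assumes the e1000 adapters attach as em0, em1, ... in the same order as the old vmx devices, which is not guaranteed and is not confirmed anywhere in this thread. The function name `rename_vmx_interfaces` and the sample XML are illustrative.

```python
import re

# Trimmed, made-up sample of interface assignments in a pfSense config.xml.
SAMPLE_CONFIG = """\
<interfaces>
  <wan><if>vmx0</if></wan>
  <lan><if>vmx1</if></lan>
</interfaces>
"""


def rename_vmx_interfaces(config_xml, new_driver="em"):
    """Rewrite <if>vmxN</if> to <if>emN</if>, keeping the unit number.

    Assumption: unit numbers map one-to-one between the old and new NICs.
    Only plain <if> assignments are touched; VLAN parents and other
    references to vmxN elsewhere in the file are not handled here.
    """
    return re.sub(
        r"<if>vmx(\d+)</if>",
        lambda m: f"<if>{new_driver}{m.group(1)}</if>",
        config_xml,
    )


converted = rename_vmx_interfaces(SAMPLE_CONFIG)
print(converted)
```

Working on the clone and on a copy of the config, as planned, keeps the original VM available if the mapping turns out to be wrong.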

security_sharezone

@stephenw10

I will definitely do it. My steps are:

Keep WireGuard down for two days to see whether the backup job is the problem
Change to e1000

I will give feedback. If a solution is found in the meantime I would be happy.

security_sharezone

@stephenw10 So far 20 hours without a reboot.
WireGuard is up but with no backups running. Only low traffic in transit.