Random pfSense crash after running for a week with no issues

MReprogle

I have used pfSense for over 5 years and have never run into this issue, but I just bought a new Topton N6000 router and moved from bare metal to running ESXi on the Topton. I've virtualized pfSense in the past with no issues, but am now questioning whether to go back to bare metal or not. The main reasons I went with ESXi were the i226 NIC support and the ability to run backups to my Synology NAS.

So, at first, I believed that the Synology Active Backup for Business service was killing my WAN, since it consistently died shortly after the backup finished. Basically, ESXi creates a snapshot of the VM, and I believe that when this snapshot was consolidated at the end of the backup, something weird happened to the WAN interface and it would go down. I installed a cron job that pings out every 5 minutes and, if it cannot resolve an IP address for google.com and yahoo.com, restarts the interface. Since implementing that, all seemed to be fine. I know that the WAN went down at least once after I disabled the Synology backup service, but it seemed random and not at a consistent time like it had been.
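For anyone wanting to try the same kind of watchdog, here is a minimal sketch of the idea in Python. It is not the actual script from this post: the two hostnames come from the description above, but the `RESTART_CMD` path is a hypothetical placeholder and the function names are made up for illustration. It assumes Python 3 is available on the box and that cron runs the script on a `*/5 * * * *` schedule.

```python
#!/usr/bin/env python3
"""WAN watchdog sketch: restart an interface when DNS lookups fail."""
import socket
import subprocess

# Hostnames taken from the post; everything else here is an assumption.
CHECK_HOSTS = ("google.com", "yahoo.com")
# Hypothetical placeholder: point this at whatever restarts your WAN interface.
RESTART_CMD = ["/usr/local/bin/restart_wan.sh"]


def can_resolve(host, resolver=socket.gethostbyname):
    """Return True if the host resolves to an IP address."""
    try:
        resolver(host)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def wan_is_down(hosts=CHECK_HOSTS, resolver=socket.gethostbyname):
    """Treat the WAN as down only when every lookup fails."""
    return not any(can_resolve(host, resolver) for host in hosts)


def main(run=subprocess.run):
    """Restart the interface if none of the hosts can be resolved."""
    if wan_is_down():
        run(RESTART_CMD, check=False)


if __name__ == "__main__":
    main()
```

Requiring both lookups to fail avoids restarting the interface because of a single flaky name. A check like this only helps while the system itself is still running; it cannot recover from a kernel panic.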

Today, on the other hand, I was at work and got alerts that my entire network went down. I figured that the cron job would fix it, but it never did. When I finally rebooted, pfSense came up with a crash report. I have gone through it and cannot find anything that looks like a red flag to me.

If anyone is better with these crash logs than me, please let me know if you see anything that I should address first. I'm kinda hoping I am just missing something blatantly obvious when it comes to virtualizing pfSense, since it has been a few years since I last did it.

textdump.tar.0

stephenw10

There are three panics shown there in the message buffer but only one backtrace:

db:1:pfs> bt
Tracing pid 0 tid 100306 td 0xfffffe010b2611e0
kdb_enter() at kdb_enter+0x32/frame 0xfffffe010aaaeb20
vpanic() at vpanic+0x183/frame 0xfffffe010aaaeb70
panic() at panic+0x43/frame 0xfffffe010aaaebd0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe010aaaec30
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe010aaaec90
calltrap() at calltrap+0x8/frame 0xfffffe010aaaec90
--- trap 0xc, rip = 0xffffffff84138690, rsp = 0xfffffe010aaaed68, rbp = 0xfffffe010aaaee40 ---
wg_send() at wg_send/frame 0xfffffe010aaaee40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe010aaaeec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc3/frame 0xfffffe010aaaeef0
fork_exit() at fork_exit+0x7d/frame 0xfffffe010aaaef30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe010aaaef30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:1:pfs>  show registers
cs                        0x20
ds                        0x3b
es                        0x3b
fs                        0x13
gs                        0x1b
ss                        0x28
rax                       0x12
rcx                        0x1
rdx         0xfffffe010aaae740
rbx                      0x100
rsp         0xfffffe010aaaeb20
rbp         0xfffffe010aaaeb20
rsi                       0x19
rdi         0xffffffff82d836d8  vt_conswindow+0x10
r8                           0
r9                      0x304f  _binary_elf_vdso_so_1_size+0x2a3f
r10         0xffffffff82d83818  vt_consdev
r11         0xcedfc2df9afff59c
r12                          0
r13         0xfffffe010aaaeca0
r14         0xfffffe010aaaebb0
r15         0xfffffe010b2611e0
rip         0xffffffff80d48ff2  kdb_enter+0x32
rflags                    0x82
kdb_enter+0x32: movq    $0,0x2342e13(%rip)
db:1:pfs>  show pcpu
cpuid        = 2
dynamic pcpu = 0xfffffe00981f6580
curthread    = 0xfffffe010b2611e0: pid 0 tid 100306 critnest 1 "wg_tqg_2"
curpcb       = 0xfffffe010b261700
fpcurthread  = none
idlethread   = 0xfffffe001b1b5560: tid 100005 "idle: cpu2"
self         = 0xffffffff84012000
curpmap      = 0xffffffff8303ff50
tssp         = 0xffffffff84012384
rsp0         = 0xfffffe010aaaf000
kcr3         = 0xffffffffffffffff
ucr3         = 0xffffffffffffffff
scr3         = 0x0
gs32p        = 0xffffffff84012404
ldt          = 0xffffffff84012444
tss          = 0xffffffff84012434
curvnet      = 0

Two of the panics appear to be in WireGuard, so you might try running without that enabled as a test.
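As a rough illustration of how to read a backtrace like the one above: the frame listed directly after the first non-zero `--- trap` line is where the fault happened (here `wg_send`, with trap 0xc being a page fault on amd64). The Python sketch below pulls that out of a pasted backtrace. The function name `faulting_frame` and the regular expressions are my own and are not part of any pfSense or FreeBSD tooling.

```python
import re

# Matches frame lines such as "wg_send() at wg_send/frame 0x..."
FRAME_RE = re.compile(r"^(\w+)\(\) at ")
# Matches trap markers such as "--- trap 0xc, rip = ..."
TRAP_RE = re.compile(r"^--- trap (0x[0-9a-f]+|\d+),")

SAMPLE = """\
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe010aaaec90
calltrap() at calltrap+0x8/frame 0xfffffe010aaaec90
--- trap 0xc, rip = 0xffffffff84138690, rsp = 0xfffffe010aaaed68, rbp = 0xfffffe010aaaee40 ---
wg_send() at wg_send/frame 0xfffffe010aaaee40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe010aaaeec0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
"""


def faulting_frame(backtrace):
    """Return (trap_number, function) for the first non-zero trap, or None."""
    lines = [line.strip() for line in backtrace.splitlines()]
    for i, line in enumerate(lines):
        trap_match = TRAP_RE.match(line)
        if not trap_match:
            continue
        trap = int(trap_match.group(1), 0)  # base 0 handles "0xc" and "0"
        if trap == 0:
            continue  # the trailing "trap 0" line only marks the end of the stack
        # The next frame line after the trap marker is the faulting function.
        for following in lines[i + 1:]:
            frame_match = FRAME_RE.match(following)
            if frame_match:
                return trap, frame_match.group(1)
    return None


if __name__ == "__main__":
    print(faulting_frame(SAMPLE))  # (12, 'wg_send')
```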

However, it looks like you're running a Jasper Lake CPU and they have known issues with virtualisation. Make sure you have the current BIOS/microcode running there.
I'm not sure if it applies directly to ESXi but still should be checked: https://forums.servethehome.com/index.php?threads/jasper-lake-proxmox-kvm-qemu-vm-guest-stability.38824/

Steve

MReprogle

@stephenw10 said in Random pfSense crash after running for a week with no issues:

https://forums.servethehome.com/index.php?threads/jasper-lake-proxmox-kvm-qemu-vm-guest-stability.38824/

Thanks! That definitely gives me something to go off of, and would make sense. I have had days where I wake up and check my phone, and have no issues visiting one or two sites, then the WAN dies. Makes me think that the CPU is likely idle for hours, then finally tries to 'wake up', then crashes, or at least has issues bringing up my WAN NIC.

You definitely take a risk buying one of these Topton boxes from China in terms of firmware / microcode, and it looks like I am likely going to have to go down the path of getting it up to date, which they don't make easy since they don't have a website for technical support.

nimrod

I had extremely bad experiences with Topton, XCY and other cheap Chinese appliances. They randomly reboot/crash and have overheating issues. If you need a cheap unit, go with Qotom.

stephenw10

First try disabling CPU power saving modes in the BIOS and see if that changes anything.