Random pfSense crash after running for a week with no issues
-
I have used pfSense for over 5 years and have never run into this issue, but I just bought a new Topton N6000 router and moved from bare metal to running ESXi on the Topton. I've virtualized pfSense in the past with no issues, but am now questioning whether to go back to bare metal. The main reasons I went with ESXi were the i226 NIC support and the ability to run backups to my Synology NAS.
At first, I believed that the Synology Active Backup for Business service was killing my WAN, since it consistently died shortly after the backup finished. Basically, ESXi creates a snapshot of the VM, and I believe that when this snapshot was consolidated at the end of the backup, something weird happened to the WAN interface and it went down. I installed a cron job that runs every 5 minutes and restarts the interface if it cannot resolve an IP address for google.com and yahoo.com. Since implementing that, all seemed to be fine. I know that the WAN went down at least once after I disabled the Synology backup service, but it seemed random and not at a consistent time like before.
Today, on the other hand, I was at work and got alerts that my entire network had gone down. I figured that the cron job would fix it, but it never did. When I finally rebooted, pfSense came up with a crash report. I have gone through it and could not find anything that looks like a red flag.
If anyone is better with these crash logs than me, please let me know if you see anything that I should address first. I'm kinda hoping I am just missing something blatantly obvious when it comes to virtualizing pfSense, since it has been a few years since I last did it.
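For reference, the watchdog is roughly along these lines. This is a minimal sketch rather than the exact script: it assumes Python 3 is available on the box, that the WAN interface is called vmx0 (a hypothetical name, substitute your own), that "cannot resolve" means both lookups fail, and that bouncing the interface with ifconfig is enough to bring it back.

```python
import socket
import subprocess

HOSTS = ("google.com", "yahoo.com")
WAN_IF = "vmx0"  # hypothetical interface name, adjust to your WAN NIC


def can_resolve(host):
    """Return True if the hostname resolves to an IP address."""
    try:
        socket.gethostbyname(host)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def wan_is_down(hosts=HOSTS, resolver=can_resolve):
    """Treat the WAN as down only if every host fails to resolve."""
    return not any(resolver(h) for h in hosts)


def restart_interface(ifname=WAN_IF, run=subprocess.run):
    """Bounce the interface; assumes down/up is enough to recover it."""
    run(["ifconfig", ifname, "down"], check=True)
    run(["ifconfig", ifname, "up"], check=True)


def main():
    if wan_is_down():
        restart_interface()


if __name__ == "__main__":
    main()
```

It would be scheduled with a cron entry such as `*/5 * * * * /usr/local/bin/python3 /root/wan_watchdog.py`, where both paths are placeholders. A check like this can only help while the system itself is still running.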
-
There are three panics shown there in the message buffer but only one backtrace:
db:1:pfs> bt
Tracing pid 0 tid 100306 td 0xfffffe010b2611e0
kdb_enter() at kdb_enter+0x32/frame 0xfffffe010aaaeb20
vpanic() at vpanic+0x183/frame 0xfffffe010aaaeb70
panic() at panic+0x43/frame 0xfffffe010aaaebd0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe010aaaec30
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe010aaaec90
calltrap() at calltrap+0x8/frame 0xfffffe010aaaec90
--- trap 0xc, rip = 0xffffffff84138690, rsp = 0xfffffe010aaaed68, rbp = 0xfffffe010aaaee40 ---
wg_send() at wg_send/frame 0xfffffe010aaaee40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe010aaaeec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc3/frame 0xfffffe010aaaeef0
fork_exit() at fork_exit+0x7d/frame 0xfffffe010aaaef30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe010aaaef30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:1:pfs> show registers
cs 0x20
ds 0x3b
es 0x3b
fs 0x13
gs 0x1b
ss 0x28
rax 0x12
rcx 0x1
rdx 0xfffffe010aaae740
rbx 0x100
rsp 0xfffffe010aaaeb20
rbp 0xfffffe010aaaeb20
rsi 0x19
rdi 0xffffffff82d836d8 vt_conswindow+0x10
r8 0
r9 0x304f _binary_elf_vdso_so_1_size+0x2a3f
r10 0xffffffff82d83818 vt_consdev
r11 0xcedfc2df9afff59c
r12 0
r13 0xfffffe010aaaeca0
r14 0xfffffe010aaaebb0
r15 0xfffffe010b2611e0
rip 0xffffffff80d48ff2 kdb_enter+0x32
rflags 0x82
kdb_enter+0x32: movq $0,0x2342e13(%rip)
db:1:pfs> show pcpu
cpuid = 2
dynamic pcpu = 0xfffffe00981f6580
curthread = 0xfffffe010b2611e0: pid 0 tid 100306 critnest 1 "wg_tqg_2"
curpcb = 0xfffffe010b261700
fpcurthread = none
idlethread = 0xfffffe001b1b5560: tid 100005 "idle: cpu2"
self = 0xffffffff84012000
curpmap = 0xffffffff8303ff50
tssp = 0xffffffff84012384
rsp0 = 0xfffffe010aaaf000
kcr3 = 0xffffffffffffffff
ucr3 = 0xffffffffffffffff
scr3 = 0x0
gs32p = 0xffffffff84012404
ldt = 0xffffffff84012444
tss = 0xffffffff84012434
curvnet = 0
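When reading a backtrace like this, the frame of interest is the first one listed after the "--- trap" marker, since that is where the fault occurred. As a rough illustration only (the function name and regexes here are my own, not part of any pfSense tooling), this can be pulled out of a saved crash report like so:

```python
import re

# Matches one ddb frame, e.g. "wg_send() at wg_send/frame 0xfffffe010aaaee40"
FRAME_RE = re.compile(r"(\w+)\(\) at \w+(?:\+0x[0-9a-f]+)?/frame 0x[0-9a-f]+")
# Matches the trap marker, e.g. "--- trap 0xc, rip = ..."
TRAP_RE = re.compile(r"--- trap (0x[0-9a-f]+), rip = ")


def faulting_function(bt_text):
    """Return (trap_number, function) for the first frame after a trap marker."""
    trap = TRAP_RE.search(bt_text)
    if trap is None:
        return None
    frame = FRAME_RE.search(bt_text, trap.end())
    if frame is None:
        return None
    return trap.group(1), frame.group(1)


# Excerpt copied from the backtrace above
SAMPLE = (
    "trap_pfault() at trap_pfault+0x4f/frame 0xfffffe010aaaec90 "
    "calltrap() at calltrap+0x8/frame 0xfffffe010aaaec90 "
    "--- trap 0xc, rip = 0xffffffff84138690, rsp = 0xfffffe010aaaed68, "
    "rbp = 0xfffffe010aaaee40 --- "
    "wg_send() at wg_send/frame 0xfffffe010aaaee40 "
    "gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe010aaaeec0"
)

print(faulting_function(SAMPLE))
```

Here the faulting frame is wg_send, and trap 0xc is a page fault, which matches the trap_pfault frame just above the marker.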
Two of the panics appear to be in WireGuard, so you might try running without that enabled as a test.
However, it looks like you're running a Jasper Lake CPU, and they have known issues with virtualisation. Make sure you have the current BIOS/microcode running there.
I'm not sure if it applies directly to ESXi, but it should still be checked: https://forums.servethehome.com/index.php?threads/jasper-lake-proxmox-kvm-qemu-vm-guest-stability.38824/
Steve
-
@stephenw10 said in Random pfSense crash after running for a week with no issues:
https://forums.servethehome.com/index.php?threads/jasper-lake-proxmox-kvm-qemu-vm-guest-stability.38824/
Thanks! That definitely gives me something to go off of, and would make sense. I have had days where I wake up and check my phone, and have no issues visiting one or two sites, then the WAN dies. Makes me think that the CPU is likely idle for hours, then finally tries to 'wake up', then crashes, or at least has issues bringing up my WAN NIC.
You definitely take a risk buying one of these Topton boxes from China in terms of firmware / microcode, and it looks like I am likely going to have to go down the path of getting it up to date, which they don't make easy since they don't have a website for technical support.
-
I had extremely bad experiences with Topton, XCY and other cheap Chinese appliances. They randomly reboot/crash and have overheating issues. If you need a cheap unit, go with Qotom.
-
First try disabling CPU power saving modes in the BIOS and see if that changes anything.