pfSense crashes at random times and won't fully reboot
-
Hello All
I have a problem, and it might have been answered somewhere else, so if so please point me in the right direction.
One of the pfSense boxes crashes at random times, and when it reboots it does not finish booting.
Everything is more or less up and running, but pfBlockerNG will not do an update as it states it's still in boot mode.
Also OpenVPN will not allow users to connect, only peer-to-peer works.
It doesn't get to the command line in the console, but if I press Ctrl+C it goes to sh.
Of note, the unit runs on XCP-ng. It's been working great for a long time, but nowadays it comes up with this problem. The last time it took 20 days before this happened, and it is only this virtual machine, not the others.
The XCP-ng host still has plenty of unused resources.
I have attached the dump file textdump.tar
When it does reboot I see output up to the xenguest error, but after that nothing but the "config_aqm" messages.
I now have a working image to revert to, which is the only quick way to get the thing back up and running for now.
But I hope someone can help.
Shane.
-
The important parts of that crash are the backtrace:
db:0:kdb.enter.default> bt
Tracing pid 12 tid 100122 td 0xfffff80008304000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00020cf240
vpanic() at vpanic+0x197/frame 0xfffffe00020cf290
panic() at panic+0x43/frame 0xfffffe00020cf2f0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe00020cf350
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00020cf3a0
trap() at trap+0x286/frame 0xfffffe00020cf4b0
calltrap() at calltrap+0x8/frame 0xfffffe00020cf4b0
--- trap 0xc, rip = 0xffffffff8109c7fa, rsp = 0xfffffe00020cf580, rbp = 0xfffffe00020cf5f0 ---
pf_test_state_udp() at pf_test_state_udp+0x2ba/frame 0xfffffe00020cf5f0
pf_test() at pf_test+0x1db8/frame 0xfffffe00020cf830
pf_check_in() at pf_check_in+0x1d/frame 0xfffffe00020cf850
pfil_run_hooks() at pfil_run_hooks+0xa1/frame 0xfffffe00020cf8f0
ip_tryforward() at ip_tryforward+0x193/frame 0xfffffe00020cf970
ip_input() at ip_input+0x3fe/frame 0xfffffe00020cfa20
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe00020cfa70
ether_demux() at ether_demux+0x16a/frame 0xfffffe00020cfaa0
ether_nh_input() at ether_nh_input+0x330/frame 0xfffffe00020cfb00
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe00020cfb50
ether_input() at ether_input+0x4b/frame 0xfffffe00020cfb80
xn_rxeof() at xn_rxeof+0x55d/frame 0xfffffe00020cfc50
xn_intr() at xn_intr+0x58/frame 0xfffffe00020cfc90
ithread_loop() at ithread_loop+0x23c/frame 0xfffffe00020cfcf0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00020cfd30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00020cfd30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
And the panic:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff8109c7fa
stack pointer = 0x0:0xfffffe00020cf580
frame pointer = 0x0:0xfffffe00020cf5f0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq2343: xn1)
trap number = 12
panic: page fault
cpuid = 0
time = 1656263327
KDB: enter: panic
But also we can see the logs show a lot of:
config_aqm Unable to configure flowset, flowset busy!
config_aqm Unable to configure flowset, flowset busy!
I assume you're running Limiters. How are they configured?
It looks like you're running 2.5.2 there. Is there any specific reason for that?
If it is some bug you're hitting it will not be fixed in 2.5.2. You should upgrade to 2.6 or 22.05.
Steve
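
For anyone reading their own dump, here is a minimal sketch of pulling the faulting function out of a ddb backtrace like the one in this post. It assumes the backtrace is stored with one frame per line, as in the ddb.txt file described in textdump(4). The function name faulting_frame is made up for this example and is not a pfSense or FreeBSD tool.

```shell
#!/bin/sh
# Hypothetical helper: print the function named on the first frame that
# follows a "--- trap" marker in a ddb backtrace read from stdin.
# Assumes one frame per line.
faulting_frame() {
    awk '
        /^--- trap/ { found = 1; next }   # marker line: the next frame is where the trap hit
        found {
            sub(/\(\).*/, "", $1)         # strip the trailing "()" from the function name
            print $1
            exit
        }
    '
}

# Example use, assuming textdump.tar contains ddb.txt:
#   tar -xf textdump.tar ddb.txt && faulting_frame < ddb.txt
```

Run against the backtrace in this post, it should stop at the pf_test_state_udp frame, since that is the first frame after the trap 0xc marker.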
-
Sorry for the slow reply.
I am running 2.5 because for some reason 2.6 would not install over the 2.5, but I am looking at upgrading soon.
The config_aqm messages are because of the limiters in use at the time.
What I found on my side was that when the page fault happened it did not fully reboot. I removed a few functions like SNMP and it became stable for a bit, but in the last two days it's happened again, though at least it's booting fully now.
I have added the dumps.
11 Aug
textdump.tar.0
info.0
12 Aug
textdump.tar.0
info.0
-
Hmm, well the good thing there is that those are all identical backtraces so it almost certainly is a software bug of some sort.
The bad news is that it looks to be in the xn(4) driver, which is a lot less common than it once was.
However, since it also appears to be in UDP, the first thing to do here is make sure you disable all hardware off-loading in System > Advanced > Networking. There are known bugs there.
Steve
-
Steve
First up, thanks for the help.
On the settings, what is checked is:
Disable hardware checksum offload
Disable hardware TCP segmentation offload
Disable hardware large receive offload
Also ticked is "Enable the ALTQ support for hn NICs".
I could also change the NIC from Intel e1000 to Realtek RTL8139, as at the moment this is on XCP-ng, but I'm not sure if that would work. What do you think?
I am looking at doing a full install of 2.6, I hope in a month or so, but this is a production machine so I have to do this during a quiet time in the office.
Shane
-
@shaddow said:
Enable the ALTQ support for hn NICs
That only does anything for hn(4) NICs so Hyper-V or Azure. It doesn't matter here.
pfSense only sees the Xen NIC so changing it from Intel to Realtek would only make any difference if you enabled hardware pass through.
Check the output of:
ifconfig -vm xn0
Make sure the hardware off-loading options are actually disabled.
Steve
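
As a rough sketch of that check, the helper below filters ifconfig output and prints any offload flags still enabled. It looks only at the "options=" line, because the "capabilities=" line shown by -m lists what the NIC supports rather than what is switched on. The function name enabled_offloads is made up for this example, and the flag list is an assumption based on the usual FreeBSD flag names.

```shell
#!/bin/sh
# Hypothetical helper: print offload flags that are still enabled, one per line.
# Reads `ifconfig` output on stdin and inspects only the "options=" line.
enabled_offloads() {
    awk '
        /^[ \t]*options=/ {
            # Take the comma-separated flag list between "<" and ">".
            if (match($0, /<[^>]*>/)) {
                n = split(substr($0, RSTART + 1, RLENGTH - 2), flags, ",")
                for (i = 1; i <= n; i++)
                    if (flags[i] ~ /^(RXCSUM|TXCSUM|RXCSUM_IPV6|TXCSUM_IPV6|TSO4|TSO6|LRO)$/)
                        print flags[i]
            }
        }
    '
}

# Example use:
#   ifconfig -vm xn0 | enabled_offloads
```

No output means none of the listed flags appear in the options line. Anything printed is an off-load feature that is still active on that interface.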