pfSense crashes at random times and won't fully reboot

shaddow

Hello All

I have a problem that may already have been answered somewhere else; if so, please point me in the right direction.

One of the pfSense boxes crashes at random times, and when it reboots it does not finish booting.

Everything appears to be up and running, but pfBlockerNG will not run an update because it says the system is still in boot mode.
Also, OpenVPN will not allow users to connect; only peer-to-peer works.

It doesn't get to the command line on the console, but if I press Ctrl-C it drops to sh.

Of note, the unit runs on XCP-ng. It worked great for a long time, but these days it keeps hitting this problem. The last run lasted 20 days before it happened, and it only affects this virtual machine, not the others.

The XCP-ng host still has plenty of unused resources.

I have attached the dump file textdump.tar

When it does reboot, I see output up to the xenguest error, but after that nothing except "config_aqm".
I now keep a working image to revert to; this is the only quick way to get the box back up and running for now.

But I hope someone can help

Shane.

stephenw10

The important parts of that crash are the backtrace:

db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100122 td 0xfffff80008304000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00020cf240
vpanic() at vpanic+0x197/frame 0xfffffe00020cf290
panic() at panic+0x43/frame 0xfffffe00020cf2f0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe00020cf350
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00020cf3a0
trap() at trap+0x286/frame 0xfffffe00020cf4b0
calltrap() at calltrap+0x8/frame 0xfffffe00020cf4b0
--- trap 0xc, rip = 0xffffffff8109c7fa, rsp = 0xfffffe00020cf580, rbp = 0xfffffe00020cf5f0 ---
pf_test_state_udp() at pf_test_state_udp+0x2ba/frame 0xfffffe00020cf5f0
pf_test() at pf_test+0x1db8/frame 0xfffffe00020cf830
pf_check_in() at pf_check_in+0x1d/frame 0xfffffe00020cf850
pfil_run_hooks() at pfil_run_hooks+0xa1/frame 0xfffffe00020cf8f0
ip_tryforward() at ip_tryforward+0x193/frame 0xfffffe00020cf970
ip_input() at ip_input+0x3fe/frame 0xfffffe00020cfa20
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe00020cfa70
ether_demux() at ether_demux+0x16a/frame 0xfffffe00020cfaa0
ether_nh_input() at ether_nh_input+0x330/frame 0xfffffe00020cfb00
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe00020cfb50
ether_input() at ether_input+0x4b/frame 0xfffffe00020cfb80
xn_rxeof() at xn_rxeof+0x55d/frame 0xfffffe00020cfc50
xn_intr() at xn_intr+0x58/frame 0xfffffe00020cfc90
ithread_loop() at ithread_loop+0x23c/frame 0xfffffe00020cfcf0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00020cfd30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00020cfd30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

And the panic:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff8109c7fa
stack pointer	        = 0x0:0xfffffe00020cf580
frame pointer	        = 0x0:0xfffffe00020cf5f0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 12 (irq2343: xn1)
trap number		= 12
panic: page fault
cpuid = 0
time = 1656263327
KDB: enter: panic
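
The key fields in that panic can be pulled out mechanically. Below is a minimal Python sketch, assuming the panic text has been copied out of the dump as plain `key = value` lines; the `parse_panic` helper is a hypothetical name for illustration, not part of pfSense or FreeBSD. A fault virtual address of 0x0 usually points to a NULL pointer dereference in the kernel.

```python
import re
from datetime import datetime, timezone

# Abbreviated panic text copied from the dump quoted above.
PANIC_TEXT = """\
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
current process         = 12 (irq2343: xn1)
time = 1656263327
"""

def parse_panic(text):
    """Return a dict of the 'key = value' fields in a panic message."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^([^=]+?)\s*=\s*(.+)$", line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields

fields = parse_panic(PANIC_TEXT)
fault_address = int(fields["fault virtual address"], 16)
# The 'time' field is seconds since the Unix epoch.
crash_time = datetime.fromtimestamp(int(fields["time"]), tz=timezone.utc)

print("Fault at address zero:", fault_address == 0)
print("Faulting thread:", fields["current process"])
print("Crash time (UTC):", crash_time.isoformat())
```

The faulting thread here is the interrupt thread for xn1, which matches the xn_intr() and xn_rxeof() frames in the backtrace.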

But also we can see the logs show a lot of:

config_aqm Unable to configure flowset, flowset busy!
config_aqm Unable to configure flowset, flowset busy!

I assume you're running Limiters. How are they configured?

It looks like you're running 2.5.2 there. Is there any specific reason for that?

If it is some bug you're hitting it will not be fixed in 2.5.2. You should upgrade to 2.6 or 22.05.

Steve

shaddow

Sorry for the slow reply

I am running 2.5 because for some reason 2.6 would not install over it, but I am looking at upgrading soon.
The config_aqm messages are from the limiters in use at the time.
What I found on my side was that when the page fault happened, it did not fully reboot. I removed a few functions such as SNMP and it became stable for a while, but in the last two days it has happened again. At least it is booting fully now.

I have added the dumps.

11Aug
textdump.tar.0
info.0

12Aug
textdump.tar.0
info.0

stephenw10

Hmm, well the good thing there is that those are all identical backtraces so it almost certainly is a software bug of some sort.
The bad news is that it looks to be in the xn(4) driver which is a lot less common than it once was.
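
For anyone wanting to check several dumps the same way, here is a rough Python sketch that reduces a ddb backtrace to its chain of function names and compares two of them. It assumes the `bt` output has already been copied out of each textdump (in a FreeBSD textdump it normally lives in ddb.txt); `call_chain` and `same_crash` are hypothetical helper names, and the second backtrace below is made up by changing the frame addresses of the first, since the later dumps are not quoted in this thread.

```python
import re

# Matches ddb backtrace lines such as
# "pf_test() at pf_test+0x1db8/frame 0xfffffe00020cf830".
FRAME_RE = re.compile(r"^(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+$")

def call_chain(backtrace):
    """Return the function names from a ddb 'bt' listing, top frame first."""
    chain = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            chain.append(match.group(1))
    return chain

def same_crash(bt_a, bt_b):
    """Treat two crashes as the same if their function chains are identical."""
    return call_chain(bt_a) == call_chain(bt_b)

# Excerpt from the first dump in this thread.
FIRST = """\
pf_test_state_udp() at pf_test_state_udp+0x2ba/frame 0xfffffe00020cf5f0
pf_test() at pf_test+0x1db8/frame 0xfffffe00020cf830
xn_rxeof() at xn_rxeof+0x55d/frame 0xfffffe00020cfc50
"""

# Hypothetical second dump: same functions, different stack addresses.
SECOND = FIRST.replace("0xfffffe00020cf", "0xfffffe00021aa")

print(call_chain(FIRST))
print("Same crash:", same_crash(FIRST, SECOND))
```

Frame addresses are ignored on purpose because they differ from boot to boot; only the sequence of functions is compared.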

However, since it also appears to be in UDP, the first thing to do here is make sure you disable all hardware off-loading in System > Advanced > Networking. There are known bugs there.

Steve

shaddow

Steve

First up, thanks for the help.

On the settings, what is checked is:
Disable hardware checksum offload
Disable hardware TCP segmentation offload
Disable hardware large receive offload

Also ticked is "Enable the ALTQ support for hn NICs".

I could also change the NIC from Intel e1000 to Realtek RTL8139, since this is on XCP-ng, but I'm not sure whether that would work. What do you think?

I am hoping to do a full install of 2.6 in a month or so, but this is a production machine, so I have to do it during a quiet time in the office.

Shane

stephenw10

@shaddow said in pfSense crashes at random times and won't fully reboot:

Enable the ALTQ support for hn NICs

That only does anything for hn(4) NICs so Hyper-V or Azure. It doesn't matter here.

pfSense only sees the Xen NIC so changing it from Intel to Realtek would only make any difference if you enabled hardware pass through.

Check the output of: ifconfig -vm xn0
Make sure the hardware off-loading options are actually disabled.
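
If it helps, that check can be scripted. This is a minimal Python sketch, assuming the ifconfig output is pasted in as text and that the `options=` line uses the usual FreeBSD `options=<hex><FLAG,FLAG,...>` form. The `enabled_offloads` helper is a hypothetical name, and both sample outputs are illustrative, not taken from the box in this thread.

```python
import re

# Capability flags that correspond to the three "disable hardware ..." checkboxes.
OFFLOAD_FLAGS = {
    "RXCSUM", "TXCSUM", "RXCSUM_IPV6", "TXCSUM_IPV6",  # checksum offload
    "TSO4", "TSO6",                                    # TCP segmentation offload
    "LRO",                                             # large receive offload
}

def enabled_offloads(ifconfig_output):
    """Return the offload flags still listed on the 'options=' line."""
    match = re.search(r"^[ \t]*options=[0-9a-f]+<([^>]*)>",
                      ifconfig_output, re.MULTILINE)
    if not match:
        # No options line: nothing is reported as enabled.
        return set()
    flags = {flag for flag in match.group(1).split(",") if flag}
    return flags & OFFLOAD_FLAGS

# Illustrative output with off-loading still active (assumed, not from this box).
STILL_ON = """\
xn0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=503<RXCSUM,TXCSUM,TSO4,LRO>
        capabilities=503<RXCSUM,TXCSUM,TSO4,LRO>
"""

# Illustrative output with everything disabled (assumed, not from this box).
ALL_OFF = """\
xn0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        capabilities=503<RXCSUM,TXCSUM,TSO4,LRO>
"""

print(sorted(enabled_offloads(STILL_ON)))
print(sorted(enabled_offloads(ALL_OFF)))
```

The `capabilities=` line only lists what the NIC supports, so the sketch looks at `options=` alone; an empty result means none of the off-loading features are reported as active.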

Steve