page fault kernel panics after 2.5.2 upgrade
-
Hello,
I'm running pfSense on a dedicated bare metal box, and after upgrading from 2.5.1 to 2.5.2 recently I started encountering page fault kernel panics, each requiring a reboot.
It's not clear to me what the correct place or procedure is for posting crash dumps, so I'm posting in General in hopes that others can advise me. I've done some searching around the web and this forum, but didn't find anyone else with the same issue or a clear path for diagnosis.
So far the reboots are intermittent: I've had three, and I captured dumps from the GUI for two of them (attached).
Example from crash log:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0xffffd808141bc640
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff8137093e
stack pointer = 0x0:0xfffffe009c94b830
frame pointer = 0x0:0xfffffe009c94b900
code segment = base 0x0, limit 0xfffff, type 0x1b
 = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 43519 (sh)
trap number = 12
panic: page fault
cpuid = 0
time = 1632463263
KDB: enter: panic
This box was rock solid before the last upgrade, so I'm a little hesitant to think it's a hardware problem, but I'm open to suggestions.
-
The backtraces on those are nearly identical and not very helpful:
db:0:kdb.enter.default> bt
Tracing pid 31667 tid 100251 td 0xfffff8015da47000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe009c9644f0
vpanic() at vpanic+0x197/frame 0xfffffe009c964540
panic() at panic+0x43/frame 0xfffffe009c9645a0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe009c964600
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe009c964650
trap() at trap+0x286/frame 0xfffffe009c964760
calltrap() at calltrap+0x8/frame 0xfffffe009c964760
--- trap 0xc, rip = 0xffffffff8137093e, rsp = 0xfffffe009c964830, rbp = 0xfffffe009c964900 ---
pmap_enter() at pmap_enter+0x96e/frame 0xfffffe009c964900
vm_fault() at vm_fault+0x1aa5/frame 0xfffffe009c964a50
vm_fault_trap() at vm_fault_trap+0x60/frame 0xfffffe009c964a90
trap_pfault() at trap_pfault+0x19c/frame 0xfffffe009c964ae0
trap() at trap+0x410/frame 0xfffffe009c964bf0
calltrap() at calltrap+0x8/frame 0xfffffe009c964bf0
--- trap 0xc, rip = 0x80028681b, rsp = 0x7fffffffe9a0, rbp = 0x7fffffffe9a0 ---
db:0:kdb.enter.default> ps
The first crash appears to be from 2.5.1, the second from 2.5.2, and they're nearly identical, so I don't think it has anything to do with the upgrade.
The first thing I would do here is disable hardware features you don't need, like the sound card and FireWire. And the Atheros NIC? It looks like that is unused (down).
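If there's no BIOS option to turn those off, loader hints can usually do it from the OS side. A rough sketch for /boot/loader.conf.local; the driver names below are guesses on my part, so match them against what pciconf -lv actually shows on your hardware before using them:

# Sketch only: disable drivers for unused devices via device hints.
# Verify the driver/unit names with 'pciconf -lv' first.
hint.hdac.0.disabled="1"      # onboard HDA sound
hint.fwohci.0.disabled="1"    # FireWire controller
hint.ath.0.disabled="1"       # Atheros wireless; a wired Atheros NIC uses a different driver

Reboot after editing for the hints to take effect.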
Steve
-
So what would be more useful for debugging the source of the problem?
The documentation says that page faults are usually a kernel issue (since the system isn't going completely unresponsive and is still successfully saving a dump, etc.), and thus not a hardware issue.
Are you saying it could still be a hardware issue?
-
It could be. But comparing a number of backtraces should make it clear whether it is or not.
-
Okay, thanks. I'll start digging into disabling the hardware you mentioned. The Atheros NIC is a management port and is usually disconnected.
I've attached my latest dump file for posterity.
-
Mmm, that's even less helpful unfortunately:
db:0:kdb.enter.default> bt
Tracing pid 98065 tid 100299 td 0xfffff800b9c90000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe009e840980
vpanic() at vpanic+0x197/frame 0xfffffe009e8409d0
panic() at panic+0x43/frame 0xfffffe009e840a30
trap_fatal() at trap_fatal+0x391/frame 0xfffffe009e840a90
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe009e840ae0
trap() at trap+0x410/frame 0xfffffe009e840bf0
calltrap() at calltrap+0x8/frame 0xfffffe009e840bf0
--- trap 0xc, rip = 0x8002867c0, rsp = 0x7fffffffe9a0, rbp = 0x7fffffffe9a0 ---
db:0:kdb.enter.default> ps
And nothing significant in the msg buffer either.
-
Adding another crash dump for posterity; still investigating.
-
I did some general research into how to debug this, and one of the things I found suggested having the kernel symbols present (in /boot/kernel/) as well as the kernel sources (in /usr/src), neither of which appears to be present in my pfSense install.
So I'm trying to figure out how to address that, to see if I can improve my ability to debug this or at least produce a more useful backtrace from these dumps.
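For anyone following along, this is roughly what I'm hoping to be able to do once symbols are available. A sketch only, assuming the dump ends up in the default /var/crash location and that a matching kernel with symbols plus the gdb/kgdb package are installed (neither seems to ship with stock pfSense):

ls /var/crash/                                 # look for vmcore.N and info.N
kgdb /boot/kernel/kernel /var/crash/vmcore.0   # open the dump against the matching kernel
# then at the (kgdb) prompt:
#   bt                          # backtrace with symbol names
#   list *0xffffffff8137093e    # map the faulting instruction pointer to a source line

If that works, the instruction pointer from the panic should at least resolve to a function and line.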
-
Well they do all look very similar at least. That implies it's probably not a hardware issue.
One thing you could try here is loading the debug kernel:
https://files00.netgate.com/packages/pfSense_v2_5_2_amd64-core/All/pfSense-kernel-debug-pfSense-2.5.2.r.20210613.1712.txz
But be aware that almost no one is running that kernel. You may well see other issues. I would not recommend running it on a production firewall.
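Installing it from a shell would look roughly like this; I'm sketching from memory, so treat it as an outline and double-check the exact steps against the docs before touching a production box:

fetch https://files00.netgate.com/packages/pfSense_v2_5_2_amd64-core/All/pfSense-kernel-debug-pfSense-2.5.2.r.20210613.1712.txz
pkg add ./pfSense-kernel-debug-pfSense-2.5.2.r.20210613.1712.txz   # install the local kernel package; may need forcing past the stock kernel package
reboot                                                             # come back up on the debug kernel

Switching back means reinstalling the standard kernel package and rebooting again.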
Steve
-
Well I haven't had any more crashes for the past few days...
What did I do?
I unplugged the keyboard/mouse and the monitor cable. I suspect one of those peripherals was causing an occasional hiccup. I don't have a KVM, so I share a single keyboard and mouse between two servers and manually swap them around on occasion.
My theory is that I usually left them connected to the other server, but at some point left them connected to the router box instead, and perhaps that was leading to crashes over time due to instability with the peripherals or the video driver.
Anyway, hopefully I don't post in this thread again, which would mean that was the problem and I solved it. If it wasn't, I'll post back.
-
Hmm, that would be odd, but it would be one of those troubleshooting cases where the cause turns out to be something seemingly unrelated. Leaky microwave, vacuum cleaner plugged into the UPS, etc.
-
Welp, it crashed again; I guess it was wishful thinking after all. I just got lucky with a few days without crashes.
-
Same backtrace?
-
@stephenw10 Yeah, probably. Attached:
db:0:kdb.enter.default> bt
Tracing pid 63867 tid 100290 td 0xfffff8009646f740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe009e822980
vpanic() at vpanic+0x197/frame 0xfffffe009e8229d0
panic() at panic+0x43/frame 0xfffffe009e822a30
trap_fatal() at trap_fatal+0x391/frame 0xfffffe009e822a90
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe009e822ae0
trap() at trap+0x410/frame 0xfffffe009e822bf0
calltrap() at calltrap+0x8/frame 0xfffffe009e822bf0
--- trap 0xc, rip = 0x8002867c0, rsp = 0x7fffffffe9a0, rbp = 0x7fffffffe9a0 ---
-
Mmm, still nothing leading up to the trap, and nothing shown on the console.
Hard to say what that might be with nothing to go on, really.
-
I may have solved the issue, although I'm probably tempting fate by claiming it so soon.
The issue persisted for some time. At first it was very periodic, roughly three days between panics, which is why I wasn't completely sold on a hardware problem yet.
I tried seeing whether rebooting "ahead of schedule" would buy me another three days (counting from the last clean restart), but it still panicked only a day later.
Eventually the panics started coming sooner than every three days on their own.
Last night it started restarting every few minutes, and then suddenly it was restarting before it could even finish booting.
Aha!
Classic symptoms of a power supply issue...
I replaced the PSU (circa 2004) and it's been online ever since. I'll check back in a week, and if it still hasn't panicked, I'll call that the issue.
-
I just can't win...
It rebooted last night. It wasn't the power supply.
-
I'm about to hit 7 days of uptime, so I think I finally found the issue.
I started pulling memory sticks out one by one and waiting for it to restart.
I suspect I have at least one bad stick of ram.
Posting this for posterity for anyone else who runs into this type of issue.
-
@doubledgedboard For future readers: it's always a good idea to do a thorough RAM test.
FWIW, the folks at memtest86.com have recently made major updates to their (free) RAM tester.
I recently had a situation where RAM passed a few-years-old version of memtest, but the latest version immediately detected it as bad.
I strongly encourage everybody to grab a current version :)
-
@mrpete Oh, for sure, I've been using memtest and its variants for years.
The issue here is that the system requires near-24/7 uptime and I couldn't take it down to run 8+ hour memory tests, so I had to do what I could while staying online.
(And for posterity, I'm back up to 88 days of uptime now.)