Random crashes "Fatal trap 12: page fault while in kernel mode"
-
Are they able to log the console output and trigger the issue? That should get the backtrace unless it's a hard lock.
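If there's a spare machine nearby with a USB serial adapter, a minimal sketch along these lines (assuming /dev/ttyUSB0 and the usual 115200 baud console speed; pyserial is a third-party package) can be left running to capture whatever the console prints, including the panic and backtrace:

```python
import time
import serial  # pyserial: pip install pyserial

PORT = "/dev/ttyUSB0"   # assumed device node of the USB serial adapter
BAUD = 115200           # assumed console speed (typical for these appliances)

# Append everything the appliance prints on its serial console to a file,
# so the panic message and backtrace survive even though the box reboots
# before anything reaches its own disk logs.
with serial.Serial(PORT, BAUD, timeout=1) as console, \
        open("console.log", "a", buffering=1) as log:
    while True:
        line = console.readline()
        if line:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {line.decode(errors='replace')}")
```

Any terminal program that can log to a file (screen, minicom, PuTTY and so on) does the same job.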
The issue is probably that PPPoE server uses netgraph and that interfaces with the hardware/drivers at a very low level. If they can easily reproduce the issue then testing it on ix would be a very good test to prove it out.
Steve
-
It seems it's not easily triggerable. The box rebooted again this morning around 6:30am with a trap. The log shows the same pattern: the trap occurs after a few user logins to the PPPoE server, while the server is setting/changing interface names like 'poesX-Y', and directly afterwards the box panics and reboots.
There's no core dump and no other trace information; the system log is all we have. I now have SSH access to the box but can't find anything else.
The customer also swapped the NICs around: LAN is now running on ix2 and WAN2 moved to igc3 instead. So the same thing is happening on the ix-style interface with another driver.
...
May 4 06:28:46 kernel frame pointer = 0x28:0xfffffe003fb647c0
May 4 06:28:46 kernel stack pointer = 0x28:0xfffffe003fb64780
May 4 06:28:46 kernel instruction pointer = 0x20:0xffffffff8109e252
May 4 06:28:46 kernel fault code = supervisor read data, page not present
May 4 06:28:46 kernel fault virtual address = 0x18
May 4 06:28:46 kernel cpuid = 0; apic id = 04
May 4 06:28:46 kernel Fatal trap 12: page fault while in kernel mode
May 4 06:28:46 kernel ng24: changing name to 'poes1-24'
May 4 06:28:46 radiusd 6763 (43) Login OK: [yyy/xxx] (from client Self port 21 cli 3481c43dad07)
May 4 06:28:44 kernel ng23: changing name to 'poes1-23'
May 4 06:28:44 radiusd 6763 (42) Login OK: [yyy/xxx] (from client Self port 25 cli e8df70915c6e)
May 4 06:28:44 kernel ng22: changing name to 'poes1-22'
May 4 06:28:44 radiusd 6763 (41) Login OK: [yyy/xxx] (from client Self port 24 cli 7cff4dc592d7)
May 4 06:28:43 kernel ng21: changing name to 'poes1-21'
May 4 06:28:43 radiusd 6763 (40) Login OK: [yyy/xxx] (from client Self port 20 cli 2465118bbfb7)
May 4 06:28:43 kernel ng20: changing name to 'poes1-20'
May 4 06:28:43 radiusd 6763 (39) Login OK: [yyy/xxx] (from client Self port 19 cli e0286db734d7)
May 4 06:28:43 kernel ng19: changing name to 'poes1-19'
May 4 06:28:43 radiusd 6763 (38) Login OK: [yyy/xxx] (from client Self port 18 cli 989bcb93de7c)
May 4 06:28:43 kernel ng18: changing name to 'poes1-18'
May 4 06:28:43 radiusd 6763 (37) Login OK: [yyy/xxx] (from client Self port 17 cli 2c91ab0247d8)
The same thing happened directly after reconfiguring the interface, as all users came back and logged in immediately. Somehow it seems that too many reconfigurations at once could trigger it? Or maybe some kind of race condition?
May 3 20:19:19 kernel ---<<BOOT>>---
May 3 20:19:19 syslogd kernel boot file is /boot/kernel/kernel
May 3 20:17:38 kernel processor eflags = interrupt enabled, resume,
May 3 20:17:38 kernel = DPL 0, pres 1, long 1, def32 0, gran 1
May 3 20:17:38 kernel code segment = base 0x0, limit 0xfffff, type 0x1b
May 3 20:17:38 kernel frame pointer = 0x28:0xfffffe003fb647c0
May 3 20:17:38 kernel stack pointer = 0x28:0xfffffe003fb64780
May 3 20:17:38 kernel instruction pointer = 0x20:0xffffffff8109e252
May 3 20:17:38 kernel fault code = supervisor read data, page not present
May 3 20:17:38 kernel fault virtual address = 0x18
May 3 20:17:38 kernel cpuid = 0; apic id = 04
May 3 20:17:38 kernel Fatal trap 12: page fault while in kernel mode
May 3 20:17:38 kernel ng20: changing name to 'poes1-20'
May 3 20:17:38 kernel ng19: changing name to 'poes1-19'
May 3 20:17:37 kernel ng18: changing name to 'poes1-18'
May 3 20:17:37 kernel ng17: changing name to 'poes1-17'
May 3 20:17:37 kernel ng16: changing name to 'poes1-16'
May 3 20:17:37 kernel ng15: changing name to 'poes1-15'
May 3 20:17:37 kernel ng14: changing name to 'poes1-14'
May 3 20:17:37 kernel ng12: changing name to 'poes1-12'
May 3 20:17:37 kernel ng13: changing name to 'poes1-13'
May 3 20:17:36 kernel ng11: changing name to 'poes1-11'
May 3 20:17:36 kernel ng10: changing name to 'poes1-10'
May 3 20:17:36 kernel ng9: changing name to 'poes1-9'
May 3 20:17:36 kernel ng8: changing name to 'poes1-8'
May 3 20:17:36 kernel ng7: changing name to 'poes1-7'
May 3 20:17:36 kernel ng6: changing name to 'poes1-6'
May 3 20:17:36 kernel ng5: changing name to 'poes1-5'
May 3 20:17:36 kernel ng4: changing name to 'poes1-4'
May 3 20:17:36 kernel ng3: changing name to 'poes1-3'
May 3 20:17:35 kernel ng2: changing name to 'poes1-2'
May 3 20:17:34 kernel ng1: changing name to 'poes1-1'
May 3 20:17:30 radiusd 80534 (25) Login OK: [yyy/xxx] (from client Self port 26 cli 3810d5aeced3)
May 3 20:17:30 radiusd 80534 (24) Login OK: [yyy/xxx] (from client Self port 24 cli e8df70915c6e)
May 3 20:17:30 radiusd 80534 (23) Login OK: [yyy/xxx] (from client Self port 21 cli ccce1efed6a4)
May 3 20:17:30 radiusd 80534 (22) Login OK: [yyy/xxx] (from client Self port 22 cli 3810d54da23c)
May 3 20:17:30 radiusd 80534 (21) Login OK: [yyy/xxx] (from client Self port 20 cli 2c91abbba0bf)
May 3 20:17:30 radiusd 80534 (20) Login OK: [yyy/xxx] (from client Self port 19 cli 989bcbae8132)
May 3 20:17:30 radiusd 80534 (19) Login OK: [yyy/xxx] (from client Self port 18 cli 7cff4dc592d7)
May 3 20:17:30 radiusd 80534 (18) Login OK: [yyy/xxx] (from client Self port 16 cli 1ced6f75b2df)
May 3 20:17:29 radiusd 80534 (17) Login OK: [yyy/xxx] (from client Self port 14 cli 989bcb5de4e1)
May 3 20:17:29 radiusd 80534 (16) Login OK: [yyy/xxx] (from client Self port 13 cli dc396f4c37a1)
May 3 20:17:29 radiusd 80534 (15) Login OK: [yyy/xxx] (from client Self port 12 cli 2c91abbfc809)
May 3 20:17:29 radiusd 80534 (14) Login OK: [yyy/xxx] (from client Self port 11 cli 989bcb73d0f7)
May 3 20:17:29 radiusd 80534 (13) Login OK: [yyy/xxx] (from client Self port 10 cli 989bcb93de7c)
May 3 20:17:29 radiusd 80534 (12) Login OK: [yyy/xxx] (from client Self port 9 cli dc15c84b2c26)
May 3 20:17:29 radiusd 80534 (11) Login OK: [yyy/xxx] (from client Self port 8 cli 5c49797ab9e9)
May 3 20:17:28 radiusd 80534 (10) Login OK: [yyy/xxx] (from client Self port 25 cli 5c49797901da)
May 3 20:17:28 radiusd 80534 (9) Login OK: [yyy/xxx] (from client Self port 6 cli 5c497978c2e7)
May 3 20:17:28 radiusd 80534 (8) Login OK: [yyy/xxx] (from client Self port 23 cli dc15c84803c2)
May 3 20:17:28 radiusd 80534 (7) Login OK: [yyy/xxx] (from client Self port 5 cli e0286db734d7)
May 3 20:17:28 radiusd 80534 (6) Login OK: [yyy/xxx] (from client Self port 4 cli 989bcb1a67bb)
May 3 20:17:28 radiusd 80534 (5) Login OK: [yyy/xxx] (from client Self port 2 cli 2c91ab0247d8)
May 3 20:17:28 radiusd 80534 (4) Login OK: [yyy/xxx] (from client Self port 1 cli e0286db33fb5)
May 3 20:17:28 radiusd 80534 (3) Login OK: [yyy/xxx] (from client Self port 17 cli dc396f229ccd)
May 3 20:17:28 radiusd 80534 (2) Login OK: [yyy/xxx] (from client Self port 15 cli ccce1e56e575)
May 3 20:17:27 radiusd 80534 (1) Login OK: [yyy/xxx] (from client Self port 7 cli 3481c43dad07)
May 3 20:17:26 radiusd 80534 (0) Login OK: [yyy/xxx] (from client Self port 3 cli 2c3afddd57fb)
-
Hmm, might need a console log to know more.
Reviewing this, I did actually test a PPPoE server on igc when we were testing the 6100 internally. I never saw an issue, but I only tested with a few clients, so it could well be a condition only created by multiple simultaneous connections.
Steve
-
@stephenw10 said in Random crashes "Fatal trap 12: page fault while in kernel mode":
Hmm, might need a console log to know more.
Tell me what you need and I'll try to get it :) I'm still on the "potential race condition or bug" line of thinking, but I'll be happy to shed some light on this either way. We checked for a core dump and, as I thought, none is written, so that's a no I'm afraid.
It also strikes me as odd that the mass reconnection to the new PPPoE server interface (ix) immediately triggered a trap. That's why I was thinking race condition, with multiple reconnects happening at once, but that's hard to bring about and it annoys the users, so it's nothing one could "easily reproduce" in production. Still wondering why it doesn't happen more frequently but only seemingly at random.
Cheers
\jens
-
Mmm, if neither of those devices has an SSD then there is probably no swap partition configured, and that's required for a crash dump.
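If it helps to confirm that, here's a minimal sketch for checking whether the kernel has any dump device configured (assuming the FreeBSD sysctl kern.shutdown.dumpdevname; run it on the firewall itself):

```python
import subprocess

# Ask the kernel which device, if any, would receive a crash dump.
# An empty value matches the "no core dump is ever written" symptom:
# with no swap slice there is nowhere for the dump to go.
def dump_device() -> str:
    out = subprocess.run(
        ["sysctl", "-n", "kern.shutdown.dumpdevname"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

if __name__ == "__main__":
    dev = dump_device()
    if dev:
        print(f"dump device: {dev}")
    else:
        print("no dump device configured; panics will not leave a core dump")
```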
The panic and backtrace will still be shown on the console though, so you can get it by hooking the console up to something and logging that output until it panics.
I agree it seems likely to be a race condition caused by multiple simultaneous client connections. Unclear what it's in yet, though.
I reinstated my PPPoE server on a 6100 and used a 4100 as the client, both via igc NICs. So far it's been solid, but I'm unlikely to be able to trigger it.
Steve
-
@stephenw10 Coming back to this as the problem persists; the customer had time to do what TAC support told me.
He reinstalled from scratch the 4100 that was originally supposed to be in place there, imported the config, and switched back from the 6100 to the 4100 on site. Basically the same thing as above happened again after the switch, when multiple PPPoE requests came in. Now, a few days and a few reboots later (sadly the reinstall did nothing to fix the problem, as assumed), we finally have a core dump to share:
So at least the reinstall fixed the "no core dump" situation.
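For anyone following along, a minimal sketch (assuming the stock FreeBSD /var/crash layout and the info.N summaries that savecore writes) for listing what the box saved and printing the panic summaries:

```python
from pathlib import Path

# List everything savecore left behind and print the info.N summaries,
# which contain the panic string and dump metadata. The textdump/vmcore
# files sitting next to them are what actually gets attached to a report.
CRASH_DIR = Path("/var/crash")

for entry in sorted(CRASH_DIR.iterdir()):
    print(entry.name)

for info in sorted(CRASH_DIR.glob("info.*")):
    print(f"\n=== {info.name} ===")
    print(info.read_text(errors="replace"))
```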
-
Ok, well it's not familiar to me but it's an unusual setup. I'll see if anyone else here has seen it.
-
One of our developers is looking at this. I have opened a bug report for it:
https://redmine.pfsense.org/issues/13210
Steve
-
@stephenw10 Thanks Steve, will track it there
-
This fix is in 2.7 snapshots now, are you able to test that?
Steve
-
@stephenw10 We're already in touch with both Netgate TAC and the customer to test the fix. We updated the location in question yesterday evening. Aside from a problem after the upgrade to 22.05-RC where the device didn't cleanly restart/boot after upgrading (we've had that same problem three times already), it rebooted and upgraded fine after a quick "power-loss restart" procedure ;) As far as I can say, we've had ~10h with no lockup, freeze, or panic so far.
We're not really in the clear yet, as we've had some long "looking good" windows during testing too, but it seems promising!
knocks on wood
-
To add it here: the customer has updated to a newer RC snapshot, as the earlier one got him a few report emails from the box about occasional packet loss (not that often) and he wanted to check whether that would be fixed too.
On the latest RC snapshot, nothing to report so far: no crash dump, no freeze, no panic. The packet loss also seems to be gone :) So we're happy on both fronts for now, which I'm glad to report.
Great job everyone involved. Shoutout to TAC support for their help and for staying on the topic, too!
Cheers
\jens