Random crashes "Fatal trap 12: page fault while in kernel mode"
-
Hi,
HW: SG-4100/SG-6100 (currently 6100)
SW: pfSense 22.01 latest
A customer of ours has set up a new configuration and tested it on a brand new SG-4100 and -6100. Currently it's running on the 6100, but the result is the same: he gets random crashes and reboots with "Fatal trap 12" every so often, sometimes as fast as minutes apart, sometimes it runs for a few hours before crashing again.
A very similar config had been running on an older non-Netgate box for months, still on 2.4.5, but after talking to him about updating he went further, bought a new box and re-created the config from scratch so that no old config leftovers would be carried over.
He is running a PPPoE Server(!) on an internal IGC interface with FreeRadius (for user credentials) and has a few simple bandwidth limiters set up, which he uses on the rules on the PPPoE Server so his users dialing in get the correct bandwidth limits (he provides small-scale internet access via [big] WiFi antennas in regions that don't have cable, DSL or other alternatives).
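For context, the GUI Limiters sit on top of FreeBSD's dummynet, so what he has configured boils down to something roughly like the following from a shell (a sketch only; the pipe numbers and bandwidths below are made-up examples, not his actual values):

# Hypothetical example of the kind of dummynet pipes the Limiters create;
# the numbers are illustrative, not the customer's real settings.
dnctl pipe 1 config bw 10Mbit/s   # per-user download limit
dnctl pipe 2 config bw 2Mbit/s    # per-user upload limit
dnctl pipe show                   # list the pipes pfSense actually created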
Every time the box crashed, the log showed the following:
May 3 08:22:06 Kyllburg sshguard[30940]: Now monitoring attacks.
May 3 08:43:24 Kyllburg radiusd[18033]: (37) Login OK: [xxx/yyy] (from client Self port 1 cli ccce1efed6a4)
May 3 08:43:24 Kyllburg kernel: ng1: changing name to 'poes1-1'
May 3 08:43:24 Kyllburg kernel:
May 3 08:43:24 Kyllburg kernel:
May 3 08:43:24 Kyllburg kernel: Fatal trap 12: page fault while in kernel mode
May 3 08:43:24 Kyllburg kernel: cpuid = 0; apic id = 04
May 3 08:43:24 Kyllburg kernel: fault virtual address = 0x18
May 3 08:43:24 Kyllburg kernel: fault code = supervisor read data, page not present
May 3 08:43:24 Kyllburg kernel: instruction pointer = 0x20:0xffffffff8109e252
May 3 08:43:24 Kyllburg kernel: stack pointer = 0x28:0xfffffe003fb64780
May 3 08:43:24 Kyllburg kernel: frame pointer = 0x28:0xfffffe003fb647c0
May 3 08:43:24 Kyllburg kernel: code segment = base 0x0, limit 0xfffff, type 0x1b
May 3 08:43:24 Kyllburg kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
May 3 08:43:24 Kyllburg kernel: processor eflags =
May 3 08:43:24 Kyllburg kernel: interrupt enabled, resume, IOPL = 0
May 3 08:43:24 Kyllburg kernel: current process = 0 (if_io_tqg_0)
May 3 08:45:04 Kyllburg syslogd: kernel boot file is /boot/kernel/kernel
--
May 3 09:54:40 Kyllburg php-fpm[35472]: /rc.newwanip: [siproxd] Package is installed but not started. Not installing firewall rules.
May 3 09:54:40 Kyllburg php-fpm[35472]: /rc.newwanip: [siproxd] Package is installed but not started. Not installing firewall rules.
May 3 09:54:40 Kyllburg kernel: ng6: changing name to 'poes1-6'
May 3 09:54:40 Kyllburg kernel:
May 3 09:54:40 Kyllburg kernel:
May 3 09:54:40 Kyllburg kernel: Fatal trap 12: page fault while in kernel mode
May 3 09:54:40 Kyllburg kernel: cpuid = 0; apic id = 04
May 3 09:54:40 Kyllburg kernel: fault virtual address = 0x18
May 3 09:54:40 Kyllburg kernel: fault code = supervisor read data, page not present
May 3 09:54:40 Kyllburg kernel: instruction pointer = 0x20:0xffffffff8109e252
May 3 09:54:40 Kyllburg kernel: stack pointer = 0x28:0xfffffe003fb64780
May 3 09:54:40 Kyllburg kernel: frame pointer = 0x28:0xfffffe003fb647c0
May 3 09:54:40 Kyllburg kernel: code segment = base 0x0, limit 0xfffff, type 0x1b
May 3 09:54:40 Kyllburg kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
May 3 09:54:40 Kyllburg kernel: processor eflags = interrupt enabled, resume,
May 3 09:56:22 Kyllburg syslogd: kernel boot file is /boot/kernel/kernel
May 3 09:56:22 Kyllburg kernel: ---<<BOOT>>---
May 3 09:56:22 Kyllburg kernel: Copyright (c) 1992-2021 The FreeBSD Project.
--
May 3 10:22:42 Kyllburg kernel: ng4: changing name to 'poes1-4'
May 3 10:23:08 Kyllburg radiusd[42472]: (21) Login OK: [xxx/yyy] (from client Self port 5 cli ccce1efed6a4)
May 3 10:23:08 Kyllburg kernel: ng5: changing name to 'poes1-5'
May 3 10:23:08 Kyllburg kernel:
May 3 10:23:08 Kyllburg kernel:
May 3 10:23:08 Kyllburg kernel: Fatal trap 12: page fault while in kernel mode
May 3 10:23:08 Kyllburg kernel: cpuid = 0; apic id = 04
May 3 10:23:08 Kyllburg kernel: fault virtual address = 0x18
May 3 10:23:08 Kyllburg kernel: fault code = supervisor read data, page not present
May 3 10:23:08 Kyllburg kernel: instruction pointer = 0x20:0xffffffff8109e252
May 3 10:23:08 Kyllburg kernel: stack pointer = 0x28:0xfffffe003fb64780
May 3 10:23:08 Kyllburg kernel: frame pointer = 0x28:0xfffffe003fb647c0
May 3 10:23:08 Kyllburg kernel: code segment = base 0x0, limit 0xfffff, type 0x1b
May 3 10:23:08 Kyllburg kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
May 3 10:23:08 Kyllburg kernel: processor eflags = interrupt enabled, resume,
May 3 10:24:49 Kyllburg syslogd: kernel boot file is /boot/kernel/kernel
May 3 10:24:49 Kyllburg kernel: ---<<BOOT>>---
May 3 10:24:49 Kyllburg kernel: Copyright (c) 1992-2021 The FreeBSD Project.
So it seems that every time the trap gets triggered, it happens directly after one of the PPPoE server interfaces is renamed.
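(Those 'ngX: changing name to poesY-Z' lines come from the per-session ng_iface nodes the PPPoE server (mpd5) creates via netgraph; if it helps, they can be inspected from a shell on the box. The interface name below is just one taken from the log above.)

ngctl list          # shows all netgraph nodes, including the per-session ones
ifconfig poes1-1    # the ng interface after the PPPoE server renamed it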
Limiters had a few problems in 22.01, but we also have newer drivers, IGC interfaces instead of the older NICs (the old ones were igb AFAIK), and many services in newer versions, so we are currently at a loss as to why the boxes are crashing with those kernel traps.
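(If it helps with narrowing this down, which driver each port is bound to can be double-checked from a shell; the sysctl names below assume the usual igc/ix unit numbering on the 6100.)

pciconf -lv              # each PCI NIC with the driver attached to it
sysctl dev.igc.0.%desc   # description of the first igc port
sysctl dev.ix.0.%desc    # description of the first ix port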
Any insight, help or ideas appreciated, as the random crashes are many and frequent, and in that state the internet he provides isn't very useful right now.
Trying to provide as much intel as I can.
Cheers
\jens
-
Is there an actual crash report with a backtrace created?
Mmm, netgraph on igc is an interesting combination. That seems more likely to be the issue than the Limiters. The issues with Limiters we've seen do not crash like that; they just don't pass traffic as expected.
Are they able to test running that config on the ix NICs instead?
Steve
-
@stephenw10 said in Random crashes "Fatal trap 12: page fault while in kernel mode":
Is there an actual crash report with a backtrace created?
No, unfortunately not. It seems it's immediately restarted and "killed" after catching that output. No crash dump or anything showing up in the filesystem or UI.
@stephenw10 said in Random crashes "Fatal trap 12: page fault while in kernel mode":
Are they able to test running that config on the ix NICs instead?
I could ask, but AFAIK the 6100 is only temporary, as they suspected the 4100 to be faulty, and the WAN1/2 ports are both in use for uplinks: ix3 is used for pppoe0, ix2 is a second WAN provider, and there are no GBICs or other 10G equipment on site. But I'll ask if they could disable one of the WANs temporarily.
Are you thinking of the PPPoE Server on igc0 as the culprit? Then I'd have them try ix2 as a replacement if possible, but I'd have to ask.
-
Are they able to log the console output and trigger the issue? That should get the backtrace unless it's a hard lock.
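(For example, something along these lines from another machine attached to the Netgate's USB console would capture the panic text; the device path and speed are assumptions, adjust to whatever the console enumerates as.)

script /tmp/6100-console.log cu -l /dev/cuaU0 -s 115200
# leave that session attached until it panics; the trap text and backtrace
# end up in /tmp/6100-console.log on the machine doing the logging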
The issue is probably that the PPPoE server uses netgraph, and that interfaces with the hardware/drivers at a very low level. If they can easily reproduce the issue, then testing it on ix would be a very good test to prove it out.
Steve
-
It seems it's not easily triggerable. The box rebooted again this morning around 6:30am with a trap. The log shows the same pattern: the trap occurs after a few users log in to the PPPoE server and the server sets/changes the interface names to 'poesX-Y'; directly afterwards it threw a hissy fit and rebooted.
There's no core dump nor any trace intel. The system log is all we have. I now have SSH access to the box but can't find any other trace.
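(In case anyone is following along, these are the kinds of things worth checking over SSH for any sign of a dump; nothing here changes any config.)

dumpon -l          # which device, if any, is set up to receive kernel dumps
swapinfo -h        # whether a swap device exists at all
ls -l /var/crash/  # where savecore would have put a vmcore or textdump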
Also, the customer changed the NICs and swapped them around: LAN is now running on ix2 and WAN2 went to igc3 instead. So the same thing is happening on the ix-style interfaces with a different driver.
...
May 4 06:28:46 kernel frame pointer = 0x28:0xfffffe003fb647c0
May 4 06:28:46 kernel stack pointer = 0x28:0xfffffe003fb64780
May 4 06:28:46 kernel instruction pointer = 0x20:0xffffffff8109e252
May 4 06:28:46 kernel fault code = supervisor read data, page not present
May 4 06:28:46 kernel fault virtual address = 0x18
May 4 06:28:46 kernel cpuid = 0; apic id = 04
May 4 06:28:46 kernel Fatal trap 12: page fault while in kernel mode
May 4 06:28:46 kernel ng24: changing name to 'poes1-24'
May 4 06:28:46 radiusd 6763 (43) Login OK: [yyy/xxx] (from client Self port 21 cli 3481c43dad07)
May 4 06:28:44 kernel ng23: changing name to 'poes1-23'
May 4 06:28:44 radiusd 6763 (42) Login OK: [yyy/xxx] (from client Self port 25 cli e8df70915c6e)
May 4 06:28:44 kernel ng22: changing name to 'poes1-22'
May 4 06:28:44 radiusd 6763 (41) Login OK: [yyy/xxx] (from client Self port 24 cli 7cff4dc592d7)
May 4 06:28:43 kernel ng21: changing name to 'poes1-21'
May 4 06:28:43 radiusd 6763 (40) Login OK: [yyy/xxx] (from client Self port 20 cli 2465118bbfb7)
May 4 06:28:43 kernel ng20: changing name to 'poes1-20'
May 4 06:28:43 radiusd 6763 (39) Login OK: [yyy/xxx] (from client Self port 19 cli e0286db734d7)
May 4 06:28:43 kernel ng19: changing name to 'poes1-19'
May 4 06:28:43 radiusd 6763 (38) Login OK: [yyy/xxx] (from client Self port 18 cli 989bcb93de7c)
May 4 06:28:43 kernel ng18: changing name to 'poes1-18'
May 4 06:28:43 radiusd 6763 (37) Login OK: [yyy/xxx] (from client Self port 17 cli 2c91ab0247d8)
The same happened during the reconfiguration of the interface directly afterwards, as all the users came back and logged in again immediately. Somehow it seems that too many reconfigurations at once could trigger it? Or maybe some kind of race condition?
May 3 20:19:19 kernel ---<<BOOT>>---
May 3 20:19:19 syslogd kernel boot file is /boot/kernel/kernel
May 3 20:17:38 kernel processor eflags = interrupt enabled, resume,
May 3 20:17:38 kernel = DPL 0, pres 1, long 1, def32 0, gran 1
May 3 20:17:38 kernel code segment = base 0x0, limit 0xfffff, type 0x1b
May 3 20:17:38 kernel frame pointer = 0x28:0xfffffe003fb647c0
May 3 20:17:38 kernel stack pointer = 0x28:0xfffffe003fb64780
May 3 20:17:38 kernel instruction pointer = 0x20:0xffffffff8109e252
May 3 20:17:38 kernel fault code = supervisor read data, page not present
May 3 20:17:38 kernel fault virtual address = 0x18
May 3 20:17:38 kernel cpuid = 0; apic id = 04
May 3 20:17:38 kernel Fatal trap 12: page fault while in kernel mode
May 3 20:17:38 kernel ng20: changing name to 'poes1-20'
May 3 20:17:38 kernel ng19: changing name to 'poes1-19'
May 3 20:17:37 kernel ng18: changing name to 'poes1-18'
May 3 20:17:37 kernel ng17: changing name to 'poes1-17'
May 3 20:17:37 kernel ng16: changing name to 'poes1-16'
May 3 20:17:37 kernel ng15: changing name to 'poes1-15'
May 3 20:17:37 kernel ng14: changing name to 'poes1-14'
May 3 20:17:37 kernel ng12: changing name to 'poes1-12'
May 3 20:17:37 kernel ng13: changing name to 'poes1-13'
May 3 20:17:36 kernel ng11: changing name to 'poes1-11'
May 3 20:17:36 kernel ng10: changing name to 'poes1-10'
May 3 20:17:36 kernel ng9: changing name to 'poes1-9'
May 3 20:17:36 kernel ng8: changing name to 'poes1-8'
May 3 20:17:36 kernel ng7: changing name to 'poes1-7'
May 3 20:17:36 kernel ng6: changing name to 'poes1-6'
May 3 20:17:36 kernel ng5: changing name to 'poes1-5'
May 3 20:17:36 kernel ng4: changing name to 'poes1-4'
May 3 20:17:36 kernel ng3: changing name to 'poes1-3'
May 3 20:17:35 kernel ng2: changing name to 'poes1-2'
May 3 20:17:34 kernel ng1: changing name to 'poes1-1'
May 3 20:17:30 radiusd 80534 (25) Login OK: [yyy/xxx] (from client Self port 26 cli 3810d5aeced3)
May 3 20:17:30 radiusd 80534 (24) Login OK: [yyy/xxx] (from client Self port 24 cli e8df70915c6e)
May 3 20:17:30 radiusd 80534 (23) Login OK: [yyy/xxx] (from client Self port 21 cli ccce1efed6a4)
May 3 20:17:30 radiusd 80534 (22) Login OK: [yyy/xxx] (from client Self port 22 cli 3810d54da23c)
May 3 20:17:30 radiusd 80534 (21) Login OK: [yyy/xxx] (from client Self port 20 cli 2c91abbba0bf)
May 3 20:17:30 radiusd 80534 (20) Login OK: [yyy/xxx] (from client Self port 19 cli 989bcbae8132)
May 3 20:17:30 radiusd 80534 (19) Login OK: [yyy/xxx] (from client Self port 18 cli 7cff4dc592d7)
May 3 20:17:30 radiusd 80534 (18) Login OK: [yyy/xxx] (from client Self port 16 cli 1ced6f75b2df)
May 3 20:17:29 radiusd 80534 (17) Login OK: [yyy/xxx] (from client Self port 14 cli 989bcb5de4e1)
May 3 20:17:29 radiusd 80534 (16) Login OK: [yyy/xxx] (from client Self port 13 cli dc396f4c37a1)
May 3 20:17:29 radiusd 80534 (15) Login OK: [yyy/xxx] (from client Self port 12 cli 2c91abbfc809)
May 3 20:17:29 radiusd 80534 (14) Login OK: [yyy/xxx] (from client Self port 11 cli 989bcb73d0f7)
May 3 20:17:29 radiusd 80534 (13) Login OK: [yyy/xxx] (from client Self port 10 cli 989bcb93de7c)
May 3 20:17:29 radiusd 80534 (12) Login OK: [yyy/xxx] (from client Self port 9 cli dc15c84b2c26)
May 3 20:17:29 radiusd 80534 (11) Login OK: [yyy/xxx] (from client Self port 8 cli 5c49797ab9e9)
May 3 20:17:28 radiusd 80534 (10) Login OK: [yyy/xxx] (from client Self port 25 cli 5c49797901da)
May 3 20:17:28 radiusd 80534 (9) Login OK: [yyy/xxx] (from client Self port 6 cli 5c497978c2e7)
May 3 20:17:28 radiusd 80534 (8) Login OK: [yyy/xxx] (from client Self port 23 cli dc15c84803c2)
May 3 20:17:28 radiusd 80534 (7) Login OK: [yyy/xxx] (from client Self port 5 cli e0286db734d7)
May 3 20:17:28 radiusd 80534 (6) Login OK: [yyy/xxx] (from client Self port 4 cli 989bcb1a67bb)
May 3 20:17:28 radiusd 80534 (5) Login OK: [yyy/xxx] (from client Self port 2 cli 2c91ab0247d8)
May 3 20:17:28 radiusd 80534 (4) Login OK: [yyy/xxx] (from client Self port 1 cli e0286db33fb5)
May 3 20:17:28 radiusd 80534 (3) Login OK: [yyy/xxx] (from client Self port 17 cli dc396f229ccd)
May 3 20:17:28 radiusd 80534 (2) Login OK: [yyy/xxx] (from client Self port 15 cli ccce1e56e575)
May 3 20:17:27 radiusd 80534 (1) Login OK: [yyy/xxx] (from client Self port 7 cli 3481c43dad07)
May 3 20:17:26 radiusd 80534 (0) Login OK: [yyy/xxx] (from client Self port 3 cli 2c3afddd57fb)
-
Hmm, might need a console log to know more.
Reviewing this, I did actually test a PPPoE server on igc when we were testing the 6100 internally. I never saw an issue, but I only tested with a few clients, so it could well be a condition only created by multiple simultaneous connections.
Steve
-
@stephenw10 said in Random crashes "Fatal trap 12: page fault while in kernel mode":
Hmm, might need a console log to know more.
Tell me what you need and I'll try to get it :) I'm still on the "potential race condition" or bug line of thinking, but I'll be happy to shed some light on this either way. We checked for a core dump and, as I thought, none is written, so that's a no I'm afraid.
It also strikes me as odd that the mass-reconnection to the new PPPoE Server interface (ix) immediately triggered a trap. That's why I was thinking race condition, with multiple reconnects happening at once, but that's hard to provoke and annoys the users, so it's nothing one could "easily reproduce" in production. Still wondering why it doesn't happen more frequently but seemingly at random.
Cheers
\jens
-
Mmm, if neither of those devices has an SSD then there is probably no SWAP partition configured, and that's required for a crash dump.
The panic and backtrace will still be shown on the console though, so you can get it by hooking the console up to something and logging that output until it panics.
I agree it seems likely to be a race condition caused by multiple simultaneous client connections. Unclear what exactly yet, though.
I reinstated my PPPoE server on a 6100 and used a 4100 as the client, both via igc NICs. So far it's been solid, but I'm unlikely to be able to trigger it.
Steve
-
@stephenw10 Coming back to this as the problem persists; the customer had time to do what TAC support told me.
He re-installed from scratch the 4100 that was originally supposed to be in place there, imported the config, and switched back from the 6100 to the 4100 on site. Basically the same as above happened again after switching, when multiple PPPoE requests came in. Now, a few days and a few reboots later (sadly the reinstall did nothing to fix the problem, as assumed), we finally have a core dump to share:
So the reinstall seemed to fix the "no core dump" situation at least.
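(For anyone wanting to poke at such a dump from the shell rather than the GUI, a rough sketch; the file names vary, N is whatever savecore numbered the dump, and kgdb comes from the gdb package.)

ls /var/crash/                                 # info.N plus either textdump.tar.N or vmcore.N
tar -xvf /var/crash/textdump.tar.0             # a textdump is just a tarball of the ddb output
kgdb /boot/kernel/kernel /var/crash/vmcore.0   # for a full vmcore; then 'bt' for the backtrace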
-
Ok, well it's not familiar to me but it's an unusual setup. I'll see if anyone else here has seen it.
-
One of our developers is looking at this. I have opened a bug report for it:
https://redmine.pfsense.org/issues/13210
Steve
-
@stephenw10 Thanks Steve, will track it there
-
This fix is in 2.7 snapshots now, are you able to test that?
Steve
-
@stephenw10 We're already in touch with both Netgate TAC and the customer to test the fix. We updated the location in question yesterday evening. Besides a problem with the device not cleanly restarting/booting after the upgrade to 22.05-RC (we've had that same problem thrice already), after a slight "power-loss restart" procedure ;) it rebooted and upgraded fine. As far as I can say, up until now we have ~10h with no lockup, freeze or panic.
Not really in the clear yet, as we had some big "looking good" windows during testing too, but it looks promising!
knocks on wood
-
To add it here: the customer has updated to a newer RC snapshot, as the earlier one got him a few report emails from the box about occasional packet loss (not that often) and he wanted to check whether that would be fixed, too.
On the latest RC snapshot there are thus far no problems to report. No crash dump, no freeze, no panic. The packet loss seems gone too :) So happy on both fronts for now; makes me happy to report that.
Great job everyone involved. Shoutout to TAC support for their help and staying on the topic, too!
Cheers
\jens