Random crashes "Fatal trap 12: page fault while in kernel mode"

stephenw10

Hmm, might need a console log to know more.
Reviewing this, I did actually test a PPPoE server on igc when we were testing the 6100 internally. I never saw an issue, but I only tested with a few clients, so it could well be a condition that is only created by multiple simultaneous connections.

Steve

JeGr

@stephenw10 said in Random crashes "Fatal trap 12: page fault while in kernel mode":

Hmm, might need a console log to know more.

Tell me what you need and I'll try to get it :) I'm still leaning towards a potential race condition or a bug, but I'll be happy to shed some light on this either way. We checked for a core dump and, as I thought, none was written, so that's a no, I'm afraid.
It also strikes me as odd that the mass reconnection to the new PPPoE server interface (ix) immediately triggered a trap. That's why I was thinking of a race condition with multiple reconnects happening at once, but that's hard to come by and annoys the users, so it's nothing one could "easily reproduce" in production. I'm still wondering why it doesn't happen more frequently but only seemingly at random.

Cheers
\jens

stephenw10

Mmm, if neither of those devices has an SSD then there is probably no swap partition configured, and that's required for a crash dump.
The panic and backtrace will still be shown on the console though, so you can get them by hooking the console up to something and logging that output until it panics.
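
A minimal sketch of pulling the panic out of a captured console log afterwards, assuming the serial output was saved to a plain text file by whatever terminal program was doing the logging. The function name and the `context` parameter are made up for illustration; only the "Fatal trap" string comes from this thread's title, and other kinds of panic messages are not covered.

```python
import re

# Marker based on the message in the thread title ("Fatal trap 12: ...").
# Assumption: the panic of interest starts with a line of this shape.
PANIC_RE = re.compile(r"Fatal trap \d+: ")


def extract_panic(log_text, context=40):
    """Return the first 'Fatal trap' line plus up to `context` lines after it.

    Returns an empty list if no such line is found in the captured log.
    """
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if PANIC_RE.search(line):
            # Keep the trap line and the lines that follow it, which is
            # where the register dump and backtrace would be expected.
            return lines[i:i + 1 + context]
    return []
```

To use it, read the saved log into a string (for example with `open(path, errors="replace").read()`, where `path` is wherever the console output was logged) and print the returned lines.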

I agree it seems likely to be a race condition caused by multiple simultaneous client connections. It's unclear where exactly yet, though.
I re-instated my PPPoE server on a 6100 and used a 4100 as the client, both via igc NICs. So far it's been solid but I'm unlikely to be able to trigger it.

Steve

JeGr

@stephenw10 Coming back to this as the problem persists: the customer has had time to do what TAC support told me to do.

He re-installed from scratch the 4100 that was originally supposed to be in place there, imported the config and switched back from the 6100 to the 4100 on site. Basically the same thing as above happened again after the switch, when multiple PPPoE requests came in. Now, a few days and a few reboots later (sadly, as assumed, those did nothing to fix the problem), we finally have a core dump to share:

textdump.tar.0
info.0

So the reinstall seems to have fixed the "no core dump" situation at least.

stephenw10

Ok, well it's not familiar to me but it's an unusual setup. I'll see if anyone else here has seen it.

stephenw10

One of our developers is looking at this. I have opened a bug report for it:
https://redmine.pfsense.org/issues/13210

Steve

JeGr

@stephenw10 Thanks Steve, will track it there

stephenw10

The fix is in 2.7 snapshots now. Are you able to test that?

Steve

JeGr

@stephenw10 We're already in touch with both Netgate TAC and the customer to test the fix. We updated the location in question yesterday evening. Apart from the device not cleanly restarting/booting after the upgrade to 22.05RC (we've had that same problem three times already), it rebooted and upgraded fine after a slight "powerloss-restart" procedure ;) As far as I can say, up until now we have had ~10h without a lockup, freeze or panic.

Not really in the clear yet as we had some big "looking good" windows when testing, too, but it seems to look promising!

knocks on wood

JeGr

To add to this: the customer has updated to a newer RC snapshot, as the earlier one got him a few report emails from the box about occasional packet loss (not that often) and he wanted to check whether that would be fixed, too.
On the latest RC snapshot there are no problems to report thus far. No crash dump, no freeze, no panic. The packet loss seems to be gone, too :) So it's good news on both fronts for now, and I'm happy to report that.

Great job everyone involved. Shoutout to TAC support for their help and staying on the topic, too!

Cheers
\jens