Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7
-
Looks like maybe SWAP is enabled but only to 1024MB.
Seem correct? Dark is primary, light is secondary and the one that's crashed the most.
-
Hmm. Unfortunately I think you'd almost certainly need to reinstall that with more SWAP to be able to dump the full core.
-
@stephenw10 said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:
Hmm. Unfortunately I think you'd almost certainly need to reinstall that with more SWAP to be able to dump the full core.
Would running the debug kernel you propose avoid the need to reinstall to change swap, or would I need to change swap to get the dumps even with the debug kernel in place? Is there no way to change the amount of swap without reinstalling?
-
If you have any additional backtraces to compare that would be useful. Particularly from 2.8.1 to confirm you get the same thing there repeatedly.
-
The debug kernel doesn't require SWAP, so no reinstall, but it may not tell us more. It's worth trying though.
-
Mmm, looks like that first crash could be this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=285813
-
@stephenw10 said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:
The debug kernel doesn't require SWAP, so no reinstall, but it may not tell us more. It's worth trying though.
If having proper swap in place to catch this is the sure-fire way to capture the relevant information needed to determine what this is, I'll work on that. I have the process of reinstalling down pretty good now.....not sure I know how to properly make swap adjustments needed to get this right but am willing to give it a go if it reveals something useful here.
Any guidance on what the swap size should be here? It doesn't look like I'm using a ton of memory and I think general FreeBSD guidance is twice the amount of memory in the system. Does that sound like a reasonable set up for this test?
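A quick way to see where a box stands before reinstalling is to compare configured swap against physical RAM. The sketch below is only illustrative: the function name `swap_covers_ram` is hypothetical, and the "swap at least as large as RAM" target is an assumption for a full dump, not a figure given in this thread.

```shell
#!/bin/sh
# Sketch: compare configured swap against physical RAM.
# Assumption: "swap >= RAM" is the target for a full dump; this threshold
# is illustrative and not taken from the thread.

# swap_covers_ram RAM_BYTES SWAP_KB
# Exit status 0 if swap (in KB) is at least as large as RAM (given in bytes).
swap_covers_ram() {
    ram_kb=$(( $1 / 1024 ))
    [ "$2" -ge "$ram_kb" ]
}

# Only query the live system on FreeBSD, where these tools exist.
if [ "$(uname -s)" = "FreeBSD" ]; then
    ram_bytes=$(sysctl -n hw.physmem)
    # Sum the 1K-blocks column, skipping the header and any "Total" line.
    swap_kb=$(swapinfo -k | awk 'NR > 1 && $1 != "Total" { sum += $2 } END { print sum + 0 }')
    if swap_covers_ram "$ram_bytes" "$swap_kb"; then
        echo "swap (${swap_kb} KB) is at least as large as RAM"
    else
        echo "swap (${swap_kb} KB) is smaller than RAM ($(( ram_bytes / 1024 )) KB)"
    fi
fi
```

With the 1024MB of swap mentioned earlier and, say, 4GB of RAM, the check would report swap as smaller than RAM.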
-
@stephenw10 said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:
Mmm, looks like that first crash could be this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=285813
Hrmm......would that sort of issue be seen with CARP enabled? And seems that's still an open bug if true. Wonder how soon something like that gets fixed in BSD.
Strangely, I cannot get this to crash now. Since the reinstall and the one panic I posted, these have been rock-solid with no panics. Good and bad.
-
OK I think I have the swap configured now:
Anything else required to get this to get the info we need?
-
Ok cool. So to enable full core dumps you need to edit the file /etc/pfSense-ddb.conf. Change the script kdb.enter.default line to:
script kdb.enter.default=bt ; show registers ; dump ; reset
Reboot then check the output of:
sysctl debug.ddb.scripting.scripts
Make sure it shows the changed line.
Then you can test it by manually triggering a panic by running:
sysctl debug.kdb.panic=1
You should see the core file after it reboots. After that just wait for the next crash or somehow trigger it if you can.
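If you would rather script the edit than do it by hand, something along these lines should work. It is a minimal sketch: the function name `set_full_dump` is made up for illustration, and it assumes the file has exactly one active (uncommented) `script kdb.enter.default=` line. Take a backup first.

```shell
#!/bin/sh
# set_full_dump CONF
# Rewrite the active "script kdb.enter.default=" line in CONF so that a
# panic runs: bt ; show registers ; dump ; reset
# Commented-out lines (starting with "#") are left untouched.
set_full_dump() {
    conf=$1
    tmp=$(mktemp "${TMPDIR:-/tmp}/ddbconf.XXXXXX") || return 1
    # Write the edited copy to a temp file, then copy it back over the
    # original so the original file's permissions are kept.
    sed 's|^script kdb\.enter\.default=.*|script kdb.enter.default=bt ; show registers ; dump ; reset|' \
        "$conf" > "$tmp" && cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example (run as root on the firewall):
#   cp /etc/pfSense-ddb.conf /etc/pfSense-ddb.conf.bak
#   set_full_dump /etc/pfSense-ddb.conf
```

The reboot and the sysctl check described above are still needed afterwards, since the file is only read at boot.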
-
@stephenw10 said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:
Ok cool. So to enable full core dumps you need to edit the file /etc/pfSense-ddb.conf. Change the script kdb.enter.default line to:
script kdb.enter.default=bt ; show registers ; dump ; reset
Reboot then check the output of:
sysctl debug.ddb.scripting.scripts
Make sure it shows the changed line.
Then you can test it by manually triggering a panic by running:
sysctl debug.kdb.panic=1
You should see the core file after it reboots. After that just wait for the next crash or somehow trigger it if you can.
OK I think I have this done:
# $FreeBSD$
#
# This file is read when going to multi-user and its contents piped thru
# ``ddb'' to define debugging scripts.
#
# see ``man 4 ddb'' and ``man 8 ddb'' for details.
#
script lockinfo=show locks; show alllocks; show lockedvnods
script pfs=bt ; show registers ; show pcpu ; run lockinfo ; acttrace ; ps ; alltrace
# kdb.enter.panic panic(9) was called.
# script kdb.enter.default=textdump set; capture on; run pfs ; capture off; textdump dump; reset
script kdb.enter.default=bt ; show registers ; dump ; reset
# kdb.enter.witness witness(4) detected a locking error.
script kdb.enter.witness=run lockinfo
sysctl debug.ddb.scripting.scripts
debug.ddb.scripting.scripts: lockinfo=show locks; show alllocks; show lockedvnods
pfs=bt ; show registers ; show pcpu ; run lockinfo ; acttrace ; ps ; alltrace
kdb.enter.default=bt ; show registers ; dump ; reset
kdb.enter.witness=run lockinfo
I cannot seem to have this thing crash anymore. I'll see if I can mess with it to get it to panic again. Let me know if this setting looks right. Thanks again here for all the help. Really appreciate the time.
-
Yup that looks good. You can try the forced manual panic just to make sure it creates the core file but I'm pretty confident it will.
Otherwise just wait for the next crash.
-
@stephenw10 said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:
You can try the forced manual panic just to make sure it creates the core file but I'm pretty confident it will.
Yeah, I forgot to do that. Did it just now and it did restart. Created a file called 'VMCore.0' that's like 2.5GB in size. That sound about right?
-
I don't know if it helps anyone but I was having a kernel panic issue on the first boot after trying to install 2.8 and in my case it was:
iwm7265Dfw: could not load firmware image, error 6
I was able to fix it by dropping into the shell of the installer after the installation process and before the final reboot, and adding this line:
hint.iwm.0.disabled="1"
to the end of
/mnt/boot/loader.conf
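For anyone repeating this from the installer shell, the append can be made safe to run more than once. This is a rough sketch only: the function name `add_loader_hint` is hypothetical, and it assumes the existing file ends with a newline.

```shell
#!/bin/sh
# add_loader_hint FILE
# Append the iwm-disable hint to FILE unless that exact line is already there.
# Assumption: FILE, if it exists, ends with a trailing newline.
add_loader_hint() {
    hint='hint.iwm.0.disabled="1"'
    grep -qxF "$hint" "$1" 2>/dev/null || printf '%s\n' "$hint" >> "$1"
}

# Example, from the installer shell before the final reboot:
#   add_loader_hint /mnt/boot/loader.conf
```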
-
No that's an unrelated bug. This one looks more difficult to fix unfortunately!
-
So unfortunately, I have been monkeying with this all day and have not been able to get this to panic. I'm not sure what's changed other than the panic dump config changes and re-installing the software via the Netgate installer. These things must know we are on to them.
I've tried every combination of restarting, shutting off switches, unplugging ports, restart one, keep one running, blah, blah. They won't panic now.
I did notice some of my FRR OSPF configuration did not come over in the re-install process, namely the interface authentication config. It's quite possible I had removed it at some point in my testing, but I don't think so. Is anyone aware of issues with FRR configs not importing correctly on 2.8? I would doubt it, and it's not important to this issue, but thought I'd ask while we wait for these things to panic again.
-
Well at least you're set up to catch it now if/when it does.