Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7
-
Hmm, the package should get reinstalled when importing a config. The only reason it wouldn't is if there was no access to the repo after the boot when the install is triggered. But you should see something logged if that happens.
I sometimes see that if the interface config is very different and that's still in progress. Restoring the config again after boot will correctly reinstall packages if that's the case.
-
@stephenw10 Yes, I think this is exactly what happened here. After my last post I realized that the gateway I currently use to get to the Internet was not configured. I have a third link I use to get off net in my test environment. This config is for our data center environment and has IP addresses that do not exist here. So, I created a third DHCP interface to tie this into the actual LAN the boxes are currently on. I switch to using this interface as the gateway when I need them to be able to download pfBlockerNG updates, access Netgate servers, etc. For some reason, in the config I imported, the GW was set to the normal GW, which will work in my DC setup. It just doesn't work here. So, I had to manually switch to using the secondary gateway to download and install the missing packages.
I got impatient and went ahead and reinstalled the primary with 2.8.0 and restored the config there. This time I saw a message on first login that said it was reinstalling the packages in the background. I switched to the opt1 interface gateway and all the packages were installed perfectly. Not sure why I had so much trouble with previous backup/restores, but this works slick today.
So now I am running the HA pair both on fresh installs of 2.8.0 (not upgraded from 2.7.0). Will let this bake for today and see what we get. Will post any additional dumps I get here.
Thanks all for the help.
-
Well, almost as soon as I hit 'Submit' on my last post, the backup firewall panicked again. It had been running solid for several hours since the reinstall. As soon as I got the primary up and running, something happened and it suddenly crashed. I was just posting here, so I was doing nothing on the FW. It just crashed. Not sure I made this clear, but these are in my test environment. No traffic is going through them currently. Likely known, but I wanted to make sure folks know. They are sitting next to me on a test rack, and I can hear when the server restarts because the fans spin up.
Anyway here is the latest dump on 2.8.0:
-
Any idea if FreeBSD/pfSense 15 has issues with NVMe drives? That's what I am running on these systems. Drives I am running in these boxes:
Crucial P3 2TB PCIe Gen3 3D NAND NVMe M.2 SSD CT2000P3SSD8
Memory:
Samsung 64GB DDR4 PC4-21300 2666MHz LRDIMM Quad Ranked Registered ECC Memory (M386A8K40BM2-CTD)
-
Hmm, OK that's identical to the first crash. Which was also in 2.8.0. Was that actually the same device?
No there's no known issue with NVMe drives. We use them in our hardware.
-
@stephenw10 said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:
Hmm, OK that's identical to the first crash. Which was also in 2.8.0. Was that actually the same device?
No there's no known issue with NVMe drives. We use them in our hardware.
Yes, same code, same device as the first dump post. It's the backup device in the CARP HA pair if that makes any difference. Both devices have the same exact hardware (CPU, MB, Disk, Mem, etc.).
Does the dump tell you anything specific as to what is happening, or just that it is the same as before? Should we be able to glean a specific cause from this type of dump, or does it mostly tell you that a crash occurred, like a marker that something happened in case you are not around to witness it firsthand? These are sitting right next to me, so I am lucky enough to know when panics happen before even seeing the dump file. Are these dumps not capturing enough data to tell WHY the crash happened? Is there something I can tweak in the config to get additional info as to the cause?
Incidentally I got the one panic this morning and tried to get another today but this thing never panicked again. Go figure.
-
It doesn't mean anything to me. I can see it doesn't have much that's non-generic but I'll run it past some devs tomorrow.
Next step is either to enable a full core dump to analyse or try running the debug kernel.
https://docs.netgate.com/pfsense/en/latest/troubleshooting/debug-kernel.html
Do you have SWAP enabled on those? How big is it?
To get a full core dump usually requires SWAP at least as large as the RAM to dump it to.
-
@stephenw10 said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:
It doesn't mean anything to me. I can see it doesn't have much that's non-generic but I'll run it past some devs tomorrow.
Next step is either to enable a full core dump to analyse or try running the debug kernel.
https://docs.netgate.com/pfsense/en/latest/troubleshooting/debug-kernel.html
Do you have SWAP enabled on those? How big is it?
To get a full core dump usually requires SWAP at least as large as the RAM to dump it to.
I'm not sure how one would go about "enabling swap". I didn't do anything to specifically enable it anywhere that I am aware of. I installed using the installer and imported my config from before. If it's not enabled by default, then I likely don't have it enabled. Both boxes have 64GB of RAM installed in them and 2 TB NVMe drives.
Thanks for the link. I'll see about loading up the debug kernel to see if it reveals anything useful.
Thanks for checking with the devs here. Again, really appreciate the help on this.
-
It would be enabled by default but probably not at >64GB so dumping a full core to it may or may not be possible depending on how much RAM is actually in use. But first check how much SWAP there is. It's shown on the dashboard.
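For a command-line check of the same thing, the comparison can be sketched as below. This is only a minimal sketch, not an official procedure: the helper name `swap_fits_ram` is made up for illustration, and it applies the conservative rule of thumb that swap should be at least as large as installed RAM. The example figures (64 GB of RAM, a hypothetical 1 GiB of swap) are illustrative. On FreeBSD the real inputs would typically come from `sysctl -n hw.physmem` (bytes) and `swapinfo -k` (1 KiB blocks), as shown in the trailing comments.

```shell
#!/bin/sh
# Hypothetical helper: report whether swap is at least as large as RAM.
# $1 = physical memory in bytes, $2 = total swap in 1 KiB blocks.
swap_fits_ram() {
    ram_kb=$(( $1 / 1024 ))
    swap_kb=$2
    if [ "$swap_kb" -ge "$ram_kb" ]; then
        echo "yes"
    else
        echo "no"
    fi
}

# Illustrative figures: 64 GB of RAM and a hypothetical 1 GiB of swap.
swap_fits_ram $(( 64 * 1024 * 1024 * 1024 )) $(( 1024 * 1024 ))

# On the firewall itself the inputs could be gathered like this (FreeBSD only):
#   ram_bytes=$(sysctl -n hw.physmem)
#   swap_kb=$(swapinfo -k | awk 'NR > 1 { total = $2 } END { print total + 0 }')
#   swap_fits_ram "$ram_bytes" "$swap_kb"
```

A "no" here does not prove a dump will fail, since the space needed depends on how much RAM is actually in use, but it flags the case where a full core dump may not fit.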
-
Looks like maybe SWAP is enabled but only to 1024MB.
Seem correct? Dark is primary, light is secondary and the one that's crashed the most.
-
Hmm. Unfortunately I think you'd almost certainly need to reinstall that with more SWAP to be able to dump the full core.
-
@stephenw10 said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:
Hmm. Unfortunately I think you'd almost certainly need to reinstall that with more SWAP to be able to dump the full core.
Would running the debug kernel you propose avoid the need to reinstall to change swap, or would I need to change swap to get the dumps even with the debug kernel in place? Is there no way to change the amount of swap without reinstalling?
-
If you have any additional backtraces to compare that would be useful. Particularly from 2.8.1 to confirm you get the same thing there repeatedly.
-
The debug kernel doesn't require SWAP, so no reinstall, but it may not tell us more. It's worth trying though.
-
Mmm, looks like that first crash could be this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=285813
-
@stephenw10 said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:
The debug kernel doesn't require SWAP, so no reinstall, but it may not tell us more. It's worth trying though.
If having proper swap in place to catch this is the sure-fire way to capture the relevant information needed to determine what this is, I'll work on that. I have the process of reinstalling down pretty good now.....not sure I know how to properly make swap adjustments needed to get this right but am willing to give it a go if it reveals something useful here.
Any guidance on what the swap size should be here? It doesn't look like I'm using a ton of memory and I think general FreeBSD guidance is twice the amount of memory in the system. Does that sound like a reasonable set up for this test?
-
@stephenw10 said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:
Mmm, looks like that first crash could be this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=285813
Hrmm......would that sort of issue be seen with CARP enabled? And seems that's still an open bug if true. Wonder how soon something like that gets fixed in BSD.
Strangely, I cannot get this to crash now. Since the reinstall and the one panic I posted, these have been rock-solid with no panics. Good and bad.
-
OK I think I have the swap configured now:
Anything else required to get this to get the info we need?
-
Ok cool. So to enable full core dumps you need to edit the file /etc/pfSense-ddb.conf.
Change the script kdb.enter.default line to:
script kdb.enter.default=bt ; show registers ; dump ; reset
Reboot, then check the output of:
sysctl debug.ddb.scripting.scripts
Make sure it shows the changed line.
Then you can test it by manually triggering a panic by running:
sysctl debug.kdb.panic=1
You should see the core file after it reboots.
After that just wait for the next crash or somehow trigger it if you can.
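The "make sure it shows the changed line" step can also be scripted. The following is a minimal sketch under stated assumptions: the helper name `default_script_dumps` is hypothetical, and it simply reads the `sysctl debug.ddb.scripting.scripts` output on stdin and succeeds only when the `kdb.enter.default` script contains a standalone `dump` command (so the stock `textdump dump` does not count).

```shell
#!/bin/sh
# Hypothetical helper: read `sysctl debug.ddb.scripting.scripts` output on
# stdin and succeed only if kdb.enter.default runs a standalone "dump".
default_script_dumps() {
    awk '
        /kdb\.enter\.default=/ {
            sub(/^.*kdb\.enter\.default=/, "")        # keep only the command list
            n = split($0, cmds, ";")                  # ddb commands are ";"-separated
            for (i = 1; i <= n; i++) {
                gsub(/^[ \t]+|[ \t]+$/, "", cmds[i])  # trim surrounding whitespace
                if (cmds[i] == "dump") found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Example with the edited line from this post:
printf '%s\n' 'kdb.enter.default=bt ; show registers ; dump ; reset' \
    | default_script_dumps && echo "full dump enabled"

# On the firewall itself (FreeBSD only):
#   sysctl debug.ddb.scripting.scripts | default_script_dumps && echo "full dump enabled"
```

This only confirms that the script was loaded after the reboot; whether the core actually gets written still depends on having enough SWAP.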
-
@stephenw10 said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:
Ok cool. So to enable full core dumps you need to edit the file /etc/pfSense-ddb.conf.
Change the script kdb.enter.default line to:
script kdb.enter.default=bt ; show registers ; dump ; reset
Reboot, then check the output of:
sysctl debug.ddb.scripting.scripts
Make sure it shows the changed line.
Then you can test it by manually triggering a panic by running:
sysctl debug.kdb.panic=1
You should see the core file after it reboots.
After that just wait for the next crash or somehow trigger it if you can.
OK I think I have this done:
# $FreeBSD$
#
# This file is read when going to multi-user and its contents piped thru
# ``ddb'' to define debugging scripts.
#
# see ``man 4 ddb'' and ``man 8 ddb'' for details.
#
script lockinfo=show locks; show alllocks; show lockedvnods
script pfs=bt ; show registers ; show pcpu ; run lockinfo ; acttrace ; ps ; alltrace

# kdb.enter.panic panic(9) was called.
# script kdb.enter.default=textdump set; capture on; run pfs ; capture off; textdump dump; reset
script kdb.enter.default=bt ; show registers ; dump ; reset

# kdb.enter.witness witness(4) detected a locking error.
script kdb.enter.witness=run lockinfo

sysctl debug.ddb.scripting.scripts
debug.ddb.scripting.scripts: lockinfo=show locks; show alllocks; show lockedvnods
pfs=bt ; show registers ; show pcpu ; run lockinfo ; acttrace ; ps ; alltrace
kdb.enter.default=bt ; show registers ; dump ; reset
kdb.enter.witness=run lockinfo

I cannot seem to have this thing crash anymore. I'll see if I can mess with it to get it to panic again. Let me know if this setting looks right. Thanks again here for all the help. Really appreciate the time.