Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7

SteveITS

@rfranzke The restore will install packages that were in the backup. There's no need to manually install packages.

It is possible to skip backing up packages, or to restore only parts of a config file.

This may help:
https://docs.netgate.com/pfsense/en/latest/backup/restore-during-install.html#restore-using-the-external-configuration-locator-ecl

rfranzke

@SteveITS Thanks for this. I am gonna try this. I found backups labelled with version 23.3 which I think is for 2.7.2.

Is there any value in trying to download the 2.8.0 installer, on the theory that the upgrade process itself was responsible for the issues I'm having, or should I just go back to 2.7.2? My issue here is that if I get this going again on 2.7.2, I'd be too afraid to ever upgrade this setup in the future. I really would like to be able to upgrade this install.

Still really wish I could find some guidance on using this dump file to figure out what's causing this specifically.

netblues

@rfranzke When running in HA, you can upgrade the secondary node only and fail over to it.
Having a backup of the secondary is doable.
If it does not crash, you can upgrade the primary too, or restore.

This is how upgrades are done.

As for the installer, I doubt a failed package install can crash the whole thing.
This is FreeBSD, not Windows ME.

stephenw10

Do you have other crash reports?

That one is not very revealing. However if they are all identical crashes it's probably a software issue.

SteveITS

@rfranzke said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:

value in trying to download the 2.8.0 installer

There is not a "2.8.0 installer"...there is a 2.7.2 installer, and the new Netgate Installer which lets you choose versions.

@netblues said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:

When running in HA, you can upgrade the secondary node only and fail over to it.
Having a backup of the secondary is doable.
If it does not crash, you can upgrade the primary too, or restore.

Generally, yes, but per the docs pf may not sync states correctly between FreeBSD versions:
https://docs.netgate.com/pfsense/en/latest/install/upgrade-guide-ha.html#pfsync-considerations
I've never tried to run a different version for long enough for a failover to matter so I can't say offhand if that's actually a normal problem or just a possibility.

rfranzke

@stephenw10 I have one that happened on the secondary FW right after the upgrade to 2.8.1. Same scenario basically: it boots up, runs for about 5-10 minutes and then panics. What is strange is that after this initial panic, it will run solid for a while with no crashes. It seems to happen just as it boots up, so I'm not sure if it's something starting up that does this. I have FRR running an OSPF process to a lab switch to exchange some internal subnet routes. I had set the process to start when the CARP status is master. That was a somewhat recent config change, but it ran fine with no panics on 2.7. So maybe it's trying to figure out whether it should start that process while watching CARP status between the two machines. This just seems like something the two FWs are trying to work out between each other early in the boot process, like a late-starting daemon or something. I'll try in the morning to start one box up and let it run for a bit before starting the other one and see what we get.

See new dump attached. Thanks for checking here.

textdump.tar (2).0

netblues

@SteveITS said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:

Generally, yes, but per the docs pf may not sync states correctly between FreeBSD versions:

I agree, however 2.7.2 and 2.8 are on the same FreeBSD version, so it's safe to do so.

And I have tested it recently too, with no (obvious) issues.

kprovost

@netblues said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:

@SteveITS said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:

Generally, yes, but per the docs pf may not sync states correctly between FreeBSD versions:

pfsync tries very hard to be compatible between versions too, but bugs do happen.

I agree, however 2.7.2 and 2.8 are on the same freebsd version, so it's safe to do so.

They are not.
They may both say "15", but they're not the same "15".

netblues

@kprovost Still, minor versions/differences.

No one would do that long term, but considering the situation above, that's the least of the problems too.

stephenw10

Hmm, well that's a completely different backtrace.

First Panic:

db:1:pfs> bt
Tracing pid 2 tid 100058 td 0xfffff8006c1a5740
kdb_enter() at kdb_enter+0x33/frame 0xfffffe015852db10
panic() at panic+0x43/frame 0xfffffe015852db70
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe015852dbd0
trap_pfault() at trap_pfault+0x46/frame 0xfffffe015852dc20
calltrap() at calltrap+0x8/frame 0xfffffe015852dc20
--- trap 0xc, rip = 0xffffffff80cf2042, rsp = 0xfffffe015852dcf0, rbp = 0xfffffe015852dd90 ---
__rw_wlock_hard() at __rw_wlock_hard+0x152/frame 0xfffffe015852dd90
arptimer() at arptimer+0x252/frame 0xfffffe015852de10
softclock_call_cc() at softclock_call_cc+0x16d/frame 0xfffffe015852dec0
softclock_thread() at softclock_thread+0xe5/frame 0xfffffe015852def0
fork_exit() at fork_exit+0x7b/frame 0xfffffe015852df30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe015852df30
--- trap 0xafafafaf, rip = 0xafafafafafafafaf, rsp = 0xafafafafafafafaf, rbp = 0xafafafafafafafaf ---

2nd Panic:

db:1:pfs> bt
Tracing pid 11 tid 100003 td 0xfffff8006c179740
kdb_enter() at kdb_enter+0x33/frame 0xfffffe015840b9c0
panic() at panic+0x43/frame 0xfffffe015840ba20
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe015840ba80
trap_pfault() at trap_pfault+0x46/frame 0xfffffe015840bad0
calltrap() at calltrap+0x8/frame 0xfffffe015840bad0
--- trap 0xc, rip = 0xffffffff80d15bdd, rsp = 0xfffffe015840bba0, rbp = 0xfffffe015840bc00 ---
callout_process() at callout_process+0x1ad/frame 0xfffffe015840bc00
handleevents() at handleevents+0x186/frame 0xfffffe015840bc40
timercb() at timercb+0x236/frame 0xfffffe015840bc90
lapic_handle_timer() at lapic_handle_timer+0xab/frame 0xfffffe015840bcb0
Xtimerint() at Xtimerint+0xb1/frame 0xfffffe015840bcb0
--- interrupt, rip = 0xffffffff804eb162, rsp = 0xfffffe015840bd80, rbp = 0xfffffe015840bdb0 ---
acpi_cpu_idle() at acpi_cpu_idle+0x2e2/frame 0xfffffe015840bdb0
cpu_idle_acpi() at cpu_idle_acpi+0x46/frame 0xfffffe015840bdd0
cpu_idle() at cpu_idle+0x9d/frame 0xfffffe015840bdf0
sched_idletd() at sched_idletd+0x546/frame 0xfffffe015840bef0
fork_exit() at fork_exit+0x7b/frame 0xfffffe015840bf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe015840bf30
--- trap 0xafafafaf, rip = 0xafafafafafafafaf, rsp = 0xafafafafafafafaf, rbp = 0xafafafafafafafaf ---

I would try to compare more crashes if you can. Different and random backtraces like that usually point to a hardware issue. But that one is from a different version, so it may simply be different there.
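For anyone comparing traces like these by hand, here is a minimal Python sketch that pulls out the frames following the first `--- trap` marker, so that several panics can be compared at a glance. It assumes frame lines are laid out exactly as in the two backtraces above; `FRAME_RE` and `frames_after_trap` are names made up for this illustration, not part of any pfSense or FreeBSD tool.

```python
import re

# Matches ddb frame lines of the form seen in the backtraces above, e.g.
#   "arptimer() at arptimer+0x252/frame 0xfffffe015852de10"
FRAME_RE = re.compile(r"^(\w+)\(\) at \w+\+0x[0-9a-fA-F]+/frame 0x[0-9a-fA-F]+$")


def frames_after_trap(backtrace):
    """Return the function names listed after the first '--- trap' marker.

    The first name is the function that was executing when the fault hit.
    Collection stops at the next '---' marker (interrupt or end of stack).
    """
    names = []
    seen_trap = False
    for line in backtrace.splitlines():
        line = line.strip()
        if line.startswith("--- "):
            if seen_trap:
                break
            seen_trap = line.startswith("--- trap")
            continue
        match = FRAME_RE.match(line)
        if seen_trap and match:
            names.append(match.group(1))
    return names


# Excerpts copied from the two panics posted above.
FIRST = """\
calltrap() at calltrap+0x8/frame 0xfffffe015852dc20
--- trap 0xc, rip = 0xffffffff80cf2042, rsp = 0xfffffe015852dcf0, rbp = 0xfffffe015852dd90 ---
__rw_wlock_hard() at __rw_wlock_hard+0x152/frame 0xfffffe015852dd90
arptimer() at arptimer+0x252/frame 0xfffffe015852de10
softclock_call_cc() at softclock_call_cc+0x16d/frame 0xfffffe015852dec0
"""

SECOND = """\
calltrap() at calltrap+0x8/frame 0xfffffe015840bad0
--- trap 0xc, rip = 0xffffffff80d15bdd, rsp = 0xfffffe015840bba0, rbp = 0xfffffe015840bc00 ---
callout_process() at callout_process+0x1ad/frame 0xfffffe015840bc00
handleevents() at handleevents+0x186/frame 0xfffffe015840bc40
--- interrupt, rip = 0xffffffff804eb162, rsp = 0xfffffe015840bd80, rbp = 0xfffffe015840bdb0 ---
acpi_cpu_idle() at acpi_cpu_idle+0x2e2/frame 0xfffffe015840bdb0
"""

print(frames_after_trap(FIRST)[:2])   # ['__rw_wlock_hard', 'arptimer']
print(frames_after_trap(SECOND)[:2])  # ['callout_process', 'handleevents']
```

If the leading names match across many dumps, that fits the "identical crashes" case; if they keep changing, that fits the "different and random" case described above. The sketch only extracts names and does not diagnose anything by itself.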

rfranzke

OK, quick update here. I ran through the reinstall process on the backup firewall using the USB stick for the install as well as the USB config import on first boot. Seems to have worked, mostly. None of the packages got reinstalled using this method, however, so I reinstalled them manually. Not sure why that did not happen... maybe it just doesn't do that using this method. No matter, I know the process now. Very cool that it works this way and good to know, so thanks for the heads up. For those of you keeping track at home, I have reinstalled the backup FW with a fresh USB stick using the 2.8.0 option. So at the moment I have a 2.8.1 install on the primary firewall and 2.8.0 running on the backup firewall. I'll see how this goes in terms of crash dumps for a bit and then upgrade the primary if all goes well. I wanted to do this to see if I can get onto a proper latest stable build of CE. If the crashes continue I will have new dumps to post from the same OS. 2.8.0 is where I think I want to be, as the changes in the new .1 beta did not help me here. But if nothing else it's good for this N00B to run through the backup/restore process. Will see how it goes. Thanks for the help all.

stephenw10

Hmm, the packages should get reinstalled when importing a config. The only reason they wouldn't is if there was no access to the repo after the boot when the install is triggered. But you should see something logged if that happens.
I sometimes see that if the interface config is very different and that's still in progress. Restoring the config again after boot will correctly reinstall packages if that's the case.

rfranzke

@stephenw10 Yes, I think this is exactly what happened here. After my last post I realized that the gateway I currently use to get to the Internet was not configured. I have a third link I use to get off net in my test environment. This config is for our data center environment and has IP addresses that do not exist here. So, I created a third DHCP interface to tie this into the actual LAN the boxes are currently on. I switch to using this interface as the gateway when I need them to be able to download pfBlockerNG updates, access Netgate servers, etc. For some reason, in the config I imported the GW was set to the normal GW, which works in my DC setup. It just doesn't work here. So, I had to manually switch to using the secondary gateway to download and install the missing packages.

I got impatient and went ahead and reinstalled the primary with 2.8.0 and restored the config there. This time I saw a message on first login that said it was reinstalling the packages in the background. I switched to the opt1 interface gateway and all the packages were installed perfectly. Not sure why I had so much trouble with previous backup/restores, but this works slick today.

So now I am running the HA pair both on fresh installs of 2.8.0 (not upgraded from 2.7.0). Will let this bake for today and see what we get. Will post any additional dumps I get here.

Thanks all for the help.

rfranzke

Well, almost as soon as I hit 'Submit' on my last post, the backup firewall panicked again. It had been running solid for several hours since the reinstall. As soon as I got the primary up and running, something happened and it crashed all of a sudden. I was just posting here, so I was doing nothing on the FW. It just crashed. Not sure I made this clear, but these are in my test environment. No traffic is going through them currently. Likely known, but I wanted to make sure folks know. They are sitting next to me on a test rack, and I can hear when a server restarts because the fans spin up.

Anyway here is the latest dump on 2.8.0:

textdump.tar (4).0

rfranzke

Any idea if FreeBSD/pfSense 15 has issues with NVMe drives? That's what I am running on these systems. Drives I am running in these boxes:

Crucial P3 2TB PCIe Gen3 3D NAND NVMe M.2 SSD CT2000P3SSD8

Memory:

Samsung 64GB DDR4 PC4-21300 2666MHz LRDIMM Quad Ranked Registered ECC Memory (M386A8K40BM2-CTD)

stephenw10

Hmm, OK that's identical to the first crash. Which was also in 2.8.0. Was that actually the same device?

No there's no known issue with NVMe drives. We use them in our hardware.

rfranzke

@stephenw10 said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:

Hmm, OK that's identical to the first crash. Which was also in 2.8.0. Was that actually the same device?

No there's no known issue with NVMe drives. We use them in our hardware.

Yes, same code, same device as the first dump post. It's the backup device in the CARP HA pair if that makes any difference. Both devices have the same exact hardware (CPU, MB, Disk, Mem, etc.).

Does the dump tell you anything specific as to what is happening, or just that it is the same as before? Should we be able to glean anything from this type of dump as to a specific cause, or do they mostly just tell you that a crash occurred, like a marker that something happened in case you are not around to witness it firsthand? These are sitting right next to me, so I am lucky enough to know when panics happen before even seeing the dump file. Are these dumps not capturing enough data to tell WHY the crash happened? Is there something I can tweak in the config to get additional info as to the cause?

Incidentally I got the one panic this morning and tried to get another today but this thing never panicked again. Go figure.

stephenw10

It doesn't mean anything to me. I can see it doesn't have much that's non-generic but I'll run it past some devs tomorrow.

Next step is either to enable a full core dump to analyse or try running the debug kernel.
https://docs.netgate.com/pfsense/en/latest/troubleshooting/debug-kernel.html

Do you have SWAP enabled on those? How big is it?
To get a full core dump usually requires SWAP at least as large as the RAM to dump it to.

rfranzke

@stephenw10 said in Frequent Crashing (Page Fault) After Upgrade to 2.8.0 From Latest 2.7:

It doesn't mean anything to me. I can see it doesn't have much that's non-generic but I'll run it past some devs tomorrow.

Next step is either to enable a full core dump to analyse or try running the debug kernel.
https://docs.netgate.com/pfsense/en/latest/troubleshooting/debug-kernel.html

Do you have SWAP enabled on those? How big is it?
To get a full core dump usually requires SWAP at least as large as the RAM to dump it to.

I'm not sure how one would go about "enabling swap". I didn't do anything to specifically enable it anywhere that I am aware of. I installed using the installer and imported my config from before. If it's not enabled by default, then I likely don't have it enabled. Both boxes have 64GB of RAM installed in them and 2 TB NVMe drives.

Thanks for the link. I'll see about loading up the debug kernel to see if it reveals anything useful.
Thanks for checking with the devs here. Again really appreciate the help on this.

stephenw10

It would be enabled by default but probably not at >64GB so dumping a full core to it may or may not be possible depending on how much RAM is actually in use. But first check how much SWAP there is. It's shown on the dashboard.
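The rule of thumb above can be written down as a small Python sketch. It is only an illustration of the reasoning in this thread, not a pfSense tool: `full_dump_outlook` is a made-up name, and the swap and in-use figures in the example are hypothetical placeholders to be replaced with the values shown on the dashboard. Only the 64 GB of RAM comes from the posts above.

```python
def full_dump_outlook(swap_gib, ram_gib, ram_in_use_gib):
    """Classify whether a full core dump is likely to fit in swap.

    Follows the rule of thumb from this thread:
      - swap at least as large as RAM: it should fit
      - swap smaller than RAM but at least as large as the RAM in use: it may fit
      - otherwise: it is unlikely to fit
    """
    if swap_gib >= ram_gib:
        return "fits"
    if swap_gib >= ram_in_use_gib:
        return "may fit"
    return "unlikely to fit"


RAM_GIB = 64            # from the hardware described in this thread
SWAP_GIB = 4            # hypothetical: read the real value from the dashboard
RAM_IN_USE_GIB = 2      # hypothetical: an idle test box with little memory in use

print(full_dump_outlook(SWAP_GIB, RAM_GIB, RAM_IN_USE_GIB))  # may fit
```

With the real swap size plugged in, anything other than "fits" suggests the debug kernel route may be the easier first step.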