25.07 upgrade on Netgate 4100 gets rolled back

NOCling

I ran into it with my 2100 on a normal reboot with 24.11.
With pfBlockerNG and a RAM disk, the default boot verification time is too short.

I increased it massively, 1800 for my 6100 and 3000 for my 2100.

stephenw10

You can mount the BE by simply running: bectl mount <be_name>. It will show you a mount point in /tmp.

Just be sure to bectl unmount it.

JeGr

@stephenw10 After having a look at the device remotely, it isn't obvious what is happening. The logs show the reboot from 24.11, then nothing for ~13 min, then the boot back into 24.11. So it seems 25.07 doesn't get far enough to actually write any logs. Whatever takes up those 10-12 min after the install leaves no trace.

We've arranged to have someone hands-on on site ASAP who can give us a serial console via LTE or another uplink so we can see what happens after the first reboot. So far no real idea what is happening. Also strange, as this is only one 4100; other 4100s have rebooted and installed fine. So no real indicator right now.

stephenw10

Hmm, strange indeed.

JeGr

@stephenw10 said in 25.07 upgrade on Netgate 4100 gets rolled back:

Hmm, strange indeed.

To follow up on that: we finally got hands-on access on site at the US locations where this happened, and it boiled down to two points:

One box had the above problem because it had 2 large old snapshots and was an older 4100 with very little eMMC storage. After cleaning up those snaps, the new update went fine. Funny though that it didn't hang while installing but somehow when booting the new snapshot, but OK.
The second box took more tries but the problem was ...
drumroll
pfBlockerNG!
The root cause was the misbehavior mentioned multiple times, e.g. in this post: pfBNG creates useless audit snapshots of empty config.xml diffs, and the audit bug somehow caused more than the configured number of configs to be stored.
The box in question had 121,387 config-<timestamp>.xml files in the /cf/conf/backup directory, amounting to around 1.5G. But the disk space wasn't the problem. Somehow the snapshot booted but couldn't access /cf/conf or /cf/conf/backup, because the process that tried to do something there didn't succeed: the directory had too many files, which broke some shell script magic.

After seeing the bootup break at that point, we booted back to the old snapshot, deleted the failed update snap and also deleted all old backups in /cf/conf/backup so that only the configured last 50 backups were still available. Then we re-ran the update, which went through without a problem.
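
A minimal sketch of that cleanup in Python, for illustration only. The function name prune_backups, the config-*.xml filename match and the use of file modification time to decide which backups are newest are all assumptions, and it defaults to a dry run so nothing is deleted until the list has been checked:

```python
import os


def prune_backups(backup_dir, keep=50, dry_run=True):
    """Return backup files beyond the newest `keep`; delete them unless dry_run."""
    backups = []
    # os.scandir iterates lazily, so it stays cheap even with 100k+ files.
    with os.scandir(backup_dir) as entries:
        for entry in entries:
            name = entry.name
            # Assumption: backups are regular files named config-<timestamp>.xml.
            if name.startswith("config-") and name.endswith(".xml") and entry.is_file():
                backups.append((entry.stat().st_mtime, entry.path))

    backups.sort(reverse=True)  # newest first (assumption: mtime reflects age)
    stale = [path for _mtime, path in backups[keep:]]

    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale
```

Calling prune_backups("/cf/conf/backup", keep=50) only returns the files that would go; passing dry_run=False actually removes them. Files that don't match the pattern are left alone.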

So not directly an update problem, but a bug in pfBNG + config history audit management that resulted in thousands of backup files being created (and not cleaned up), which made the /cf/conf subset unavailable while upgrading/booting into it.

Hope that helps!

Cheers!

stephenw10

Urgh, painful. Thanks for following up.

Yup, that backup config bug is resolved in current versions, but that doesn't help at upgrades.

SteveITS

It'd be helpful/preventative, I think, if the upgrade would do a quick check "are there more than ___ backup config files in the directory?" before upgrading. Not sure if that should be more than the configured number, or more than 500, or what, but a few thousand is enough to cause a long (10m?) page load and eventual timeout loading the config history page as it tries to delete them. Perhaps the warning could link to a troubleshooting document page.
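
A rough Python sketch of such a pre-upgrade check, purely to illustrate the idea. The function names, the config-*.xml filename match and the message wording are made up here, and the default of 500 is just the number floated above, not a tested limit:

```python
import os


def count_config_backups(backup_dir):
    """Count files that look like config-<timestamp>.xml in backup_dir."""
    count = 0
    with os.scandir(backup_dir) as entries:
        for entry in entries:
            # Assumption: backup files follow the config-<timestamp>.xml naming.
            if entry.name.startswith("config-") and entry.name.endswith(".xml"):
                count += 1
    return count


def backup_count_warning(backup_dir, limit=500):
    """Return a warning string if there are more than `limit` backups, else None."""
    count = count_config_backups(backup_dir)
    if count > limit:
        return (f"{count} backup config files in {backup_dir} "
                f"(limit {limit}); prune them before upgrading")
    return None
```

An upgrade script could call backup_count_warning("/cf/conf/backup") and refuse to continue, or just print the message, if it returns anything.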

JeGr

@SteveITS Indeed. Also, when running a full update like firmware update -> 25.07, that could perhaps be an additional sanity check to perform, as would a check for old snapshots or disk space < xyGB free. Both things (too many snapshots, too much disk space in use) as well as the file overflow thing were issues we stumbled upon with multiple customers who ran into problems when upgrading their boxes. After the first ones, it was easy to spot on subsequent customers. Even my own homebrew box had the file overflow without me noticing, and I just thought it strange that it used 3.4G of disk space when a normal installation would be around ~2G without snaps. Only then did I remember: oh snap, I'm running pfB too, and haven't added the hotfix for the file overflow that we were testing...

So perhaps those 3 cases would make for a few additional easy pre-flight checks for future updates :)
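
The free-space part of such a pre-flight check could look something like this in Python. Just a sketch: free_space_warning is a hypothetical name, and the minimum is a required parameter because no threshold has been agreed on. shutil.disk_usage reports totals for the filesystem at that path; it does not break down how much of that is held by old snapshots.

```python
import shutil

GIB = 1024 ** 3


def free_space_warning(path, min_free_gib):
    """Return a warning string if free space at `path` is below `min_free_gib`, else None."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_gib = usage.free / GIB
    if free_gib < min_free_gib:
        used_pct = 100 * usage.used / usage.total if usage.total else 0
        return (f"only {free_gib:.1f} GiB free on {path} "
                f"({used_pct:.0f}% used); clean up before upgrading")
    return None
```

Combined with a backup-file count and a look at the list of old boot environments, that would cover the three cases above.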

Cheers

SteveITS

@JeGr I think (?) it tries to check space but it's not uncommon to see posts about failed upgrades for space reasons. Maybe it needs a larger free space check.

We had one client with an old 2440 I recently upgraded through several versions successfully but it's at 94% full because of all the old files and I don't think I want to try 25.11, remotely. :-/

stephenw10

Hmm, I agree. Let me see what we can do here.

vronp

@stephenw10 What would help at upgrades? :-)

I have a 4200 and am having the same problem, presumably. I have pfblockerng installed.

I'm also seeing:

ld-elf.so.1: Shared object "libmd.so.7" not found, required by "pfSense-repoc"

I have a ticket open at Netgate and they want me to do a USB upgrade. That didn't feel right to me so I started searching and found this thread.

vronp

@stephenw10

Any ideas on this? BTW, my memory: 30% of 3890 MiB on a 4200

SteveITS

@vronp said in 25.07 upgrade on Netgate 4100 gets rolled back:

ld-elf.so.1: Shared object "libmd.so.7" not found, required by "pfSense-repoc"

That's different, see
https://forum.netgate.com/topic/198754/ld-elf.so.1-shared-object-libmd.so.7-not-found-required-by-pfsense-repoc

But too many old config files can be a problem also, sure. How much free disk space do you have?

stephenw10

@vronp said in 25.07 upgrade on Netgate 4100 gets rolled back:

I'm also seeing:

ld-elf.so.1: Shared object "libmd.so.7" not found, required by "pfSense-repoc"

That's just an ugly error; it should not prevent upgrading. If you run pfSense-repoc-static -N at the CLI, it should succeed as expected, and that's what the upgrade uses.

vronp

@stephenw10

Thanks. I also discovered 15,000 files in /cf/conf/backup

It seems I need to clean that up. Is there a limit setting for backups there or is this the pfblockerng bug that was mentioned?

SteveITS

@vronp The default is 30, I believe. Fixed in 25.07:
https://docs.netgate.com/pfsense/en/latest/releases/25-07.html#configuration-backend

pfB just makes it worse by generating one per cron job (default per hour).

Diagnostics > Configuration History will time out while it tries to delete them all; just refresh every time it does. Or delete them manually.

vronp

@SteveITS
28% of 3890 MiB
28% of 4.6G (zfs)

I also just found 15,000 files in /cf/conf/backup

stephenw10

This is a bug. It should be limited to 30 backups there. The bug was that it only pruned the backups when the user visited the Diag > Backup&Restore page. If you visit that page it will try to prune them. It might take a while if you have 15K files! It's fixed in 25.07.

vronp

@SteveITS Thank you. I'm going to try to run an upgrade again as I'm hoping that the problem described above (copied below) is the cause of my problem even though I only have 15,000 files in that directory.

"The box in question had 121,387 config-<timestamp>.xml files in the /cf/conf/backup directory, amounting to around 1.5G. But the disk space wasn't the problem. Somehow the snapshot booted but couldn't access /cf/conf or /cf/conf/backup, because the process that tried to do something there didn't succeed: the directory had too many files, which broke some shell script magic."

stephenw10

Yeah, it tries to parse all the files in that folder when it runs the config upgrade and has a really bad time! It should be fine after pruning them.