Upgrade to 25.07.1 experience on two SG1100s...
-
I have two shelf-spare 1100s. Usually, I try upgrades first with them before upgrading my production 4200.
Both 1100s had the same configuration but otherwise had not been used or plugged into power for some months.
Configurations are pretty typical. Packages are:
apcupsd
Cron
mailreport
pfBlockerNG
Status_Traffic_Totals
System_Patches
The first upgrade went pretty well. I had to get past the obnoxious warning that another update was currently running. I did that by unplugging the WAN cable briefly. I had to update the certificates after the update, but that went fine as well.
I did have an error that the GUI was out of memory. I adjusted the parameter in System>Advanced and that fixed that. Why the default memory was enough in 24.11 but not 25.07.1 is a mystery to me, but obviously the GUI got bigger.
Netgate should test resource availability during the update to determine if the configuration is sufficient and adjust, rather than just leaving these things to blow up after the upgrade.
The second upgrade was a mess. I repeatedly got that 'other update running' error in the GUI, and unplugging the WAN cable did not resolve it. I finally ran the upgrade from the console with option 13 and it worked.
But then, every time the update finished, the WAN would be offline, even though it had been online on 24.11 just prior to and during the update. No changes besides the update.
I reverted to the 24.11 baseline boot environment and ran the update again with the same offline WAN problem again appearing after the upgrade.
I tried to bring it online with some usual hacks like saving the gateway or interface and restarting, or rebooting. Nothing. Something about the upgrade process itself took the 25.07.1 WAN offline and left it that way. But only on this 1100, not the first one. Weird. Logs weren't helpful.
So for this second update I dusted off and nuked the site from orbit by doing a new from-scratch 25.07.1 installation from the installation USB media and it worked fine. After restoring the config backup, the WAN was online and all was well. Of course, I did have to update the GUI certificates and increase the GUI memory as I did on the first one.
So what I can hypothesize from this experience of updating two same-config 1100s is that:
- The upgrade process to 25.07.1 is not a predictable, repeatable experience, even with Netgate hardware in near-identical configurations. One 1100 configuration was a restore of the other 1100's configuration with one change in the system name to distinguish them. Why does this inconsistent upgrade happen?
- With two upgrades done, I have no assurance that my upgrade of the 4200 in production will go smoothly rather than turn into a mess like my second 1100 experience. I guess I'll have to have one of the backup 1100s ready to go in case the 4200 upgrade goes sideways and takes a long time to fix.
I hope Netgate will put in the effort to examine the code and process of upgrades and see why the experience might be so inconsistent on Netgate hardware nearly-identically configured.
What assumptions is the upgrade process making about the initial state of the target system that it should not be making? Is there technical debt in building pfSense on a 31-year-old OS, albeit a maintained and expanded one, that is itself based on a 54-year-old OS?
I like this system and it works well once these snags are worked around. I would love the system to deliver a more consistent and predictable upgrade experience.
I'll report back when I upgrade the 4200. I'm hoping it goes more smoothly maybe because it's running all the time instead of sitting on a shelf. Maybe that's a factor.
-
-
The 1100 hits most of those issues because of memory limitations, especially with pfBlocker active if you have large lists. I wouldn't expect a 4200 to have a problem with that.
When you say the WAN is offline do you mean the WAN gateway showed as offline or the actual WAN interface/port was down?
-
@stephenw10 thank you for your response.
I don't think pfBlocker was a factor because, on both 1100s, I removed all the packages, including pfBlocker, prior to starting the upgrade process and I had not re-installed it yet after the upgrade reboot when the problems manifested.
I know the 1100 is getting memory constrained. But why the different experience with two identically configured 1100s upgraded in short succession?
Netgate should consider incorporating memory and storage checks during the upgrade process so it's less likely to get into a mess that gets dumped into the lap of the end user.
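To illustrate the kind of check I mean, here is a minimal sketch in Python. It is not how the pfSense upgrader works; the thresholds and the function names (`preflight_problems`, `free_disk_mb`) are made up for illustration, and the free-memory figure is passed in rather than read from the system because how to read it depends on the platform.

```python
import shutil

# Hypothetical thresholds for illustration only; these are NOT
# Netgate's actual requirements for any release.
MIN_FREE_DISK_MB = 1024
MIN_FREE_MEM_MB = 256


def free_disk_mb(path="/"):
    """Free space on the filesystem holding `path`, in whole MB."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def preflight_problems(disk_mb, mem_mb,
                       min_disk_mb=MIN_FREE_DISK_MB,
                       min_mem_mb=MIN_FREE_MEM_MB):
    """Return a list of problems; an empty list means OK to proceed."""
    problems = []
    if disk_mb < min_disk_mb:
        problems.append(f"free disk {disk_mb} MB is below {min_disk_mb} MB")
    if mem_mb < min_mem_mb:
        problems.append(f"free memory {mem_mb} MB is below {min_mem_mb} MB")
    return problems


if __name__ == "__main__":
    # The memory value here is a placeholder supplied by the caller.
    for problem in preflight_problems(free_disk_mb("/"), mem_mb=512):
        print("WARNING:", problem)
```

The point is only that the upgrade could refuse to start, or at least warn, when the list of problems is not empty, instead of failing after the reboot.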
On my second 1100, upon reboot into 25.07.1 following the upgrade, it showed the WAN gateway as down; it wouldn't ping, and I could not ping out from any LAN node.
I can't recall clearly, but I think the interface was up. This was the case even though the WAN was up and working prior to the upgrade and reboot; otherwise the upgrade would not have happened.
When I rebooted back into the 24.11 boot environment, it showed the WAN gateway and interface as up, and it could ping and reach the internet. So something in the upgrade process hosed the WAN gateway.
-
How is the WAN configured on that 1100? Just a DHCP WAN?
-
@stephenw10, correct, for IPv4.
IPv6 is DHCP6 with some advanced options to get Starlink IPv6 addresses. The IPv4 had been working as configured for several years, and IPv6 for about a year.
After doing the from-scratch reinstall of 25.07.1 and restoring a backup configuration from the first 1100 onto the second 1100, the WAN worked again without any modifications.
The same WAN configuration works on the 4200 after Netgate converted the config file for me from the 1100 to my new 4200 when I bought it. The overall 4200 config has drifted apart from the 1100 since, but the WAN is still the same except for the monitoring IP targets.
The 4200 is still on 24.11 until I can get a dst to try an upgrade with enough time to deal with a failure should it happen.
-
Hmm. Well, if you can replicate the WAN failure after upgrade, it would be good to grab a status_output file to check. I've not seen that on an 1100 here.