Two 1100s, same rev, same config, 50% update failure rate from 23.05.1 to 23.09.
-
I have two Netgate 1100s, formerly running 23.05.1 and now on 23.09. One is the production box and one is a cold spare/standby: I update the spare, rotate it into production, then update the former production box.
For rloeb, author of "23.09 disaster; reverted to 23.05 and multiple problems," where production uptime is mandated by contract, having a cold spare/standby might be a very worthwhile approach.
That aside, the update process on the cold spare, let's call it box "B," went pretty well. I took notes of each step, in order, so I could reproduce them on the production unit once it was swapped out.
The update process on the (now former) production unit, let's call it box "A," went poorly. I repeated the steps from box "B," but the console output started complaining about a missing file, dumped me into the "#" shell prompt, and stopped there. I exited the shell and the update continued, but it then complained about another missing file. I gave up and reinstalled from scratch using the 23.09 image support had furnished before I started. It's done, and both boxes now have the same configuration and the same version.
I got it going. I'm not going to jump through hoops to troubleshoot the update process, especially given it wasn't reproducible between two boxes of the same type, rev level and configuration. My observations are:
-
This is a typical update experience for me: it has not been repeatable or reliable any time I've tried, even though both 1100s have the same configuration and the same software rev level at the start. I'm not a Unix or network expert, but I'm not a moron either. I've got a degree in computer science and almost 40 years of professional experience, so I should be able to get two boxes with the same configuration and software to do the same thing twice in a row, using the manual and the specific steps written down as I performed them.
-
Each of my two 1100s, purchased many months apart, was defective and had to be replaced: a 100% failure rate brand new out of the box.
-
I am very happy with pfSense's stability and usability, once in production.
-
However, pfSense's software quality management does not seem to be engineering robustness into the update process. Based on the console output, the updater appears to make assumptions about the state of the installation, and it does not trap every failure of those assumptions or recover to a documented state. That leaves one having to nuke the site from orbit and reinstall from scratch, even for modest point-level updates.
-
Netgate's hardware quality assurance also appears to need improvement, given my purchase experience.
Netgate should seriously consider re-engineering the installation and update processes to manage assumptions, trap errors, and fail safe. That would be a better priority than UI corrections and new features.
-
-
Was the 'B' 1100 offline since the last update, or was it still able to pull updated pkgs etc. during that time?
-
@stephenw10, B was online until this morning, when I swapped it out for box A to see how A would work.
As part of the update process on both A and B, I deleted all the packages while still under 23.05.1. I then ran the update to 23.09, and in the case of box A, restored the configuration. With box B, the configuration stayed in place through the update.
Then, with 23.09 running post-update, I reinstalled the packages fresh from the available-packages list in the package manager.
I did not run the updates with the packages installed. The preexisting package configurations seemed to get picked up without difficulty once I reinstalled the packages.
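For anyone who wants to follow roughly the same sequence from the console instead of the GUI, here is a minimal sketch using the stock FreeBSD pkg tool and pfSense's console updater. I actually did all of this through the Package Manager and System > Update pages, so treat the commands as illustrative only; the package name shown is just an example.
```sh
# Rough console equivalent of the steps described above (illustrative only).

# Still on 23.05.1: list the add-on packages so you know what to reinstall later
pkg info | grep pfSense-pkg

# Remove each add-on package before upgrading (example name; repeat as needed)
pkg delete -y pfSense-pkg-Cron

# Run the upgrade to 23.09 (the same tool the console menu's update option uses)
pfSense-upgrade

# After the reboot into 23.09, reinstall the packages from the Package Manager;
# their previous settings remain in config.xml and are picked up automatically.
```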
-
Hmm, that's about as safe as it could be then.
Your description of the failure sounds like it might somehow have pulled in a pkg from 23.09 before the upgrade, resulting in a mismatch at some point. I'm not sure how that could have happened, but clearly if the box wasn't online it couldn't have.