ACB backup - issues with network interface config and package restore
-
Hi All,
Friday morning fun - woke up to LibreNMS alerts that something was wrong with our main PFSense firewall... I have contingency plans for this so logged in from home via a backup VPN and used out of band KVM console access for the hardware running PFSense - the problem was immediately apparent from the PFSense console errors, boot SSD failure on a 4 year old server...
The firewall (2.8.1 CE) was still actually running for routing traffic, however anything that needed the slightest disk access such as Squid or OpenVPN was dead and unresponsive and I could not log in via SSH or Web Interface.
I use Auto Config Backup (ACB) for automatic backups, keep a manual (albeit slightly out of date) backup, and have a Proxmox virtual machine set aside for such contingencies, so I proceeded to attempt to restore an ACB backup. That's when I started to run into problems.
I could have sworn that when I have attempted to restore PFSense backups onto different hardware in the past and there is a mismatch in interface naming (as there usually would be for different hardware, in this instance, virtual hardware) that you are automatically taken to the interface configuration screen to review the interface name mismatches and make adjustments BEFORE rebooting.
However, as soon as the backup was uploaded the only option was a prompt to reboot, with no warning that there were interface mismatches and no opportunity to correct them. Does the interface mismatch detection only work when uploading a manual backup, not when restoring an ACB backup?
Naturally this came up with interface mismatches on the next boot, which could not be dealt with unless you had out of band KVM access, as I do, or in this case access to the host running the VM. However, I have 12 VLANs configured and the console kept printing other messages, so the constantly scrolling text made it nearly impossible to manually configure 12 different VLANs at the console.
So I started again, and this time instead of choosing reboot when offered I went to the VLAN configuration screen and updated the parent interface for all the VLANs, then went to interface assignments and updated those as well. I then manually rebooted.
This configured the network correctly without having to mess around at the local console.
Second problem - none of the packages that are configured in the backup reinstalled... the "Reinstall all packages" button in the backup/restore section that you usually see after a restore from backup was missing and there was no warning that packages were installing in the background. (I also tailed /var/log/system.log for a bit to make sure there was no background installation happening)
All the packages still had menu options visible for example squid, however they were trying to go to PHP pages that were not installed. Likewise the service status widget still listed all the packages that should have been there. (But services were obviously not running)
It's as if all the configuration files for packages were present but the system simply forgot to reinstall them. So I went to package manager and one by one installed everything that was missing.
After that things seem to be working OK - all the configuration was retained.
Should restoring an ACB backup trigger automatic package reinstallation for missing packages? I think it should.
The following day I decided for various reasons that I wanted to move PFSense from the Proxmox host to a much more powerful server running Hyper-V, so I went through the whole process again but under more controlled circumstances at a more suitable time of day.
This time I did a fresh install of PFSense CE 2.8.1 via the Netgate installer into the Hyper-V VM, and restored the same ACB backup.
Again it only offered to reboot and did not detect the interface mismatches, but I declined and configured the network interfaces for the VLANs and primary interfaces before rebooting.
After reboot the network was correctly configured but no packages were installed once again, and there was no background installation taking place. Oddly, this time the "Reinstall all packages" button WAS present, however it only offered to install one of the 12 packages that I had installed that were referenced in the backup. (IPerf)
So again I had to manually install all the packages from memory one by one and then everything was fine with package configuration retained.
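For anyone who would rather not rebuild the package list from memory, here is a minimal sketch that reads the package names out of a backed-up configuration. It assumes you have the configuration available as plain XML and that installed packages are recorded as `<package><name>…</name></package>` entries under `<installedpackages>`; the sample below is a hypothetical, heavily trimmed stand-in for a real config.xml, and the function name is my own.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample. A real config.xml contains far more,
# and the layout shown here is an assumption about how packages are recorded.
SAMPLE_CONFIG = """<pfsense>
  <installedpackages>
    <package><name>squid</name></package>
    <package><name>iperf</name></package>
    <squid><config/></squid>
  </installedpackages>
</pfsense>"""


def packages_in_backup(config_xml):
    """Return the sorted package names found under <installedpackages><package>."""
    root = ET.fromstring(config_xml)
    names = []
    for pkg in root.findall("./installedpackages/package"):
        name = pkg.findtext("name")
        if name:
            names.append(name.strip())
    return sorted(names)


if __name__ == "__main__":
    # Print a checklist to work through in the package manager.
    for name in packages_in_backup(SAMPLE_CONFIG):
        print(name)
```

This only produces a checklist; the packages themselves still have to be installed through the package manager as described above.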
I'm not sure whether these two issues are bugs, or whether proper handling of interface mismatches and reinstallation of packages is simply not implemented in ACB restores.
Hardware failure resulting in having to restore a PFSense backup into new hardware or a virtual machine would seem to be a common use case for nightly ACB backups, but if interface mismatches and package reinstallation are not handled correctly this leads to a lot more downtime and hair pulling when the restore process doesn't go smoothly.
The failure to warn about interface mismatches before reboot could be particularly devastating in the case that you don't have KVM access to the underlying hardware and then need physical access to fix the interface mismatches.
-
Hmm, interesting. I would expect both those things, as you did. However, I have rarely used ACB between different hardware.
What interface types did you have configured? It shouldn't really matter in the second case since I assume you moved to hn(4) NICs?
-
Hi Stephen, thanks for the reply - I wasn't sure when posting whether prompting to choose interfaces if there is a mismatch and automatic package re-installation were expected to happen (and therefore I'm seeing a bug) or whether ACB restores don't implement this functionality.
When I do a manual restore from a manually saved local backup, it does detect interface device name mismatches and gives a chance to correct them, and it also triggers automatic package reinstallation after a reboot. Neither of these seems to happen when restoring an ACB backup: you just get a prompt to reboot, and if you follow that advice you can easily reboot to a partially broken system where interface naming has to be fixed at the console and packages have to be reinstalled one by one.
The interfaces on the original hardware are ix0 for LAN (with 12 VLANs attached to that) and ix1 for WAN. Both are ports on a dual port 10Gb Intel fibre PCI card.
On Proxmox, which was where I initially restored to in a hurry, the interface names are vtnet0 for LAN and vtnet1 for WAN using VirtIO paravirtualized adaptors.
On Hyper-V, for the second migration, the interfaces in a Gen 2 machine are hn0 for WAN and hn1 for LAN, and the same issue happened there.
In both cases the absence of ix0 and ix1 should ideally trigger a warning that ix0 and ix1 don't exist and offer for me to remap them.
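To illustrate the kind of check I mean, here is a rough sketch that compares the NIC names a configuration refers to against the NICs the new machine actually has. It is not how PFSense does it internally; it assumes assigned interfaces appear as `<interfaces><name><if>…</if>` and VLAN parents as `<vlans><vlan><if>…</if>`, that VLAN interfaces are named `parent.tag` (e.g. `ix0.10`), and the sample XML and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample modelled on the setup described above.
SAMPLE_CONFIG = """<pfsense>
  <interfaces>
    <wan><if>ix1</if></wan>
    <lan><if>ix0</if></lan>
    <opt1><if>ix0.10</if></opt1>
  </interfaces>
  <vlans>
    <vlan><if>ix0</if><tag>10</tag><vlanif>ix0.10</vlanif></vlan>
  </vlans>
</pfsense>"""


def referenced_parents(config_xml):
    """Collect the parent NIC names used by assignments and VLAN definitions."""
    root = ET.fromstring(config_xml)
    parents = set()
    for node in root.findall("./interfaces/*/if"):
        if node.text:
            # Assumption: "ix0.10" is VLAN 10 on ix0, so keep only the parent.
            parents.add(node.text.strip().split(".")[0])
    for node in root.findall("./vlans/vlan/if"):
        if node.text:
            parents.add(node.text.strip())
    return parents


def missing_interfaces(config_xml, available):
    """Return referenced NIC names that do not exist on the target machine."""
    return sorted(referenced_parents(config_xml) - set(available))


if __name__ == "__main__":
    # On the Proxmox VM only vtnet0 and vtnet1 exist.
    print(missing_interfaces(SAMPLE_CONFIG, ["vtnet0", "vtnet1"]))
```

A real check would also need to skip virtual interfaces (OpenVPN, LAGG, bridges and so on), which this sketch would report as missing too.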
I get the impression that ACB restores currently assume you are restoring a configuration backup to the same hardware and that required packages are already installed - in other words that you're only doing a configuration rollback not a fresh install or a hardware migration.
But when you restore onto a fresh installation on the same hardware you're going to hit the issue that packages are not reinstalled automatically, and if you have to restore on different hardware you're also going to hit the interface naming issue.
At the end of the day it's not a huge problem now that I know about it - just decline the reboot prompt, reconfigure network interfaces in the GUI, reboot, then install packages manually.
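An alternative to clicking through the GUI would be to rewrite the interface names in the configuration before restoring it. The following is only a sketch under the same assumptions about the XML layout (assigned interfaces in `<interfaces>`, VLANs in `<vlans>` with `<if>` and `<vlanif>` elements, VLAN interfaces named `parent.tag`). The sample XML and the function name are hypothetical, and I have not tried restoring a file produced this way.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample modelled on the setup described above.
SAMPLE_CONFIG = """<pfsense>
  <interfaces>
    <wan><if>ix1</if></wan>
    <lan><if>ix0</if></lan>
    <opt1><if>ix0.10</if></opt1>
  </interfaces>
  <vlans>
    <vlan><if>ix0</if><tag>10</tag><vlanif>ix0.10</vlanif></vlan>
  </vlans>
</pfsense>"""


def remap_interfaces(config_xml, mapping):
    """Return the XML with old NIC names replaced according to `mapping`."""
    root = ET.fromstring(config_xml)

    def rename(name):
        # Keep any ".tag" suffix so "ix0.10" becomes "vtnet0.10".
        parent, sep, rest = name.partition(".")
        return mapping.get(parent, parent) + sep + rest

    for path in ("./interfaces/*/if", "./vlans/vlan/if", "./vlans/vlan/vlanif"):
        for node in root.findall(path):
            if node.text:
                node.text = rename(node.text.strip())
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Mapping from the original hardware to the Proxmox VM.
    print(remap_interfaces(SAMPLE_CONFIG, {"ix0": "vtnet0", "ix1": "vtnet1"}))
```

Other sections of a real configuration may reference NIC names as well (LAGG members or bridges, for example), so treat this as a starting point and review the result before restoring it.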
But it feels like it could go more smoothly than this - and ACB does allow you to paste in the secret from your old installation to recover backups onto new hardware, so it seems at least some thought was given to restoring ACB backups on new hardware.
I will be doing the reverse, restoring back onto the original hardware, soon, possibly tonight or tomorrow as I am expecting replacement SSDs today.
Previously it was a single SSD, this time I'm going to configure two in a ZFS mirror... even though there are no hot swap bays to swap a faulty drive without a power cycle (small 1U rack server without front drive bays) it will still be better than a single point of failure and give me time to put a standby virtual machine in place to cover the downtime needed to swap a failed drive in the unlikely event it happened again...
For anyone curious, the failed drive is a Crucial MX500 2.5" SSD (CT250MX500SSD1) which is running the latest firmware - it has failed in a READ ONLY state where it is impossible to write to, but it can still be read and even the firmware version can be queried.
The PFSense console log was full of errors reporting unable to write, and this was confirmed plugging it into another PC which also can't write to it, either under Windows or even running Spinrite.
Oddly, Crucial Storage Executive under Windows says the drive is fine despite the fact that nothing can write to it!
-
Mmm, it does seem like a bug. I'm surprised there isn't something open for it. I'll try to replicate it here and open something.