Update to 2.3.2 Failed, Now Can't Update at All
-
First, I have to get this off my chest: I'm a bit annoyed. I switched from a Linux-based firewall to pfSense a couple of years ago because pfSense was known to be the most stable, dependable, reliable solution among the easy-to-use, GUI-based open source firewall projects. These last few update cycles, though, it seems like each update breaks harder and harder and requires more and more administrative effort just to get back to a stable, usable system. I'm a pretty competent engineer, so I'm comfortable digging in when I have to, but damn, I really shouldn't have to do a multi-day deep dive into pfSense and BSD every time I hit the flashing "update now!" icon on the GUI dashboard. That's just pathetic. Do new production releases of pfSense even get any serious testing anymore? How can there be such major regressions in every new release, even minor releases? Sigh. OK, end rant. Here's my issue and the history leading up to it…
I'm running the nano version of pfSense on Alix hardware.
A few months ago I attempted to upgrade from whatever version of pfSense I was running at the time to 2.3.1. I can't remember everything that went wrong in the process, but it was not a smooth upgrade. I should have kept notes, but it's my home network and I was just clicking an "update now" button in a GUI, so I wasn't expecting a big, involved process that I'd need to document. One thing I do remember is that the nrpe2 package had been removed. (Why does the upgrader not warn you about deprecated packages you're running, btw? It just rips them out! Jesus.) Because I use Nagios to monitor quite a number of things behind this firewall, that was a deal-breaker. I spent some time researching potential solutions and workarounds, including one describing how to bring the package in manually from BSD. Just as I was about to try that route, I found a post stating that the nrpe2 package was actually being put back in. Sure enough, the next day I updated again, and the package was back. I think I had to do some reconfiguration to get everything working again; I honestly can't recall whether I had to edit anything from the shell, but I don't think I did, I believe it was all through the GUI. I eventually ended up with nrpe2 working again and a box running 2.3.1-RELEASE-p1. Again, the nrpe2 package wasn't the only issue with those two upgrades, just the one I remember wasting the most of my time on. :-( But I got through it all in the end.
Today I noticed there was an update available again, so with some anxiety given the problems with the last two updates, I hit the dashboard button. It zipped through the update process VERY quickly, as in obviously far too quickly - it took maybe three seconds from the time I hit the button until the time it claimed it was done and waiting for a reboot. Reboot it did, and I found that sure enough, nothing had been updated, I'm still on 2.3.1-RELEASE-p1. NOW though, the dashboard was saying "Unable to check for updates", so something obviously got updated/borked. I found a number of threads here about this issue, and tried the things suggested (like hitting "save" on the "update settings" dialog, verifying that RAM disks aren't enabled, rebooting, etc.) Same thing. So I tried an update from the command line. It produced quite a lot of output, but eventually failed. So I rebooted, tried again, and got the same thing: lots of output, but ultimately failure. I noticed a message about "The process will require 2 MiB more space." sprinkled here and there throughout the output of the upgrade process, which does not make any sense to me: if that's an error indicating there isn't enough space, why doesn't the installer exit? It just plows ahead instead? And also, df shows considerable space available on every partition, so I think that error message is a red herring, although I am curious what is causing it:
Filesystem        1K-blocks   Used    Avail Capacity  Mounted on
/dev/ufs/pfsense1   1890014 495580  1243233    29%    /
devfs                     1      1        0   100%    /dev
/dev/ufs/cf           50527   6284    40201    14%    /cf
/dev/md0              39196    208    35856     1%    /tmp
/dev/md1              59036  20868    33448    38%    /var
devfs                     1      1        0   100%    /var/dhcpd/dev
The box seems to be working normally, other than the inability to check for updates and the inability to install updates. If I go to System->Update in the GUI, it shows nothing next to "Current Base System" and "Latest Base System", and an infinitely rotating yellow gear next to "Retrieving".
I've attached the output from the failed update attempt from the command-line. The last few lines were this:
The process will require 2 MiB more space.
[1/66] Upgrading gettext-runtime from 0.19.7 to 0.19.8.1…
[1/66] Extracting gettext-runtime-0.19.8.1: …....... done
[2/66] Upgrading python27 from 2.7.11_2 to 2.7.12…
[2/66] Extracting python27-2.7.12: ….
pkg: Fail to extract /usr/local/lib/python2.7/lib2to3/tests/test_fixers.py from package: Lzma library error: Corrupted input data
[2/66] Extracting python27-2.7.12… done
>>> Locking package pfSense-kernel-pfSense_wrap... done.
pkg: open(/bin/sh): No such file or directory
pkg: open(/bin/sh): No such file or directory
pkg: Unable to determine ABI
pkg: Cannot parse configuration file!
pkg: open(/bin/sh): No such file or directory
pkg: open(/bin/sh): No such file or directory
pkg: Unable to determine ABI
pkg: Cannot parse configuration file!
So now what do I do? I don't even know where to start. Sigh. :-(
-
A bit aggravated here also, but as a software engineer, I know the hell that can break loose upon changing a single instruction. What is concerning is that there are no gurus making comments as to the problem. From your comments, I know now to never update until after several days at the least. I am new to all this and had just gotten everything running nicely. I thought a restore from a backup would fix it, but the only thing it did was put the snort menu item back in the services menu. That menu now takes me to a 404 page. I have tried to start snort via the Status/Services menu item, but it just dies every time. The next thing I am considering is a 'Reset to factory default' from the terminal. I guess we will be hearing something soon.
Update: I found the fix here: https://forum.pfsense.org/index.php?topic=115777.0
-
The specific errors you cite point to a dead/dying disk. It was probably already on the way out, but the activity from the update likely pushed it over the edge.
-
The specific errors you cite point to a dead/dying disk. It was probably already on the way out, but the activity from the update likely pushed it over the edge.
Thanks Jim,
What specific errors lead you to the conclusion that it's likely a dead/dying disk? As I mentioned, this is an ALIX system running Nano, so there are actually no spinning disks in the system; it runs off a 4GB CompactFlash card. (Not that CF cards can't fail; indeed, they're known for failing after excessive writes. I'd just expect a more catastrophic, complete failure, rather than the slow death of a traditional magnetic disk.)
It's also interesting that the system seems to otherwise boot and run normally, and every attempt to upgrade now fails at the exact same point.
I have another CF drive. (I had to replace it even though it was working flawlessly to work around a specific pfSense bug with certain older CF drives.) I could try DD'ing this one to that to confirm your hypothesis if you're fairly certain that's what it is? I don't see anything in syslog to indicate a drive failure. As I said, I'm not really a BSD guy, so I may not be looking in the right place?:
[2.3.1-RELEASE][root@gw-home.nullmodem.org]/root: clog /var/log/system.log | grep ada0
Aug 1 21:34:50 gw-home kernel: ada0 at ata0 bus 0 scbus0 target 0 lun 0
Aug 1 21:34:50 gw-home kernel: ada0: <SanDisk SDCFHSNJC-004G HDX 7.08> CFA device
Aug 1 21:34:50 gw-home kernel: ada0: Serial Number BFZ072215011442
Aug 1 21:34:50 gw-home kernel: ada0: 100.000MB/s transfers (UDMA5, PIO 512bytes)
Aug 1 21:34:50 gw-home kernel: ada0: 3815MB (7813120 512 byte sectors)
Aug 1 21:34:50 gw-home kernel: ada0: Previously was known as ad0
Aug 1 21:35:29 gw-home snmpd[43749]: disk_OS_get_disks: adding device 'ada0' to device list

I'm going to start out by putting it in single-user mode and trying a fsck before swapping out the drive. I'll post my results when done.
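For anyone wanting to try the same clone-and-compare step, here's a minimal sketch. The device names on real hardware would be something like /dev/sdb and /dev/sdc (an assumption; check with lsblk first). This version runs against scratch image files instead of real devices so it's safe to copy-paste:

```shell
# Sketch of cloning one CF card to another with dd, then verifying the copy.
# Scratch image files stand in for the real cards here.
SRC=/tmp/cf_source.img
DST=/tmp/cf_clone.img

# Stand-in for the 4GB card (scaled down to 4 MiB for the demo).
dd if=/dev/urandom of="$SRC" bs=1M count=4 2>/dev/null

# conv=sync,noerror keeps dd going past unreadable sectors on a flaky card,
# padding them with zeros instead of aborting.
dd if="$SRC" of="$DST" bs=1M conv=sync,noerror 2>/dev/null

# Bit-for-bit comparison; cmp is silent on success.
cmp "$SRC" "$DST" && echo "clone verified"
```

If the clone boots and behaves identically, that's decent evidence the original card's data is intact and the failure lies elsewhere.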
-
Well, I'm having all kinds of problems getting a serial console on this thing now, due to a buggy usb-to-serial adapter. I just ordered a replacement, but realistically it will probably be a week or more before I'll be able to get this into single user mode to run a fsck or clone the flash drive. :-(
In the meantime, if anyone has any suggestions that I might try on the running system either to further troubleshoot or to confirm that the flash may be going bad that don't require console access I'd really appreciate it.
-
Every error in the original output. Corrupted data, missing files, unable to read, etc.
CF is so cheap it isn't worth the hassle. Just image a fresh one, boot it up, restore a backup. Doesn't need console access at all.
-
Every error in the original output. Corrupted data, missing files, unable to read, etc.
Well, I admit I'm mainly a Linux, Windows, and hardware guy, I'm by no means a pfSense or BSD expert, but in my experience those sorts of errors can easily be caused by other things: a corrupt package in the repo, a corrupt but cached local package, running out of disk space, etc. Really, any serious exception that isn't handled by the script could cause things like this. If, for example, the script fails to fully extract a package under any of those circumstances, doesn't do adequate error checking, and then tries to execute now-missing or corrupt files from that package, you'd get errors exactly like these. And as I mentioned, I kept getting that message saying the process "will require 2 MiB more space". If it were disk corruption, I wouldn't expect it to fail so consistently at exactly the same place every time.
My gut feeling is that it's not a disk error, but you're also right: it's an embedded system and I should be thinking of it like that. The right way to troubleshoot this is to just reflash a clean image to a new CompactFlash card and restore the config, not try to deep-dive the root cause, so I did that today. Again, I don't have reliable console access so I couldn't watch the boot, but on the new flash card it seemed to come up after restoring the config (I could get to the internet from hosts behind it). When I tried to access the web UI, though, it stopped working (could no longer get to the internet, could no longer ping the router), and one of the status lights was flashing "SOS" in Morse code. (I assume that's the BSD indicator of a kernel panic or something similar?) It did this several times, but then on (I believe) the third attempt it seemed stable, and it has been running fine since. I'm now on 2.3.2-RELEASE, and the dashboard is able to check for updates again. Those initial crashes do concern me though. My only guess is that there are first-run scripts that were choking on something in my config and causing the crash? I noticed it had to generate new SSH keys on the first run, for example; perhaps something like that was crashing it? Shrug.
This spare card is a smaller, slower card however, so I still need to determine if the original card was the reason for the failure. I therefore plugged it into a Linux box and used dd to make an image of it. I'm now running a destructive badblocks test against it. Badblocks makes four passes, one each with the patterns 0xaa, 0x55, 0xff, and 0x00. It's gotten through the first three of those four already with no errors. I'll post the results when it's finished. So far it's not looking like a disk error though.
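For reference, the destructive test is just `badblocks -wsv /dev/sdX` (from e2fsprogs on the Linux box; it destroys all data on the target). The gist of what each of its four passes does can be sketched in plain shell. This version runs against a scratch image file standing in for the CF card, so it's safe to run as-is:

```shell
# Sketch of what a destructive badblocks pass does: write a known byte
# pattern across the whole target, read it back, and compare.
# On real hardware: badblocks -wsv /dev/sdX  (DESTROYS all data!)
# A scratch image file stands in for the CF card here.
DEV=/tmp/cf_test.img
dd if=/dev/zero of="$DEV" bs=1M count=4 2>/dev/null

for hex in aa 55 ff 00; do     # the four patterns badblocks -w uses
    # Build a buffer of the target's size filled with the byte 0x$hex.
    dd if=/dev/zero bs=1M count=4 2>/dev/null \
        | tr '\000' "\\$(printf '%03o' "0x$hex")" > /tmp/pattern.bin
    dd if=/tmp/pattern.bin of="$DEV" bs=1M 2>/dev/null       # write pass
    cmp -s /tmp/pattern.bin "$DEV" \
        || { echo "mismatch on pattern 0x$hex"; exit 1; }    # verify pass
done
echo "all four passes clean"
```

Any sector that can't hold and return all four patterns would show up as a mismatch, which is exactly the failure mode a worn-out CF card should exhibit.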
CF is so cheap it isn't worth the hassle. Just image a fresh one, boot it up, restore a backup. Doesn't need console access at all.
Easy for you to say. :-) For some of us, $20 is still $20. True, it's not a whole lot of money, but I'd prefer not to have to spend it to replace something that isn't broken. I already had to replace a perfectly functional card because of this bug, which the devs have clearly said multiple times they will not fix: https://redmine.pfsense.org/issues/4814 :-( I really don't want to throw this one away and replace it again if this is just an upgrade script bug, as I suspect it is. I'm sorry but pfSense seems to be getting sloppy based on my recent experiences. It's a shame, it used to be such a solid product. :-(
-
Well, more good news… I went through some boxes of gear and found an old PCI serial card, so I was able to install that in a server that's located next to the firewall and so now I finally have a reliable serial console on it. That makes life easier. :-)
I ran the destructive badblocks check against the 4GB flash drive, twice. It came back clean both times. (So that's 8 cycles of write/read of various bit patterns to every block on the disk.) So I'm fairly sure the flash is ok. I wrote the 4GB 2.3.2 image to it, booted, connected to it and restored my config, and this time it booted up fine, it didn't become unresponsive and require several reboots to get it stable like it did when I first booted off of the 2GB flash earlier today. I wish I had had the serial console then so I could have seen what was going on. But in any event, it seems fine now with the original flash drive back in it.
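In case it helps anyone else doing the reflash: after dd'ing the release image, it's worth reading the card back and comparing checksums, which catches the kind of silent write failure a marginal card can produce. A sketch follows; the file names are placeholders, and a scratch file stands in for the real card so this is safe to run:

```shell
# Sketch: write an image to a card with dd, then read it back and compare
# checksums to catch silent write errors. Paths are placeholders; a scratch
# file stands in for the real /dev/sdX target here.
IMG=/tmp/pfSense-demo.img
DEV=/tmp/cf_card.img

dd if=/dev/urandom of="$IMG" bs=1M count=4 2>/dev/null   # stand-in image

# conv=fsync forces the data out to the target before dd exits.
dd if="$IMG" of="$DEV" bs=1M conv=fsync 2>/dev/null

# Read back exactly the image's length (a real card is usually larger than
# the image) and compare digests.
size=$(stat -c%s "$IMG")
a=$(sha256sum "$IMG" | cut -d' ' -f1)
b=$(head -c "$size" "$DEV" | sha256sum | cut -d' ' -f1)
[ "$a" = "$b" ] && echo "image verified" || echo "MISMATCH: bad write"
```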
Also, I learned that the flashing "SOS" LED simply means there's an unread alert, so that was a red herring, not an indicator of a problem.
So the bottom line is, there was nothing wrong with the disk; the pfSense automatic update process did indeed break my box rather badly. Thank you though Jim, for getting me thinking about this the right way. I sometimes forget that pfSense truly is an embedded system with a single XML config file, so it was dumb of me to be banging on it trying to fix it when I could simply re-flash and restore config. That got me back up and running.
Still sucky that clicking an "update now!" button in a GUI can break your config though, even if it's easy to recover from. :-(