Backup before upgrade fills hard drive, can't find kernel
-
After attempting an upgrade from 2.2 to 2.2.1, the system was left unable to boot. I had a 12GB drive with several GB allocated to squid cache. The auto-backup failed, leaving the hard drive full. As you can see, it appears the upgrade continued after the backup failed.
[2.2-RELEASE][root@pfSense.localdomain]/tmp/hdrescue/cf/conf: df -h /dev/ada*
df: /dev/ada0: Invalid argument
df: /dev/ada0s1: Invalid argument
df: /dev/ada0s1b: Invalid argument
Filesystem Size Used Avail Capacity Mounted on
/dev/ada0s1a 12G 11G -241M 102% /tmp/hdrescue
[2.2-RELEASE][root@pfSense.localdomain]/tmp/hdrescue/cf/conf:
[2.2-RELEASE][root@pfSense.localdomain]/tmp/hdrescue/cf/conf: less firmware_update_misc_log.txt
tar: Failed to set default locale
bzcat: Compressed file ends unexpectedly;
perhaps it is corrupted? Possible reason follows.
bzcat: No such file or directory
Input file = /tmp/chflags.dist.usr.bz2, output file = (stdout)
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
shutdown: [pid 61121]
firmware_update_misc_log.txt (END)
[2.2-RELEASE][root@pfSense.localdomain]/tmp/hdrescue/cf/conf: less upgrade_log.txt
pfSenseupgrade upgrade starting
Sat Mar 21 02:54:56 EDT 2015
-rw-r--r-- 1 root wheel 82M Mar 21 02:17 /root/latest.tgz
MD5 (/root/latest.tgz) = 0dffafc9f3f815bc0cdba775b62ccdaf
/dev/ada0s1a on / (ufs, local)
devfs on /dev (devfs, local)
/dev/md0 on /var/run (ufs, local)
devfs on /var/dhcpd/dev (devfs, local)
fdescfs on /dev/fd (fdescfs)
/dev/md10 on /var/tmp/havpRAM (ufs, local, soft-updates)
last pid: 17437; load averages: 0.56, 0.58, 0.52 up 8+15:37:36 02:54:57
72 processes: 1 running, 67 sleeping, 4 zombie
Mem: 167M Active, 1472M Inact, 289M Wired, 11M Cache, 211M Buf, 21M Free
Swap: 4096M Total, 2470M Used, 1626M Free, 60% Inuse
PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND
42616 proxy 17 20 0 3173M 676M uwait 20:50 0.00% squid
3668 nobody 1 20 0 23164K 4624K select 5:51 0.00% darkstat
19251 root 1 20 0 12464K 1840K select 2:53 0.00% apinger
23730 root 1 20 0 54892K 6500K kqread 2:12 0.00% lighttpd
14439 root 1 20 0 16812K 1864K bpf 1:17 0.00% filterlog
68981 root 1 20 0 14676K 2020K select 0:37 0.00% syslogd
245 root 1 20 0 219M 8992K kqread 0:29 0.00% php-fpm
73747 _pflogd 1 20 0 14752K 1972K bpf 0:29 0.00% pflogd
76524 _pflogd 1 20 0 14752K 1968K bpf 0:28 0.00% pflogd
90146 _spamd 1 20 0 23004K 3632K bpf 0:19 0.00% spamlogd
91262 _spamd 1 20 0 23004K 3632K bpf 0:18 0.00% spamlogd
72923 root 1 24 0 17144K 704K wait 0:11 0.00% sh
19494 root 1 20 0 28332K 2016K piperd 0:07 0.00% rrdtool
42954 root 1 20 0 16672K 1192K nanslp 0:02 0.00% cron
3978 nobody 1 20 0 19068K 1220K sbwait 0:01 0.00% darkstat
262 root 1 40 20 19032K 1028K kqread 0:01 0.00% check_reload_status
276 root 1 20 0 13164K 364K select 0:00 0.00% devd
67782 root 17 20 0 207M 7084K uwait 0:00 0.00% charon
bzip2: I/O or other error, bailing out. Possible reason follows.
bzip2: No space left on device
Input file = (stdin), output file = (stdout)
tar: Failed to set default locale
x ./tmp/pre_upgrade_command
Firmware upgrade in progress…
Content-type: text/html
Installing /root/latest.tgz.
tar: Failed to set default locale
./usr/local/lib/ipsec/libstrongswan.so.0: Write failed
./usr/local/lib/ipsec/libhydra.so.0: Write to restore size failed
./usr/local/lib/ipsec/libcharon.so.0: Write to restore size failed
./usr/local/lib/ipsec/libcharon.so: Write to restore size failed
./usr/local/lib/ipsec/libhydra.so: Write to restore size failed
./usr/local/lib/ipsec/libradius.so: Write to restore size failed
./usr/local/lib/ipsec/libradius.so.0: Write to restore size failed
./usr/local/lib/ipsec/libsimaka.so: Write to restore size failed
./usr/local/lib/ipsec/libsimaka.so.0: Write to restore size failed
./usr/local/lib/ipsec/libstrongswan.so: Write to restore size failed
./usr/local/lib/ipsec/libtls.so: Write to restore size failed
./usr/local/lib/ipsec/libtls.so.0: Write to restore size failed
./usr/local/lib/ipsec/plugins/: Write to restore size failed
./usr/local/lib/ipsec/plugins/libstrongswan-addrblock.so: Write to restore size failed
./usr/local/lib/ipsec/plugins/libstrongswan-aes.so: Write to restore size failed
./usr/local/lib/ipsec/plugins/libstrongswan-attr.so: Write to restore size failed
./usr/local/lib/ipsec/plugins/libstrongswan-blowfish.so: Write to restore size failed
./usr/local/lib/ipsec/plugins/libstrongswan-cmac.so: Write to restore size failed
./usr/local/lib/ipsec/plugins/libstrongswan-constraints.so: Write to restore size failed
./usr/local/lib/ipsec/plugins/libstrongswan-curl.so: Write to restore size failed
./usr/local/lib/ipsec/plugins/libstrongswan-des.so: Write to restore size failed
./usr/local/lib/ipsec/plugins/libstrongswan-dnskey.so: Write to restore size failed
./usr/local/lib/ipsec/plugins/libstrongswan-eap-aka-3gpp2.so: Write to restore size failed
-
And what solution are you expecting? Get a bigger drive. Or don't back up junk like squid cache.
-
I'm expecting the upgrade not to kill the box. This was my first pfSense upgrade. The box had only been running for a week or so, so I just rebuilt the VM and gave it a bigger drive, which gave me the opportunity to do a cleaner install the second time around.
Obviously I should have checkpointed the VM before the upgrade; lesson learned. But the auto-upgrade page had a checkbox saying something along the lines of "perform a full backup prior to upgrade". That seemed like the safest option, BUT instead it killed the system. If I had opted out of the backup I would have been fine.
The backup failed because the drive was full, yet the update proceeded to kill the installation without doing any sanity checks.
This post was more of a warning to others and should probably be a bug report, but I didn't want to file one without posting here first in case it was a known issue.
-
Also, just pointing out that the only reason I gave the VM any additional space was to store the squid cache. As a newbie, that seemed like a reasonable thing to do.
-
Submitted a bug report. https://redmine.pfsense.org/issues/4549
-
Yeah - a full HD is quite a bug.
Next time, back up the configuration and save it on a desktop somewhere (not the backup-while-upgrading button).
Then make a new VM, install a fresh pfSense, and restore the config.
-
Filling up the hard drive during a backup isn't the bug. The bug is extracting the update while the drive is full and wiping out the kernel, leaving the machine unable to boot. The drive had 40% free space when I started the upgrade.
I've never seen another OS start an upgrade without verifying capacity: Windows, Ubuntu, Android, iOS, and macOS all do sanity checks. It's not asking that much. I'm even willing to work on the bug myself and submit a patch.
Dismissing this out of hand doesn't help. Remember, I used the UI to kick off the upgrade and chose what appeared to be the conservative path by creating a backup first. That left the machine in an unusable state, which is completely unacceptable.
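For what it's worth, the kind of pre-flight check I have in mind is tiny. Here's a rough sketch for a full (non-nano) install; the function name and the 3x-unpack/2x-backup multipliers are my own guesses, not anything from rc.firmware:

```shell
#!/bin/sh
# Hypothetical capacity check before starting an upgrade (illustrative,
# not actual pfSense code). Estimates whether the target filesystem can
# hold the unpacked upgrade plus the pre-upgrade backup.
has_room_for_upgrade() {
    image=$1 target_fs=$2
    image_kb=$(du -k "$image" | awk '{print $1}')
    # Rough assumptions: unpacked tree ~3x the compressed tarball,
    # doubled again to leave room for the full backup.
    needed_kb=$((image_kb * 3 * 2))
    avail_kb=$(df -Pk "$target_fs" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge "$needed_kb" ]
}

# Example: has_room_for_upgrade /root/latest.tgz / || exit 1
```

Even a crude estimate like this would have aborted my upgrade up front instead of leaving the box unbootable.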
-
If you have a lot of access to the VM, always make a backup of the config to your desktop somewhere so that if you hit a snag you are not buggered.
-
Yeah, I'll definitely be checkpointing in the future. I've looked through the code and found a couple of things:
-
Doing an upgrade on a nano install does check that the image is not larger than the partition (but does not check free space), around line 208: https://github.com/pfsense/pfsense/blob/RELENG_2_2/etc/rc.firmware
-
The pre-upgrade command does not check for free space, and the last thing it does is remove all kernels, on line 50: https://github.com/pfsense/pfsense/blob/RELENG_2_2/tmp/pre_upgrade_command
So what happens is: the backup fills the drive, all kernels are deleted, and then the upgrade archive is extracted. Extraction eventually consumes the space freed by removing the original kernels before the new kernel can be written.
Fixing this is complicated by the fact that nano installs may not have enough free space to extract the entire archive; they rely on overwriting files in place, so requiring free space for the whole archive could block upgrades that would otherwise complete.
I think the easiest workaround may be to modify pre_upgrade_command to extract the new kernel into /tmp, verify it is good, then remove the old kernels and move the new /tmp/kernel into /boot/kernel/. Then if extraction runs out of space you may end up with mismatched files, but you will at least have a bootable kernel, so you can manually free space and re-extract the archive.
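Roughly, the reordering I'm proposing would look something like this (a sketch only; `stage_kernel` and the paths are illustrative, not the shipped pre_upgrade_command):

```shell
#!/bin/sh
# Sketch of a safer ordering for the kernel swap (my proposal, not the
# actual pfSense script): stage and verify the new kernel BEFORE
# deleting the old one, so the box always has a bootable kernel.
stage_kernel() {
    archive=$1 destroot=$2 stage=$(mktemp -d)
    # 1. Extract only the kernel directory from the upgrade archive
    #    into a staging area, leaving the live kernel untouched.
    tar -xzf "$archive" -C "$stage" ./boot/kernel 2>/dev/null || {
        echo "ABORT: could not stage new kernel; old kernel untouched" >&2
        return 1
    }
    # 2. Verify a non-empty kernel file actually came out.
    [ -s "$stage/boot/kernel/kernel" ] || {
        echo "ABORT: staged kernel missing or empty" >&2
        return 1
    }
    # 3. Only now retire the old kernel and move the staged one in.
    rm -rf "$destroot/boot/kernel.old"
    [ -d "$destroot/boot/kernel" ] && mv "$destroot/boot/kernel" "$destroot/boot/kernel.old"
    mv "$stage/boot/kernel" "$destroot/boot/kernel"
}
```

If the archive is truncated or the drive fills mid-extraction, the function bails out at step 1 or 2 with the original kernel still in place, which is exactly the failure mode that bit me.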
Thoughts?
-
-
Verifying free space on nanobsd would be utterly pointless. The entire slice is overwritten.