Backup before upgrade fills hard drive, can't find kernel



  • After I attempted an upgrade from 2.2 to 2.2.1, the system was left unable to boot.  I had a 12GB drive with several GB allocated to squid cache.  The auto-backup failed, leaving the hard drive full.  As the logs below show, the upgrade appears to have continued after the backup failed.

    [2.2-RELEASE][root@pfSense.localdomain]/tmp/hdrescue/cf/conf: df -h /dev/ada*
    df: /dev/ada0: Invalid argument
    df: /dev/ada0s1: Invalid argument
    df: /dev/ada0s1b: Invalid argument
    Filesystem      Size    Used  Avail Capacity  Mounted on
    /dev/ada0s1a    12G    11G  -241M  102%    /tmp/hdrescue
    [2.2-RELEASE][root@pfSense.localdomain]/tmp/hdrescue/cf/conf:

    --------------------------------------------------------------------------------------------------------------------------------------------------------------------

    [2.2-RELEASE][root@pfSense.localdomain]/tmp/hdrescue/cf/conf: less firmware_update_misc_log.txt

    tar: Failed to set default locale

    bzcat: Compressed file ends unexpectedly;
            perhaps it is corrupted?  Possible reason follows.
    bzcat: No such file or directory
            Input file = /tmp/chflags.dist.usr.bz2, output file = (stdout)

    It is possible that the compressed file(s) have become corrupted.
    You can use the -tvv option to test integrity of such files.

    You can use the `bzip2recover' program to attempt to recover
    data from undamaged sections of corrupted files.

    shutdown: [pid 61121]
    firmware_update_misc_log.txt (END)

    --------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [2.2-RELEASE][root@pfSense.localdomain]/tmp/hdrescue/cf/conf: less upgrade_log.txt

    pfSenseupgrade upgrade starting

    Sat Mar 21 02:54:56 EDT 2015

    -rw-r--r--  1 root  wheel    82M Mar 21 02:17 /root/latest.tgz

    MD5 (/root/latest.tgz) = 0dffafc9f3f815bc0cdba775b62ccdaf

    /dev/ada0s1a on / (ufs, local)
    devfs on /dev (devfs, local)
    /dev/md0 on /var/run (ufs, local)
    devfs on /var/dhcpd/dev (devfs, local)
    fdescfs on /dev/fd (fdescfs)
    /dev/md10 on /var/tmp/havpRAM (ufs, local, soft-updates)

    last pid: 17437;  load averages:  0.56,  0.58,  0.52  up 8+15:37:36    02:54:57
    72 processes:  1 running, 67 sleeping, 4 zombie

    Mem: 167M Active, 1472M Inact, 289M Wired, 11M Cache, 211M Buf, 21M Free
    Swap: 4096M Total, 2470M Used, 1626M Free, 60% Inuse

    PID USERNAME  THR PRI NICE  SIZE    RES STATE    TIME    WCPU COMMAND
    42616 proxy      17  20    0  3173M  676M uwait  20:50  0.00% squid
    3668 nobody      1  20    0 23164K  4624K select  5:51  0.00% darkstat
    19251 root        1  20    0 12464K  1840K select  2:53  0.00% apinger
    23730 root        1  20    0 54892K  6500K kqread  2:12  0.00% lighttpd
    14439 root        1  20    0 16812K  1864K bpf      1:17  0.00% filterlog
    68981 root        1  20    0 14676K  2020K select  0:37  0.00% syslogd
      245 root        1  20    0  219M  8992K kqread  0:29  0.00% php-fpm
    73747 _pflogd    1  20    0 14752K  1972K bpf      0:29  0.00% pflogd
    76524 _pflogd    1  20    0 14752K  1968K bpf      0:28  0.00% pflogd
    90146 _spamd      1  20    0 23004K  3632K bpf      0:19  0.00% spamlogd
    91262 _spamd      1  20    0 23004K  3632K bpf      0:18  0.00% spamlogd
    72923 root        1  24    0 17144K  704K wait    0:11  0.00% sh
    19494 root        1  20    0 28332K  2016K piperd  0:07  0.00% rrdtool
    42954 root        1  20    0 16672K  1192K nanslp  0:02  0.00% cron
    3978 nobody      1  20    0 19068K  1220K sbwait  0:01  0.00% darkstat
      262 root        1  40  20 19032K  1028K kqread  0:01  0.00% check_reload_status
      276 root        1  20    0 13164K  364K select  0:00  0.00% devd
    67782 root      17  20    0  207M  7084K uwait    0:00  0.00% charon

    bzip2: I/O or other error, bailing out.  Possible reason follows.
    bzip2: No space left on device
            Input file = (stdin), output file = (stdout)
    tar: Failed to set default locale
    x ./tmp/pre_upgrade_command
    Firmware upgrade in progress…
    Content-type: text/html

    Installing /root/latest.tgz.
    tar: Failed to set default locale
    ./usr/local/lib/ipsec/libstrongswan.so.0: Write failed
    ./usr/local/lib/ipsec/libhydra.so.0: Write to restore size failed
    ./usr/local/lib/ipsec/libcharon.so.0: Write to restore size failed
    ./usr/local/lib/ipsec/libcharon.so: Write to restore size failed
    ./usr/local/lib/ipsec/libhydra.so: Write to restore size failed
    ./usr/local/lib/ipsec/libradius.so: Write to restore size failed
    ./usr/local/lib/ipsec/libradius.so.0: Write to restore size failed
    ./usr/local/lib/ipsec/libsimaka.so: Write to restore size failed
    ./usr/local/lib/ipsec/libsimaka.so.0: Write to restore size failed
    ./usr/local/lib/ipsec/libstrongswan.so: Write to restore size failed
    ./usr/local/lib/ipsec/libtls.so: Write to restore size failed
    ./usr/local/lib/ipsec/libtls.so.0: Write to restore size failed
    ./usr/local/lib/ipsec/plugins/: Write to restore size failed
    ./usr/local/lib/ipsec/plugins/libstrongswan-addrblock.so: Write to restore size failed
    ./usr/local/lib/ipsec/plugins/libstrongswan-aes.so: Write to restore size failed
    ./usr/local/lib/ipsec/plugins/libstrongswan-attr.so: Write to restore size failed
    ./usr/local/lib/ipsec/plugins/libstrongswan-blowfish.so: Write to restore size failed
    ./usr/local/lib/ipsec/plugins/libstrongswan-cmac.so: Write to restore size failed
    ./usr/local/lib/ipsec/plugins/libstrongswan-constraints.so: Write to restore size failed
    ./usr/local/lib/ipsec/plugins/libstrongswan-curl.so: Write to restore size failed
    ./usr/local/lib/ipsec/plugins/libstrongswan-des.so: Write to restore size failed
    ./usr/local/lib/ipsec/plugins/libstrongswan-dnskey.so: Write to restore size failed
    ./usr/local/lib/ipsec/plugins/libstrongswan-eap-aka-3gpp2.so: Write to restore size failed


  • And what solution are you expecting? Get a bigger drive. Or don't back up junk like squid cache.



  • I'm expecting the upgrade to not kill the box. This was my first upgrade of pfSense. It had only been running for a week or so, so I just rebuilt the VM and gave it a bigger drive.  This gave me the opportunity to perform a cleaner install the second time around.

    Obviously I should have checkpointed the VM before the upgrade; lesson learned. But when doing the auto-upgrade, there was a checkbox saying something along the lines of "perform a full backup prior to upgrade". This seemed like the safest option, but instead it killed the system. If I had opted out of the backup, I would have been fine.

    The backup failed because the drive was full, yet the update proceeded to kill the installation without doing any sanity checks.

    This post was more of a warning to others and should probably be a bug report but I didn't want to file one without posting here first in case it was a known issue.



  • Also, just pointing out that the only reason I gave the VM any additional space was to store the squid cache. As a newbie, that seems like a reasonable thing to do.



  • Submitted a bug report. https://redmine.pfsense.org/issues/4549



  • Yeah, a full HD is quite a bug.

    Next time, back up the configuration and save it on a desktop somewhere (don't rely on the backup-while-upgrading button).

    Then make a new VM, install a fresh pfSense, and restore the config.



  • Filling up the hard drive during a backup isn't the bug. The bug is trying to extract the update while the drive is full and wiping out the kernel, leaving the machine unable to boot. The drive had 40% free space when I started the upgrade.

    I've never seen another OS start an upgrade without verifying capacity. Windows, Ubuntu, Android, iOS, and macOS all do sanity checks. It's not asking that much. I'm even willing to work on the bug myself and submit a patch.

    Dismissing this out of hand doesn't help.  Remember, I used the UI to kick off the upgrade and chose what appeared to be the conservative path by creating a backup first.  This left the machine in an unusable state.  That is completely unacceptable.



  • If you have a lot of access to the VM, always make a backup of the config on your desktop somewhere so that if you hit a snag you are not buggered.



  • Yeah, I'll definitely be checkpointing in the future.  I've looked through the code and found a couple of things:

    What happens is this: the backup fills the drive, all kernels are deleted, and then the update archive is extracted.  The extraction eventually fills the space that was freed by removing the original kernels before the new kernel can be extracted.

    Fixing this is complicated by the fact that nano installs may not have free space to extract the entire archive; they rely on overwriting files. Checking for enough free space to fit the entire archive may therefore block upgrades that could otherwise complete.
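
    For illustration, a whole-archive free-space check could look roughly like the Python sketch below. It is not pfSense code: the function names and the reserve_bytes margin are made up for this example, and it assumes the update is an ordinary tar archive like /root/latest.tgz in the log.

```python
import shutil
import tarfile


def archive_payload_bytes(archive_path):
    """Sum the uncompressed sizes of the regular files in a tar archive."""
    with tarfile.open(archive_path, "r:*") as tar:
        return sum(member.size for member in tar if member.isfile())


def has_room(needed_bytes, free_bytes, reserve_bytes=0):
    """True if the payload fits and at least reserve_bytes would remain free."""
    return needed_bytes + reserve_bytes <= free_bytes


def can_extract(archive_path, target_dir, reserve_bytes=0):
    """Compare the archive's payload with the free space under target_dir.

    Pessimistic on purpose: files the archive merely overwrites are counted
    as if they were new, so this can refuse an upgrade that would have fit.
    """
    needed = archive_payload_bytes(archive_path)
    free = shutil.disk_usage(target_dir).free
    return has_room(needed, free, reserve_bytes)
```

    Because overwritten files are counted as new data, a check like this is stricter than it needs to be, which is exactly the drawback for installs that depend on overwriting in place.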

    I think the easiest way to work around this may be to modify pre_upgrade_command to extract the new kernel into /tmp, verify it is good, then remove old kernels and mv the new /tmp/kernel to /boot/kernel/.  Then if you run out of space extracting the upgrade, you may have mismatched files but you will at least have a kernel file that can boot so you can manually free space and re-extract the archive.
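
    The order of operations could be sketched like this in Python. This is only an illustration of the idea, not the contents of pre_upgrade_command: the function names are hypothetical, the member name of the kernel inside the archive is whatever the caller passes in, and "verify it is good" is reduced here to a size check against the tar header, which is an assumption and a fairly weak test.

```python
import os
import shutil
import tarfile


def stage_kernel(archive_path, member_name, staging_dir):
    """Extract one archive member (the new kernel) into staging_dir.

    Raises before anything on the live system is touched if the member is
    missing, is not a regular file, or could not be written out completely.
    """
    with tarfile.open(archive_path, "r:*") as tar:
        member = tar.getmember(member_name)      # KeyError if not in the archive
        src = tar.extractfile(member)
        if src is None:
            raise ValueError(f"{member_name} is not a regular file")
        staged = os.path.join(staging_dir, os.path.basename(member_name))
        with src, open(staged, "wb") as dst:
            shutil.copyfileobj(src, dst)         # OSError here if the disk is full
    if os.path.getsize(staged) != member.size:
        raise OSError(f"staged kernel is truncated: {staged}")
    return staged


def swap_kernel(staged_path, kernel_path):
    """Remove the old kernel only after staging succeeded, then move the new one in."""
    if os.path.exists(kernel_path):
        os.remove(kernel_path)
    shutil.move(staged_path, kernel_path)        # a plain rename on the same filesystem
```

    If staging fails because the disk is already full, the exception is raised while the old kernel is still in place, so the box can still boot.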

    Thoughts?


  • Verifying free space on nanobsd would be utterly pointless. The entire slice is overwritten.

