Absolute Worst Experience Trying to Upgrade to 2.4.4_3



  • The attempt to upgrade from 2.4.4_2 to 2.4.4_3 on an APU1 (AMD G-T40E, mSATA SSD) failed with several fatal problems.

    1. After the unit rebooted following the first step of the upgrade, it produced a page fault:
    Fatal trap 12: page fault while in kernel mode
    cpuid = 1; apic id = 01
    fault virtual address   = 0x800e29000
    fault code              = supervisor write data, page not present
    instruction pointer     = 0x20:0xffffffff81189dde
    stack pointer           = 0x28:0xfffffe0093dd3940
    frame pointer           = 0x28:0xfffffe0093dd3940
    code segment            = base 0x0, limit 0xfffff, type 0x1b
                            = DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags        = interrupt enabled, resume, IOPL = 0
    current process         = 748 (logger)
    

    This came right after some messages about a microcode update. The crash also corrupted the root file system beyond repair. I reinstalled from scratch, which went fine.

    2. Trying to restore the previous config resulted in a broken system

    This is even worse. Every attempt to load the configuration I had saved a couple of hours earlier on the running 2.4.4_2 system failed and permanently damaged the freshly installed system. After that, the web configurator would only spit out error messages in my browser and did not recover.

    Fatal error: Uncaught Exception: XML error: RRDDATA at line 3995 cannot occur more than once in /etc/inc/xmlparse.inc:87 Stack trace: #0 [internal function]: startElement(Resource id #6, 'RRDDATA', Array) #1 /etc/inc/xmlparse.inc(186): xml_parse(Resource id #6, '/UH252lklZZqR63...', false) #2 /etc/inc/xmlparse.inc(147): parse_xml_config_raw('/conf/config.xm...', Array, 'false') #3 /etc/inc/config.lib.inc(132): parse_xml_config('/conf/config.xm...', Array) #4 /etc/inc/config.inc(159): parse_config() #5 /etc/inc/gwlb.inc(23): require_once('/etc/inc/config...') #6 /etc/inc/functions.inc(33): require_once('/etc/inc/gwlb.i...') #7 /etc/inc/notices.inc(24): require_once('/etc/inc/functi...') #8 /etc/inc/config.gui.inc(37): require_once('/etc/inc/notice...') #9 /etc/inc/auth.inc(31): require_once('/etc/inc/config...') #10 /etc/inc/authgui.inc(25): include_once('/etc/inc/auth.i...') #11 /usr/local/www/guiconfig.inc(51): require_once('/etc/inc/authgu...') #12 /usr/local/www/index.php(44): require_once('/usr/local/www/...') #13 {m in /etc/inc/xmlparse.inc on line 87 PHP ERROR: Type: 1, File: /etc/inc/xmlparse.inc, Line: 87, Message: Uncaught Exception: XML error: RRDDATA at line 3995 cannot occur more than once in /etc/inc/xmlparse.inc:87 Stack trace: #0 [internal function]: startElement(Resource id #6, 'RRDDATA', Array) #1 /etc/inc/xmlparse.inc(186): xml_parse(Resource id #6, '/UH252lklZZqR63...', false) #2 /etc/inc/xmlparse.inc(147): parse_xml_config_raw('/conf/config.xm...', Array, 'false') #3 /etc/inc/config.lib.inc(132): parse_xml_config('/conf/config.xm...', Array) #4 /etc/inc/config.inc(159): parse_config() #5 /etc/inc/gwlb.inc(23): require_once('/etc/inc/config...') #6 /etc/inc/functions.inc(33): require_once('/etc/inc/gwlb.i...') #7 /etc/inc/notices.inc(24): require_once('/etc/inc/functi...') #8 /etc/inc/config.gui.inc(37): require_once('/etc/inc/notice...') #9 /etc/inc/auth.inc(31): require_once('/etc/inc/config...') #10 /etc/inc/authgui.inc(25): 
include_once('/etc/inc/auth.i...') #11 /usr/local/www/guiconfig.inc(51): require_once('/etc/inc/authgu...') #12 /usr/local/www/index.php(44): require_once('/usr/local/www/...') #13 {m
    

    I mean, even if an uploaded config file had been corrupt (which it wasn't), it must never damage the running system (think of DoS attacks, etc.). It could simply say that the file was corrupt and refuse to load it; this seems rather obvious, doesn't it?
    The problem turned out to be the long rrd section in the file; after I edited that out, I could finally load the config.
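The kind of pre-restore check asked for above could be sketched as follows. This is only an illustration in Python, not how pfSense validates uploads: the function names are made up, and it assumes the rrddata sections sit directly under the root element of the config file.

```python
import xml.etree.ElementTree as ET


def count_top_level(xml_text, tag="rrddata"):
    """Count direct children of the root element that carry the given tag."""
    root = ET.fromstring(xml_text)
    return sum(1 for child in root if child.tag == tag)


def check_before_restore(xml_text):
    """Return a list of problems; an empty list means nothing was flagged."""
    if not xml_text.strip():
        return ["file is empty"]
    try:
        n = count_top_level(xml_text)
    except ET.ParseError as exc:
        return ["not well-formed XML: %s" % exc]
    if n > 1:
        return ["rrddata occurs %d times, expected at most once" % n]
    return []


# Made-up miniature config with a duplicated section (hypothetical content).
sample = "<pfsense><system/><rrddata/><rrddata/></pfsense>"
print(check_before_restore(sample))
```

A restore routine built this way would refuse the upload and report the problem list instead of overwriting the running configuration.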

    The system isn't very robust when it comes to the config file in general. This is horrible, too:

    Enter an option: 4
    
    Config.xml is corrupted and is 0 bytes.  Could not restore a previous backup.Config.xml is corrupted and is 0 bytes.  Could not restore a previous backup.
     0) Logout (SSH only)                  9) pfTop
     1) Assign Interfaces                 10) Filter Logs
     2) Set interface(s) IP address       11) Restart webConfigurator
     3) Reset webConfigurator password    12) PHP shell + pfSense tools
     4) Reset to factory defaults         13) Update from console
     5) Reboot system                     14) Enable Secure Shell (sshd)
     6) Halt system                       15) Restore recent configuration
     7) Ping host                         16) Restart PHP-FPM
     8) Shell
    

    So it doesn't seem to keep a basic factory config somewhere for such cases. Restoring to factory defaults should always work, right?

    Roman



  • @rmaeder said in Absolute Worst Experience Trying to Upgrade to 2.4.4_3:

    I mean, even if an uploaded config file had been corrupt (which it wasn't),

    But :

    @rmaeder said in Absolute Worst Experience Trying to Upgrade to 2.4.4_3:

    The problem turned out to be the long rrd section in the file

    so it was a corrupt config file. The fact that it is huge (mine are 720 MB) is normal: that happens when RRD files are included.
    But it shouldn't contain multiple identical sections, as in "RRDDATA at line 3995 cannot occur more than once" (the same RRD info).

    This can happen if the files on disk are in bad shape.

    Restart pfSense using the console, stop the boot at the first menu, and run "fsck" (see the manual for all this).
    I guess you will find issues with the disk structure.
    Running Windows or any other OS on a disk in the same shape will produce the same results.

    @rmaeder said in Absolute Worst Experience Trying to Upgrade to 2.4.4_3:

    Restoring to factory defaults should always work, right?

    This is NOT a desktop OS !!!!

    Default settings exist, but not for assigning which interface goes to which NIC.
    Think about it for a minute and you'll understand why.
    Hint: this is a firewall.


  • Rebel Alliance Developer Netgate

    Do you still have the full backtrace for that crash you posted? There isn't enough to go on to accurately speculate about causes without that.

    That said, from the general description of symptoms and the other errors I'd say you have a disk problem.



  • @Gertjan said in Absolute Worst Experience Trying to Upgrade to 2.4.4_3:

    @rmaeder said in Absolute Worst Experience Trying to Upgrade to 2.4.4_3:

    I mean, even if an uploaded config file had been corrupt (which it wasn't),

    But :

    @rmaeder said in Absolute Worst Experience Trying to Upgrade to 2.4.4_3:

    The problem turned out to be the long rrd section in the file

    so it was a corrupt config file. The fact that it is huge (mine are 720 MB) is normal: that happens when RRD files are included.
    But it shouldn't contain multiple identical sections, as in "RRDDATA at line 3995 cannot occur more than once" (the same RRD info).

    This can happen if the files on disk are in bad shape.

    My config backups were only about 6MB.
    But the fact that pfSense can produce backups it cannot read later on is really bad, isn't it?
    The fact that it contains two rrd sections shouldn't be fatal.

    Restart pfSense using the console, stop the boot at the first menu, and run "fsck" (see the manual for all this).
    I guess you will find issues with the disk structure.
    Running Windows or any other OS on a disk in the same shape will produce the same results.

    @rmaeder said in Absolute Worst Experience Trying to Upgrade to 2.4.4_3:

    Restoring to factory defaults should always work, right?

    This is NOT a desktop OS !!!!

    Default settings exist, but not for assigning which interface goes to which NIC.
    Think about it for a minute and you'll understand why.
    Hint: this is a firewall.

    Closed firewall/router products usually have a button you can press with a paper clip that restores things.
    For pfSense this isn't feasible, I admit, and we could just consider a reinstall to be some kind of factory reset. I wouldn't have made a fuss about it if the backup had worked.

    Roman



  • @jimp said in Absolute Worst Experience Trying to Upgrade to 2.4.4_3:

    Do you still have the full backtrace for that crash you posted? There isn't enough to go on to accurately speculate about causes without that.

    The terminal program I used to talk to the serial console of the unit didn't support scrolling, so I don't have more than I posted.

    That said, from the general description of symptoms and the other errors I'd say you have a disk problem.

    Whatever it was, I eventually reinstalled 2.4.4_3 on both of my units and restored the hand-edited backup (without the rrd sections). I just downloaded a config backup after about two days of operation, and it contains a single <rrddata> section with a couple of (distinct) files.

    From now on, I'll always make two backups, one without the rrd data, just in case.
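Producing the second, rrd-free copy from a full backup could be sketched like this. It is a hedged Python illustration with a hypothetical function name, assuming the sections are direct children of the root element. Note that ElementTree drops the XML declaration and comments and rewrites CDATA sections as escaped text, so the output should be treated as a derived copy rather than a byte-identical backup.

```python
import xml.etree.ElementTree as ET


def strip_rrddata(xml_text, tag="rrddata"):
    """Return the config XML with every top-level <rrddata> section removed."""
    root = ET.fromstring(xml_text)
    # Iterate over a copy of the child list because we mutate the root.
    for child in list(root):
        if child.tag == tag:
            root.remove(child)
    return ET.tostring(root, encoding="unicode")


# Made-up miniature config with two rrddata sections (hypothetical content).
full = (
    "<pfsense><system><hostname>fw</hostname></system>"
    "<rrddata><rrddatafile/></rrddata><rrddata/></pfsense>"
)
slim = strip_rrddata(full)
print(slim)
```

The same function also handles the duplicated-section case, since it removes every matching section rather than only the first.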

    Roman


  • Netgate Administrator

    Did it actually have more than one rrddata section in it?
    There was a bug for that but it was fixed some time ago: https://redmine.pfsense.org/issues/8994

    I agree; I would expect it to be able to restore the factory settings from any config. The default config is stored in /conf.default/config.xml. Unless that was also damaged, I can only think that the lack of a current good config or any backup configs (other than the default) caused it to hit an error before it actually restored the default.
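The fallback order described here (current config, then backups, then the default) could be sketched as below. This is a hypothetical Python illustration of the idea, not pfSense's actual restore logic; the helper names and the demo file names are assumptions.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def is_usable(path):
    """True if the file exists, is non-empty, and parses as XML."""
    try:
        if Path(path).stat().st_size == 0:
            return False
        ET.parse(str(path))
    except (OSError, ET.ParseError):
        return False
    return True


def pick_config(candidates):
    """Return the first usable candidate as a Path, or None if none qualifies."""
    for candidate in candidates:
        if is_usable(candidate):
            return Path(candidate)
    return None


# Demo with throwaway files: an empty current config, a missing backup,
# and a minimal default config.
with tempfile.TemporaryDirectory() as d:
    current = Path(d) / "config.xml"
    current.write_text("")
    missing_backup = Path(d) / "backup.xml"
    default = Path(d) / "default.xml"
    default.write_text("<pfsense/>")
    chosen_name = pick_config([current, missing_backup, default]).name

print(chosen_name)
```

The point of the design is that a zero-byte current config and a missing backup are both treated as ordinary failures, so the search still reaches the default instead of aborting early.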

    Steve



  • @stephenw10 said in Absolute Worst Experience Trying to Upgrade to 2.4.4_3:

    Did it actually have more than one rrddata section in it?
    There was a bug for that but it was fixed some time ago: https://redmine.pfsense.org/issues/8994

    I've looked at all my older backups. Here is the situation:

    A backup of 20181204 (before upgrading to 2.4.4_1) contains just one rrddata section.
    On 20181211 I replaced the SD card with an mSATA SSD and reinstalled pfSense. The backup made before that also contains one rrddata section.

    However, a backup I made on 20190109, before upgrading to 2.4.4_2, contains two such sections. The upgrade went fine, so I never used that backup.
    All backups since then contain two rrddata sections, which I didn't notice until I tried to load one after the failed 2.4.4_3 upgrade.

    The symptoms are similar to what the bug mentions, but that should have been fixed in 2.4.4_1?

    Roman



  • @rmaeder said in Absolute Worst Experience Trying to Upgrade to 2.4.4_3:

    but that should have been fixed in 2.4.4_1?

    Yes, it was, but the fix didn't clean up your existing config file.

    Open a console, choose option 8 (Shell), and run

    viconfig
    

    Add an empty element:

    	<donothing>
    	</donothing>
    

    Then save the config (:wq).
    From now on, it will be part of your config file and will persist there.

    Note that reading in the existing config while booting doesn't break things, but importing does (when a double rrddata exists).

    Also: https://github.com/pfsense/pfsense/blob/master/src/usr/local/www/diag_backup.php#L204 looks nice, but I guess (again, I'm just thinking out loud here) that a second <rrddata></rrddata> isn't removed. So a config backup made without RRD data always contained an empty rrddata section, and restoring the actual RRD data files to disk while booting pfSense silently failed.
    Then actually export the RRD data:

    [screenshot not preserved]

    and you wind up with two rrddata sections again, and that is what went wrong.

