Multiple update issues going from 2.2.2-Release to 2.3


  • Hello all,

    We are having a lot of issues upgrading from 2.2.2 to 2.3. We are attempting this on two different pfSense appliances and seeing these issues on both a Netgate SG-8860 and a C2758.

    1. Loss of all Virtual IPs that were of type ip_alias: [SOLVED - but the upgrade is buggy, or this needs to be documented]

    We have an HA setup with a shared CARP VIP and rely heavily on IP aliases to make 1:1 NAT work. After the update everything appeared fine, but only because of the upstream ARP cache. After 4-6 hrs everything broke. We finally figured out what was happening when we realized the GUI did not show any of the Virtual IPs bound to an interface.

    Previously the config looked like this

    	 <vip><mode>carp</mode>
    			<interface>wan</interface>
    			<vhid>212</vhid>
    			<advskew>0</advskew>
    			<advbase>1</advbase>
    
    			<type>single</type>
    			<subnet_bits>24</subnet_bits>
    			<subnet>AA.BB.CC.1</subnet></vip> 
    
    		 <vip><mode>ipalias</mode>
    			<interface>wan_vip212</interface>
    
    			<type>single</type>
    			<subnet_bits>24</subnet_bits>
    			<subnet>AA.BB.CC.14</subnet></vip> 
    
    

    After the update, it looks like this

    
    	 <vip><mode>carp</mode>
    			<interface>wan</interface>
    			<vhid>212</vhid>
    			<advskew>0</advskew>
    			<advbase>1</advbase>
    
    			<type>single</type>
    			<subnet_bits>24</subnet_bits>
    			<subnet>AA.BB.CC.1</subnet>
    			<uniqid>571a5af49dfaf</uniqid></vip> 
    		 <vip><mode>ipalias</mode>
    			<interface>wan_vip212</interface>
    
    			<type>single</type>
    			<subnet_bits>24</subnet_bits>
    			<subnet>AA.BB.CC.14</subnet></vip> 
    
    

    With that in place, the GUI shows that the ipalias is not assigned to any interface, so it is not active. If you edit the entry and use the dropdown to assign an interface (WAN VIP), the XML changes to

    
    		 <vip><mode>ipalias</mode>
    			<interface>_vip571a5af49dfaf</interface>
    
    			<type>single</type>
    			<subnet_bits>24</subnet_bits>
    			<subnet>AA.BB.CC.14</subnet>
    			<uniqid>571a5af49e2ed</uniqid></vip> 
    
    

    So basically, for each ipalias the <interface> value needs to change from the old naming scheme (wan_vip212) to the new uniqid-based one (_vip571a5af49dfaf).

    2. What I assume is similar to #1: some naming scheme changed in the gateway or DNS config … currently I cannot resolve DNS.

    If I go to System -> General -> DNS Server Settings, the dropdown to assign a gateway to each DNS server shows 'none' and has no other entries.

    And yes we do have a gateway defined.

    3. Possibly related to that, our throughput has dropped from 90 Mbps to 300 Kbps.

    4. Possibly related to #2 … we randomly/sporadically get DNS rebind warnings.

    5. We are also having issues syncing between the primary and secondary, but we will worry about that once we have at least one unit in good shape.

    It seems pretty thoroughly hosed, and at this point we are considering a reset to factory defaults, or just a clean install of the previous version to start from scratch.


  • The IP Alias upgrade bug got me too, except I upgraded the HA pair from pfSense 2.1.5.  I did a backup site first, though, and discovered it there.  They should probably create a note in the stickies in the Installation and Upgrades section about it.  To fix it I just assigned each IP Alias to the correct CARP interface and saved; I assume you did the same thing.

    I don't know about 2, 3, and 4, and haven't tried 5 yet.  I am still testing the secondary HA member at the backup site and still have the primary firewall on 2.1.5 that I can switch back to if needed.


  • Well, because we have a weird upstream connection, even though we are "assigned" a /24 it is not truly being routed to us.  The ISP takes the .1 and uses it for their hardware, and we are essentially on a switch port off of that.  Our firewall is .4, and we create ~244 IP aliases for .10 - .254 so "1:1" NAT will work correctly. (The pfSense has to respond to the ARP requests.)

    So yes, we manually fixed that, but it would have been a PITA through the webGUI … we downloaded the config.xml, did a search/replace, then restored it.
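
    For anyone else hitting this, the replace is simple enough to script. A rough sketch of what we did, assuming a single CARP VHID so every old-style reference maps to the same uniqid (filenames are just placeholders):

    # Rewrite the pre-2.3 "wan_vip<vhid>" interface references on the IP alias
    # entries to the new "_vip<uniqid>" form; the uniqid comes from the parent
    # CARP VIP's <uniqid> tag in the upgraded config (571a5af49dfaf above).
    sed 's|<interface>wan_vip212</interface>|<interface>_vip571a5af49dfaf</interface>|g' \
        config-old.xml > config-fixed.xml

    With more than one CARP VIP you would need one replace per VHID/uniqid pair.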

    As to my DNS/routing issues, further investigation shows that even though there is an entry under Interfaces -> WAN -> IPv4 Upstream Gateway, netstat -r from the command line shows no default gateway defined.

    So somehow the upgrade completely screwed up the routing table.  I assume it is similar, in that the 'naming convention' used in the XML was changed but not properly updated.

    Anyone have any thoughts on how to remotely fix that?  I think I will have to drive to the data center, delete the WAN and gateway config altogether, and create new ones.
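
    One stop-gap we may try from the console shell in the meantime (untested, and pfSense may well remove it again on the next filter reload; the gateway address below is just our ISP's .1 from above):

    # Manually re-add the missing default route, then verify it shows up.
    route add default AA.BB.CC.1
    netstat -rn | grep default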


  • The config upgrade issues with VIPs are listed under "Known regressions" in the release notes and have been fixed for 2.3.1. But that only impacts things that use VIPs; there are no routing changes involved there. The closest thing to that would be gateway groups specifying CARP IPs, if you have multi-WAN.

    Just having a gateway on WAN doesn't mean you have it marked as default. What do you have under System > Routing? There were no config changes in gateways. Make sure the appropriate gateway is marked as default.

    The fact that no gateways are shown at all for the DNS servers is odd; that's a first. Knowing what's under System > Routing would be helpful.



  • Ah … on the edit screen, "Disable this gateway" was checked.  That was not how it was set prior to the update, though.


  • The upgrade does not touch the gateways config at all, and nothing other than admin action will disable a gateway; that was disabled prior to the upgrade. Check your config history: Diag > Backup/restore, Config History tab. If it goes back far enough you'll see it there. Or if you have a config backup from prior to the upgrade, check that.
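
    If you'd rather check from the shell, the config history lives in /cf/conf/backup/ on a standard install, so something along these lines should narrow down when the flag appeared (tag names are from memory, adjust as needed):

    # Print which saved configs carry a disabled flag inside the <gateways> section.
    for f in /cf/conf/backup/config-*.xml; do
        sed -n '/<gateways>/,/<\/gateways>/p' "$f" | grep -q '<disabled' \
            && echo "gateway disabled in: $f"
    done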


  • Now getting crash report:

    [26-Apr-2016 19:50:39 America/New_York] PHP Fatal error:  Call to undefined function gettext() in /etc/inc/rrd.inc on line 60

    And CPU usage seems to cycle between 12-30%, where previously it was usually 5-10%.


  • /usr/local/sbin/check_reload_status is at 100%

    ps uxawww

    
    USER      PID  %CPU %MEM    VSZ   RSS TT  STAT STARTED       TIME COMMAND
    root       11 697.9  0.0      0   128  -  RL    7:09PM 3440:14.15 [idle]
    root      291 100.0  0.0  18888  4076  -  RNs   7:09PM  487:07.55 /usr/local/sbin/check_reload_status
    root    76207   0.8  0.4 268196 34532  -  S     3:22AM    0:00.19 php-fpm: pool nginx (php-fpm)
    root        0   0.0  0.0      0   672  -  DLs   7:09PM    0:01.08 [kernel]
    root        1   0.0  0.0   9136   824  -  ILs   7:09PM    0:00.02 /sbin/init --
    root        2   0.0  0.0      0    16  -  DL    7:09PM    0:00.00 [crypto]
    root        3   0.0  0.0      0    16  -  DL    7:09PM    0:00.00 [crypto returns]
    root        4   0.0  0.0      0    48  -  DL    7:09PM    0:00.00 [cam]
    root        5   0.0  0.0      0    16  -  DL    7:09PM    0:03.71 [pf purge]
    root        6   0.0  0.0      0    16  -  DL    7:09PM    0:00.00 [sctp_iterator]
    root        7   0.0  0.0      0    32  -  DL    7:09PM    0:00.26 [pagedaemon]
    root        8   0.0  0.0      0    16  -  DL    7:09PM    0:00.00 [vmdaemon]
    root        9   0.0  0.0      0    16  -  DL    7:09PM    0:00.00 [pagezero]
    root       10   0.0  0.0      0    16  -  DL    7:09PM    0:00.00 [audit]
    root       12   0.0  0.0      0  1008  -  WL    7:09PM    0:33.19 [intr]
    root       13   0.0  0.0      0   128  -  DL    7:09PM    0:00.00 [ng_queue]
    root       14   0.0  0.0      0    48  -  DL    7:09PM    0:00.01 [geom]
    root       15   0.0  0.0      0    16  -  DL    7:09PM    0:05.25 [rand_harvestq]
    root       16   0.0  0.0      0   160  -  DL    7:09PM    0:01.48 [usb]
    root       17   0.0  0.0      0    16  -  DL    7:09PM    0:00.03 [idlepoll]
    root       18   0.0  0.0      0    16  -  DL    7:09PM    0:00.09 [bufdaemon]
    root       19   0.0  0.0      0    16  -  DL    7:09PM    0:00.80 [syncer]
    root       20   0.0  0.0      0    16  -  DL    7:09PM    0:00.08 [vnlru]
    root       53   0.0  0.0      0    16  -  DL    7:09PM    0:00.03 [md0]
    root      292   0.0  0.0  18888  2292  -  IN    7:09PM    0:00.00 check_reload_status: Monitoring daemon of check_reload_status
    root      306   0.0  0.1  13624  4860  -  Ss    7:09PM    0:00.19 /sbin/devd -q
    root     7081   0.0  0.1  59068  6412  -  Ss    7:10PM    0:00.00 /usr/sbin/sshd
    root     7283   0.0  0.0  14612  2108  -  Is    7:10PM    0:00.00 /usr/local/sbin/sshlockout_pf 15
    root    29032   0.0  0.3 268192 26268  -  Ss    7:15PM    0:00.54 php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm)
    root    33941   0.0  0.0  14520  2328  -  Ss    7:10PM    0:15.01 /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -P /var/run/syslog.pid -f /var/etc/syslog.conf
    root    35747   0.0  0.0  12268  1876  -  Is    7:10PM    0:00.00 /usr/local/bin/minicron 240 /var/run/ping_hosts.pid /usr/local/bin/ping_hosts.sh
    root    35939   0.0  0.0  12268  1888  -  I     7:10PM    0:00.01 minicron: helper /usr/local/bin/ping_hosts.sh  (minicron)
    root    36346   0.0  0.0  12268  1876  -  Is    7:10PM    0:00.00 /usr/local/bin/minicron 3600 /var/run/expire_accounts.pid /usr/local/sbin/fcgicli -f /etc/rc.expireaccounts
    root    36954   0.0  0.0  12268  1888  -  I     7:10PM    0:00.00 minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.expireaccounts  (minicron)
    root    37018   0.0  0.0  12268  1876  -  Is    7:10PM    0:00.00 /usr/local/bin/minicron 86400 /var/run/update_alias_url_data.pid /usr/local/sbin/fcgicli -f /etc/rc.update_alias_url_data
    root    37598   0.0  0.0  12268  1888  -  I     7:10PM    0:00.00 minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.update_alias_url_data  (minicron)
    root    51427   0.0  0.0  14612  2180  -  Is    7:10PM    0:00.00 /usr/local/sbin/sshlockout_pf 15
    root    61609   0.0  0.0  16676  2328  -  Ss    7:10PM    0:18.68 /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
    root    61633   0.0  0.0   8168  1828  -  IN    3:21AM    0:00.00 sleep 60
    root    62234   0.0  0.0  18896  2412  -  Is    7:10PM    0:00.00 /usr/local/sbin/xinetd -syslog daemon -f /var/etc/xinetd.conf -pidfile /var/run/xinetd.pid
    root    65026   0.0  0.0  19108  2304  -  Is    7:10PM    0:03.60 /usr/local/bin/dpinger -S -r 0 -i WANGW -B 208.72.232.5 -p /var/run/dpinger_WANGW_208.72.232.5_208.72.232.1.pid -u /var/run/dpinger_WANGW_208.72.232.5_208.72.232.1.sock -C /etc/rc.gateway_alarm -d 0 -s 500 -l 2000 -t 60000 -A 1000 -D 500 -L 20 208.72.232.1
    root    70242   0.0  0.1  46196  6936  -  Is    7:10PM    0:00.00 nginx: master process /usr/local/sbin/nginx -c /var/etc/nginx-webConfigurator.conf (nginx)
    root    70398   0.0  0.1  46196  8048  -  S     7:10PM    0:01.07 nginx: worker process (nginx)
    root    70508   0.0  0.1  46196  7900  -  I     7:10PM    0:00.25 nginx: worker process (nginx)
    root    70869   0.0  0.0  16532  2260  -  Is    7:10PM    0:00.03 /usr/sbin/cron -s
    unbound 72696   0.0  0.4  92768 33500  -  Is    7:10PM    0:13.39 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
    root    73327   0.0  0.2  30140 17968  -  Ss    7:10PM    0:42.25 /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf -p /var/run/ntpd.pid
    root    73468   0.0  0.1  82268  7416  -  Ss    3:22AM    0:00.03 sshd: admin@pts/0 (sshd)
    root    50427   0.0  0.0  43440  2672 u0  Is    7:10PM    0:00.01 login [pam] (login)
    root    51588   0.0  0.0  17000  2640 u0  I     7:10PM    0:00.00 -sh (sh)
    root    51799   0.0  0.0  17000  2528 u0  I+    7:10PM    0:00.00 /bin/sh /etc/rc.initial
    root    82902   0.0  0.0  17000  2368 u0- IN    7:10PM    0:04.14 /bin/sh /var/db/rrd/updaterrd.sh
    root    50198   0.0  0.0  43440  2672 v0  Is    7:10PM    0:00.01 login [pam] (login)
    root    51582   0.0  0.0  17000  2640 v0  I     7:10PM    0:00.00 -sh (sh)
    root    51912   0.0  0.0  17000  2528 v0  I+    7:10PM    0:00.00 /bin/sh /etc/rc.initial
    root    74706   0.0  0.0  17000  2532  0  Ss    3:22AM    0:00.01 /bin/sh /etc/rc.initial
    root    77466   0.0  0.0  17340  3468  0  S     3:22AM    0:00.01 /bin/tcsh
    root    80136   0.0  0.0  18676  2248  0  R+    3:22AM    0:00.00 ps uxawww
    
    

  • @sforsythe:

    Now getting crash report:

    [26-Apr-2016 19:50:39 America/New_York] PHP Fatal error:  Call to undefined function gettext() in /etc/inc/rrd.inc on line 60

    Is the timestamp on that from the time when it was upgraded? Though it shouldn't generate a crash report, something like that would be expected post-upgrade but pre-reboot.

    If check_reload_status is still chewing up that much CPU, get truss output on it. Assuming its PID is still 291, run:

    truss -o /root/checkreload-truss.txt -p 291
    

    Let that run for about 10 seconds (it'll probably grow quickly), then hit Ctrl-C. Then run 'kill -9 291' to get rid of it and at least temporarily, if not permanently, fix that. Then, if you can, download that checkreload-truss.txt file and email it to me at cmb at pfsense dot org, with a link to this thread.


  • Hi cmb,

    The timestamp was right after a reboot, after re-enabling the 'default' gateway to see if it would clean up any issues.
    The update was ~7 days ago.  There have been a few crash reports since then, which I did forward to the dev team (via the GUI prompt), but I did not save the text of those (fairly certain they were different).

    I'll get the requested output emailed to you.

    Thanks for taking a look.
    shane


  • I just got the same error again when using the webGUI to make the following change:
    System -> General Setup -> checked "Do not use the DNS Forwarders as a DNS server for the firewall"

    
    Crash report begins.  Anonymous machine information:
    amd64
    10.3-RELEASE
    FreeBSD 10.3-RELEASE #4 05adf0a(RELENG_2_3_0): Mon Apr 11 19:09:19 CDT 2016     root@factory23-amd64-builder:/builder/factory-230/tmp/obj/builder/factory-230/tmp/FreeBSD-src/sys/pfSense
    
    Crash report details:
    PHP Errors:
    [27-Apr-2016 12:25:24 America/New_York] PHP Fatal error:  Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
    [27-Apr-2016 12:25:24 America/New_York] PHP Fatal error:  Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
    [27-Apr-2016 12:25:24 America/New_York] PHP Fatal error:  Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
    [27-Apr-2016 12:25:24 America/New_York] PHP Fatal error:  Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
    [27-Apr-2016 12:29:56 America/New_York] PHP Fatal error:  Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
    [27-Apr-2016 12:29:56 America/New_York] PHP Fatal error:  Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
    [27-Apr-2016 12:29:57 America/New_York] PHP Fatal error:  Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
    [27-Apr-2016 12:29:57 America/New_York] PHP Fatal error:  Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
    [27-Apr-2016 12:33:03 America/New_York] PHP Fatal error:  Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
    [27-Apr-2016 12:33:03 America/New_York] PHP Fatal error:  Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
    

  • Update:

    The issue seems to be related to the HA setup … something in how the upgrade converted the CARP settings, or in the pfsync internals, was causing network contention.

    We were seeing this issue at both datacenters; when we used the remote APC to turn off power to the secondary units, downloads almost instantly improved after ~1 min.

    Will need to go to the datacenter to investigate more....


  • When they are both up, I would do an ifconfig on both systems to make sure the IP aliases are only up on one of them.  Maybe the IP aliases are assigned to WAN instead of the CARP IP on one of your systems.  I did that by accident when I initially fixed mine after the upgrade bug and then realized I had assigned them to WAN instead of my CARP IP.  Worth a shot just in case, as I would expect the IPs being up on both systems to cause all kinds of crazy network issues.
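
    For example, something along these lines on each node (igb0 is just a placeholder for your WAN NIC):

    # Addresses actually configured on the WAN interface of this node:
    ifconfig igb0 | grep 'inet '

    # CARP state per VHID (should be MASTER on one node and BACKUP on the other):
    ifconfig | grep -i carp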


  • sforsythe: I replied back on your ticket but wanted to reply here as well to make sure you see it.

    adam65535's theory sounds plausible. The fact that turning off the secondary makes the problem stop points to something along the lines of an IP conflict, which is what the scenario he describes ends up being, or something similar along those lines.