Multiple update issues going from 2.2.2-Release to 2.3
-
Hello all,
We are having a lot of issues doing an upgrade from 2.2.2 to 2.3. We are attempting this on two different pfsense appliances.
Seeing these issues on both a Netgate SG-8860 and a C2758.
1. Loss of all Virtual IPs that were ip_alias type: [SOLVED - but it's a buggy 'upgrade', or it needs to be documented]
We have an HA setup with a shared CARP VIP, and rely heavily on IP aliases to make 1:1 NAT work. After the update everything appeared fine, but that was only because of the upstream ARP cache. After 4-6 hrs everything broke. We finally figured out what was going on when we realized the GUI did not have any of the Virtual IPs bound to any interface.
Previously the config looked like this:
<vip>
  <mode>carp</mode>
  <interface>wan</interface>
  <vhid>212</vhid>
  <advskew>0</advskew>
  <advbase>1</advbase>
  <type>single</type>
  <subnet_bits>24</subnet_bits>
  <subnet>AA.BB.CC.1</subnet>
</vip>
<vip>
  <mode>ipalias</mode>
  <interface>wan_vip212</interface>
  <type>single</type>
  <subnet_bits>24</subnet_bits>
  <subnet>AA.BB.CC.14</subnet>
</vip>
After the update, it looks like this:
<vip>
  <mode>carp</mode>
  <interface>wan</interface>
  <vhid>212</vhid>
  <advskew>0</advskew>
  <advbase>1</advbase>
  <type>single</type>
  <subnet_bits>24</subnet_bits>
  <subnet>AA.BB.CC.1</subnet>
  <uniqid>571a5af49dfaf</uniqid>
</vip>
<vip>
  <mode>ipalias</mode>
  <interface>wan_vip212</interface>
  <type>single</type>
  <subnet_bits>24</subnet_bits>
  <subnet>AA.BB.CC.14</subnet>
</vip>
With that in place, the GUI shows that the ipalias is not assigned to any interface, so it is not active. If you edit it and use the dropdown to assign an interface (WAN VIP), the XML changes to:
<vip>
  <mode>ipalias</mode>
  <interface>_vip571a5af49dfaf</interface>
  <type>single</type>
  <subnet_bits>24</subnet_bits>
  <subnet>AA.BB.CC.14</subnet>
  <uniqid>571a5af49e2ed</uniqid>
</vip>
So basically, for each ipalias the <interface> value needs to change from the old naming scheme of wan_vip212 to the new _vip571a5af49dfaf style.
2. What I assume is similar to #1: some naming scheme changed in the gateway or DNS config … currently I cannot resolve DNS.
If I go to System -> General -> DNS Server Settings, the dropdown to assign a gateway to each DNS server says 'none' and has no entries.
And yes, we do have a gateway defined.
3. Possibly related to that, but our throughput has dropped from 90 Mbps to 300 Kbps.
4. Possibly related to #2 … randomly/sporadically getting DNS rebind warnings.
5. We are also having issues syncing between the primary and secondary, but we will worry about that once we have at least one unit in good shape.
It all seems pretty thoroughly hosed, and at this point we are considering a reset to factory defaults, or just a clean install of the previous version and starting from scratch.
-
The IP Alias upgrade bug got me too, except I upgraded the HA pair from pfSense 2.1.5. I upgraded a backup site first, though, and discovered it there. They should probably put a note about it in the stickies in the Installation and Upgrades section. To fix it, I just assigned each IP Alias to the correct CARP interface and then saved; I assume you did the same thing.
I don't know about 2, 3, and 4, and haven't tried 5 yet. I am still testing the secondary HA member at the backup site, and I still have the primary firewall at 2.1.5 that I can switch to if needed.
-
Well, it's because we have a weird upstream connection: even though we are 'assigned' a /24, it is not truly being routed to us. The ISP takes the .1 and uses it for their hardware, and we are essentially on a switch port off of that. Our firewall is .4, and we create ~244 IP aliases for .10 - .254 so that "1:1" NAT will work correctly. (The pfSense has to respond to ARP requests.)
So yes, we manually fixed that, but it would have been a PITA through the webgui … we downloaded config.xml, did a search/replace, then restored it.
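For anyone else hitting this, here is a rough sketch of how that rename could be scripted against a config.xml backup instead of doing it by hand (this is NOT what we actually ran; we just did a plain text search/replace). It assumes the element layout shown in the snippets above, and that the old-style names follow the interface_vipVHID pattern (e.g. wan_vip212). Test it on a copy and restore the result through Diagnostics > Backup/Restore:

# rewrite_vip_interfaces.py - hedged sketch, assumes the <vip> layout shown above
import xml.etree.ElementTree as ET

tree = ET.parse("config.xml")   # a downloaded backup, not the live file
root = tree.getroot()

# Build a map from old-style names like "wan_vip212" to the new "_vip<uniqid>"
# names, using the CARP VIP entries (which carry both <vhid> and <uniqid>).
old_to_new = {}
for vip in root.iter("vip"):
    if vip.findtext("mode") == "carp":
        iface = vip.findtext("interface")
        vhid = vip.findtext("vhid")
        uniqid = vip.findtext("uniqid")
        if iface and vhid and uniqid:
            old_to_new[f"{iface}_vip{vhid}"] = f"_vip{uniqid}"

# Point every IP alias that still references an old-style name at the new one.
changed = 0
for vip in root.iter("vip"):
    if vip.findtext("mode") == "ipalias":
        iface_el = vip.find("interface")
        if iface_el is not None and iface_el.text in old_to_new:
            iface_el.text = old_to_new[iface_el.text]
            changed += 1

print(f"rewrote {changed} ipalias interface reference(s)")
tree.write("config-fixed.xml", xml_declaration=True, encoding="utf-8")

Fixing each alias by hand in the GUI does the same thing; scripting it only matters when you have a couple hundred of them like we do.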
As for my DNS/routing issues, some further investigation shows that even though there is an entry under Interfaces -> WAN -> IPv4 Upstream Gateway, from the command line netstat -r shows no default gateway is defined.
So somehow the upgrade completely screwed up the routing table. I assume it is similar to #1: the 'naming convention' used in the XML was changed but not properly updated.
Anyone have any thoughts on how to fix that remotely? I think I will have to drive to the data center, delete the WAN and gateway config altogether, and create new ones.
-
The config upgrade issues with VIPs are listed under "Known regressions" in the release notes, and have been fixed for 2.3.1. But that only impacts things that use VIPs; there are no routing changes involved there. The closest thing to that would be gateway groups specifying CARP IPs, if you have multi-WAN.
Just having a gateway on WAN doesn't mean you have it marked as default. What do you have under System > Routing? There were no config changes to gateways. Make sure the appropriate gateway is marked as default.
The fact that no gateways are shown at all for the DNS servers is odd; that's a first. Knowing what's under System > Routing would be helpful.
-
Ah … on the edit screen, "Disable this gateway" was checked. That was not how it was prior to the update, though.
-
The upgrade does not touch the gateways config at all, and nothing other than admin action will disable a gateway. That was disabled prior to the upgrade. Check your config history: Diagnostics > Backup/Restore, Config History tab. If it goes back far enough you'll see that. Or if you have a config backup from prior to the upgrade, check that.
-
Now getting crash report:
[26-Apr-2016 19:50:39 America/New_York] PHP Fatal error: Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
And CPU usage seems to cycle between 12-30%, where previously it was usually 5-10%.
-
/usr/local/sbin/check_reload_status is at 100%
ps uxawww
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 11 697.9 0.0 0 128 - RL 7:09PM 3440:14.15 [idle]
root 291 100.0 0.0 18888 4076 - RNs 7:09PM 487:07.55 /usr/local/sbin/check_reload_status
root 76207 0.8 0.4 268196 34532 - S 3:22AM 0:00.19 php-fpm: pool nginx (php-fpm)
root 0 0.0 0.0 0 672 - DLs 7:09PM 0:01.08 [kernel]
root 1 0.0 0.0 9136 824 - ILs 7:09PM 0:00.02 /sbin/init --
root 2 0.0 0.0 0 16 - DL 7:09PM 0:00.00 [crypto]
root 3 0.0 0.0 0 16 - DL 7:09PM 0:00.00 [crypto returns]
root 4 0.0 0.0 0 48 - DL 7:09PM 0:00.00 [cam]
root 5 0.0 0.0 0 16 - DL 7:09PM 0:03.71 [pf purge]
root 6 0.0 0.0 0 16 - DL 7:09PM 0:00.00 [sctp_iterator]
root 7 0.0 0.0 0 32 - DL 7:09PM 0:00.26 [pagedaemon]
root 8 0.0 0.0 0 16 - DL 7:09PM 0:00.00 [vmdaemon]
root 9 0.0 0.0 0 16 - DL 7:09PM 0:00.00 [pagezero]
root 10 0.0 0.0 0 16 - DL 7:09PM 0:00.00 [audit]
root 12 0.0 0.0 0 1008 - WL 7:09PM 0:33.19 [intr]
root 13 0.0 0.0 0 128 - DL 7:09PM 0:00.00 [ng_queue]
root 14 0.0 0.0 0 48 - DL 7:09PM 0:00.01 [geom]
root 15 0.0 0.0 0 16 - DL 7:09PM 0:05.25 [rand_harvestq]
root 16 0.0 0.0 0 160 - DL 7:09PM 0:01.48 [usb]
root 17 0.0 0.0 0 16 - DL 7:09PM 0:00.03 [idlepoll]
root 18 0.0 0.0 0 16 - DL 7:09PM 0:00.09 [bufdaemon]
root 19 0.0 0.0 0 16 - DL 7:09PM 0:00.80 [syncer]
root 20 0.0 0.0 0 16 - DL 7:09PM 0:00.08 [vnlru]
root 53 0.0 0.0 0 16 - DL 7:09PM 0:00.03 [md0]
root 292 0.0 0.0 18888 2292 - IN 7:09PM 0:00.00 check_reload_status: Monitoring daemon of check_reload_status
root 306 0.0 0.1 13624 4860 - Ss 7:09PM 0:00.19 /sbin/devd -q
root 7081 0.0 0.1 59068 6412 - Ss 7:10PM 0:00.00 /usr/sbin/sshd
root 7283 0.0 0.0 14612 2108 - Is 7:10PM 0:00.00 /usr/local/sbin/sshlockout_pf 15
root 29032 0.0 0.3 268192 26268 - Ss 7:15PM 0:00.54 php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm)
root 33941 0.0 0.0 14520 2328 - Ss 7:10PM 0:15.01 /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -P /var/run/syslog.pid -f /var/etc/syslog.conf
root 35747 0.0 0.0 12268 1876 - Is 7:10PM 0:00.00 /usr/local/bin/minicron 240 /var/run/ping_hosts.pid /usr/local/bin/ping_hosts.sh
root 35939 0.0 0.0 12268 1888 - I 7:10PM 0:00.01 minicron: helper /usr/local/bin/ping_hosts.sh (minicron)
root 36346 0.0 0.0 12268 1876 - Is 7:10PM 0:00.00 /usr/local/bin/minicron 3600 /var/run/expire_accounts.pid /usr/local/sbin/fcgicli -f /etc/rc.expireaccounts
root 36954 0.0 0.0 12268 1888 - I 7:10PM 0:00.00 minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.expireaccounts (minicron)
root 37018 0.0 0.0 12268 1876 - Is 7:10PM 0:00.00 /usr/local/bin/minicron 86400 /var/run/update_alias_url_data.pid /usr/local/sbin/fcgicli -f /etc/rc.update_alias_url_data
root 37598 0.0 0.0 12268 1888 - I 7:10PM 0:00.00 minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.update_alias_url_data (minicron)
root 51427 0.0 0.0 14612 2180 - Is 7:10PM 0:00.00 /usr/local/sbin/sshlockout_pf 15
root 61609 0.0 0.0 16676 2328 - Ss 7:10PM 0:18.68 /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
root 61633 0.0 0.0 8168 1828 - IN 3:21AM 0:00.00 sleep 60
root 62234 0.0 0.0 18896 2412 - Is 7:10PM 0:00.00 /usr/local/sbin/xinetd -syslog daemon -f /var/etc/xinetd.conf -pidfile /var/run/xinetd.pid
root 65026 0.0 0.0 19108 2304 - Is 7:10PM 0:03.60 /usr/local/bin/dpinger -S -r 0 -i WANGW -B 208.72.232.5 -p /var/run/dpinger_WANGW_208.72.232.5_208.72.232.1.pid -u /var/run/dpinger_WANGW_208.72.232.5_208.72.232.1.sock -C /etc/rc.gateway_alarm -d 0 -s 500 -l 2000 -t 60000 -A 1000 -D 500 -L 20 208.72.232.1
root 70242 0.0 0.1 46196 6936 - Is 7:10PM 0:00.00 nginx: master process /usr/local/sbin/nginx -c /var/etc/nginx-webConfigurator.conf (nginx)
root 70398 0.0 0.1 46196 8048 - S 7:10PM 0:01.07 nginx: worker process (nginx)
root 70508 0.0 0.1 46196 7900 - I 7:10PM 0:00.25 nginx: worker process (nginx)
root 70869 0.0 0.0 16532 2260 - Is 7:10PM 0:00.03 /usr/sbin/cron -s
unbound 72696 0.0 0.4 92768 33500 - Is 7:10PM 0:13.39 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root 73327 0.0 0.2 30140 17968 - Ss 7:10PM 0:42.25 /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf -p /var/run/ntpd.pid
root 73468 0.0 0.1 82268 7416 - Ss 3:22AM 0:00.03 sshd: admin@pts/0 (sshd)
root 50427 0.0 0.0 43440 2672 u0 Is 7:10PM 0:00.01 login [pam] (login)
root 51588 0.0 0.0 17000 2640 u0 I 7:10PM 0:00.00 -sh (sh)
root 51799 0.0 0.0 17000 2528 u0 I+ 7:10PM 0:00.00 /bin/sh /etc/rc.initial
root 82902 0.0 0.0 17000 2368 u0- IN 7:10PM 0:04.14 /bin/sh /var/db/rrd/updaterrd.sh
root 50198 0.0 0.0 43440 2672 v0 Is 7:10PM 0:00.01 login [pam] (login)
root 51582 0.0 0.0 17000 2640 v0 I 7:10PM 0:00.00 -sh (sh)
root 51912 0.0 0.0 17000 2528 v0 I+ 7:10PM 0:00.00 /bin/sh /etc/rc.initial
root 74706 0.0 0.0 17000 2532 0 Ss 3:22AM 0:00.01 /bin/sh /etc/rc.initial
root 77466 0.0 0.0 17340 3468 0 S 3:22AM 0:00.01 /bin/tcsh
root 80136 0.0 0.0 18676 2248 0 R+ 3:22AM 0:00.00 ps uxawww
-
Now getting crash report:
[26-Apr-2016 19:50:39 America/New_York] PHP Fatal error: Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
Is the timestamp on that from the time when it was upgraded? Though it shouldn't generate a crash report, something like that would be expected post-upgrade but pre-reboot.
If check_reload_status is still chewing up that much CPU, get truss output on it. Assuming its PID is still 291, run:
truss -o /root/checkreload-truss.txt -p 291
Let that run for about 10 seconds (it'll probably get big quickly), then hit Ctrl-C. Then run 'kill -9 291' to get rid of it and at least temporarily, if not permanently, fix that. Then if you can, download that checkreload-truss.txt file and send it to me, email to cmb at pfsense dot org, with a link to this thread.
-
Hi cmb,
The time stamp was right after a reboot, after re-enabling the 'default' gateway to see if it would clean up any issues.
The update was ~7 days ago. There have been a few crash reports since then, which I did forward to the dev team (via the GUI request), but I did not save the text from those (fairly certain they were different). I'll get the requested output emailed to you.
Thanks for taking a look.
shane
-
I just got the same error again when using the webgui and made the following change
System -> General Setup -> checked "Do not use the DNS Forwarders as a DNS server for the firewall"
Crash report begins.
Anonymous machine information:
amd64
10.3-RELEASE
FreeBSD 10.3-RELEASE #4 05adf0a(RELENG_2_3_0): Mon Apr 11 19:09:19 CDT 2016 root@factory23-amd64-builder:/builder/factory-230/tmp/obj/builder/factory-230/tmp/FreeBSD-src/sys/pfSense
Crash report details:
PHP Errors:
[27-Apr-2016 12:25:24 America/New_York] PHP Fatal error: Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
[27-Apr-2016 12:25:24 America/New_York] PHP Fatal error: Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
[27-Apr-2016 12:25:24 America/New_York] PHP Fatal error: Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
[27-Apr-2016 12:25:24 America/New_York] PHP Fatal error: Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
[27-Apr-2016 12:29:56 America/New_York] PHP Fatal error: Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
[27-Apr-2016 12:29:56 America/New_York] PHP Fatal error: Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
[27-Apr-2016 12:29:57 America/New_York] PHP Fatal error: Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
[27-Apr-2016 12:29:57 America/New_York] PHP Fatal error: Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
[27-Apr-2016 12:33:03 America/New_York] PHP Fatal error: Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
[27-Apr-2016 12:33:03 America/New_York] PHP Fatal error: Call to undefined function gettext() in /etc/inc/rrd.inc on line 60
-
Update:
The issue seems related to the HA setup … something in how the upgrade converted the CARP settings, or in the pfsync internals, was causing network contention.
We were seeing this issue at both datacenters; when we used the remote APC to turn power off to the secondary units, downloads improved almost instantly after ~1 min.
Will need to go to the datacenter to investigate more...
-
When they are both up, I would do an ifconfig on both systems to make sure the IP aliases are only up on one system. Maybe the IP aliases are assigned to WAN instead of the CARP VIP on one of your systems. I did that by accident when I initially fixed mine after the upgrade bug; I then realized I had assigned them to WAN instead of my CARP IP. Worth a shot just in case anyway, as I'd expect that to cause all kinds of crazy network issues if the IPs are up on both systems.
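If it helps, here is a rough Python sketch of that comparison (just an illustration; the file names are made up, and it assumes you have saved the ifconfig output from each node to a text file). Keep in mind the CARP VIPs will legitimately show up on both nodes, so what you're looking for in the overlap is the plain IP aliases:

# compare_ifconfig.py - rough sketch: list IPv4 addresses present on both nodes
import re
import sys

def addrs(path):
    # Collect every IPv4 address from a saved `ifconfig` dump.
    with open(path) as fh:
        return set(re.findall(r"inet (\d+\.\d+\.\d+\.\d+)", fh.read()))

primary = addrs(sys.argv[1])     # e.g. primary-ifconfig.txt
secondary = addrs(sys.argv[2])   # e.g. secondary-ifconfig.txt

dupes = sorted(primary & secondary)
if dupes:
    print("Addresses up on BOTH nodes (check whether these are CARP VIPs or stray aliases):")
    for ip in dupes:
        print(" ", ip)
else:
    print("No overlapping IPv4 addresses found.")

Run it as: python3 compare_ifconfig.py primary-ifconfig.txt secondary-ifconfig.txt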
-
sforsythe: I replied back on your ticket but wanted to reply here as well to make sure you see it.
adam65535's theory sounds plausible. The fact that turning off the secondary makes the problem stop indicates a problem along the lines of an IP conflict, which is what the scenario he describes ends up being. If it's not exactly that, it's something similar along those lines.