Upgrade success (Edit: not quite)

phil.davis

With pfSense 2.1-RELEASE and exactly the same config as the post above, the results are good and log messages are quite different:
At the start, with all links up, /tmp/rules.debug has:

GWNcellPriority = "  route-to { ( vr2 10.49.174.250 )  }  "

Pull OPT1 cable:
System Log:

Jan 18 22:50:59 	kernel: vr2: link state changed to DOWN
Jan 18 22:51:03 	php: rc.linkup: Hotplug event detected for OPT1(opt1) but ignoring since interface is configured with static IP (10.49.174.1 )
Jan 18 22:51:38 	check_reload_status: updating dyndns OPT1GW
Jan 18 22:51:38 	check_reload_status: Restarting ipsec tunnels
Jan 18 22:51:38 	check_reload_status: Restarting OpenVPN tunnels/interfaces
Jan 18 22:51:38 	check_reload_status: Reloading filter

Gateways Log:

Jan 18 22:51:28 	apinger: ALARM: OPT1GW(8.8.4.4) *** OPT1GWdown ***

/tmp/rules.debug changes within 1 minute to have:

GWNcellPriority = "  route-to { ( vr1 10.49.175.250 )  }  "

checkip has changed to the WAN public IP - good
Internet browsing has failed over to WAN inside 1 minute - good

Replace OPT1 cable:
System Log:

Jan 18 22:54:19 	check_reload_status: rc.newwanip starting vr2
Jan 18 22:54:25 	php: rc.newwanip: rc.newwanip: Informational is starting vr2.
Jan 18 22:54:25 	php: rc.newwanip: rc.newwanip: on (IP address: 10.49.174.1) (interface: opt1) (real interface: vr2).
Jan 18 22:54:25 	php: rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through 10.49.175.250
Jan 18 22:54:25 	php: rc.newwanip: Removing static route for monitor 8.8.4.4 and adding a new route through 10.49.174.250
Jan 18 22:54:29 	check_reload_status: updating dyndns OPT1GW
Jan 18 22:54:29 	check_reload_status: Restarting ipsec tunnels
Jan 18 22:54:29 	check_reload_status: Restarting OpenVPN tunnels/interfaces
Jan 18 22:54:29 	check_reload_status: Reloading filter
Jan 18 22:54:30 	php: rc.newwanip: Resyncing OpenVPN instances for interface OPT1.
Jan 18 22:54:30 	php: rc.newwanip: Creating rrd update script
Jan 18 22:54:32 	php: rc.newwanip: pfSense package system has detected an ip change 0.0.0.0 -> 10.49.174.1 ... Restarting packages.
Jan 18 22:54:33 	check_reload_status: Starting packages
Jan 18 22:54:40 	php: rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use OPT1GW.
Jan 18 22:54:45 	php: rc.start_packages: Restarting/Starting all packages.
Jan 18 22:54:53 	check_reload_status: Reloading filter

Gateways Log:

Jan 18 22:54:19 	apinger: alarm canceled: OPT1GW(8.8.4.4) *** OPT1GWdown ***
Jan 18 22:54:25 	apinger: SIGHUP received, reloading configuration.

/tmp/rules.debug has quickly changed back to:

GWNcellPriority = "  route-to { ( vr2 10.49.174.250 )  }  "

checkip has changed back to OPT1 public IP - good
Internet browsing has failed back to OPT1 inside 1 minute - good
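The /tmp/rules.debug check repeated at each step above could be scripted rather than done by eye. A minimal sketch, assuming the gateway-group macro always has the single-target form shown in this post (a group with several gateways in one tier would need a looser pattern); the function name active_gateway is made up for illustration and is not part of pfSense:

```python
import re

# Matches the single-target macro form shown in this post, e.g.
#   GWNcellPriority = "  route-to { ( vr2 10.49.174.250 )  }  "
ROUTE_TO = re.compile(
    r'^(\w+)\s*=\s*"\s*route-to\s*\{\s*\(\s*(\S+)\s+(\S+)\s*\)\s*\}\s*"'
)

def active_gateway(rules_text, group):
    """Return (interface, gateway_ip) that `group` currently routes to, or None."""
    for line in rules_text.splitlines():
        m = ROUTE_TO.match(line.strip())
        if m and m.group(1) == group:
            return m.group(2), m.group(3)
    return None
```

Feeding it the contents of /tmp/rules.debug before and after pulling the cable should show the group moving from vr2 to vr1 and back.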

There are log messages in 2.1 from rc.linkup and rc.newwanip that do good stuff for failover - none of that appears in the 2.1.1 logs. Instead, 2.1.1 has the apinger "exited with status: 255" thing.

What happened in 2.1.1? Is it something in pfsense-tools that was changed recently and intended for 2.2?

phil.davis

There have been some changes to check_reload_status in pfsense-tools. Maybe they will help the failover processing?
I don't have a suitable multi-WAN test environment at the moment (I am traveling), but on Sunday I will be able to do real testing. So please let me know if there is any change worth testing; I can get the latest snapshot on Saturday night or Sunday and test it.

jimp

Yes, the snapshots from Thursday afternoon or later should have the check_reload_status fixes which hopefully will address the behavior some are seeing in this and other similar threads. If someone can upgrade and test again it would be appreciated.

athurdent

My test KVM still shows that error when a gateway goes down, so I haven't tried it on my "production" machine:

Jan 25 07:59:00 pfsense-kvm.local-lan apinger: Error while starting command form alarm(down) on target(192.168.xxx.254-WANGW)
Jan 25 07:59:00 pfsense-kvm.local-lan apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns WANGW' -c 'service reload ipsecdns' -c 'service reload openvpn WANGW' -c 'filter reload' ) exited with status: 255

athurdent

Tried the latest snapshot ( Sat Jan 25 10:43:17 EST 2014 ) and the problem is still there.

phil.davis

2.1.1-PRERELEASE (i386)
built on Sat Jan 25 10:00:56 EST 2014
FreeBSD 8.3-RELEASE-p14

Test Alix 2D13 with its WAN connected to my home LAN. The test WAN gets DHCP from home LAN. WAN gateway monitor IP is set to an external IP on the real internet. Things are running nicely.
I pull the home WAN as a test. As expected, the test WAN gateway goes to "pending" and eventually "offline" as apinger fails to get a ping response from the monitor IP. So apinger takes some action, but this still appears in the gateway log:

Jan 26 22:30:47 	apinger: ALARM: WAN_DHCP(216.146.35.35) *** WAN_DHCPdown ***
Jan 26 22:30:57 	apinger: Error while starting command form alarm(WAN_DHCPdown) on target(216.146.35.35-WAN_DHCP)
Jan 26 22:30:57 	apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns WAN_DHCP' -c 'service reload ipsecdns' -c 'service reload openvpn WAN_DHCP' -c 'filter reload' ) exited with status: 255

Then I connect home WAN again (so I can post this :) ). apinger on the test pfSense starts getting ping responses from the monitor IP - good. It attempts to take action when it decides the WAN gateway is up, and this appears in the gateway log:

Jan 26 22:34:00 	apinger: alarm canceled: WAN_DHCP(216.146.35.35) *** WAN_DHCPdown ***
Jan 26 22:34:10 	apinger: Error while starting command form alarm(WAN_DHCPdown) on target(216.146.35.35-WAN_DHCP)
Jan 26 22:34:10 	apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns WAN_DHCP' -c 'service reload ipsecdns' -c 'service reload openvpn WAN_DHCP' -c 'filter reload' ) exited with status: 255

So the "exited with status: 255" thing happens for all gateway transitions - no need for complicated multi-WAN failover scenarios to see this error message.
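To check a new snapshot for this quickly, the gateway log can be scanned for alarm commands that exited non-zero. A minimal sketch, assuming log lines in the format pasted above; the function name failed_commands is made up for illustration:

```python
import re

# Matches lines such as:
#   apinger: command (/usr/local/sbin/pfSctl -c '...' ) exited with status: 255
STATUS = re.compile(r"apinger: command \((.*)\) exited with status: (\d+)")

def failed_commands(log_text):
    """Return (command, status) for each apinger alarm command with a non-zero status."""
    failures = []
    for line in log_text.splitlines():
        m = STATUS.search(line)
        if m and int(m.group(2)) != 0:
            failures.append((m.group(1).strip(), int(m.group(2))))
    return failures
```

An empty list after a gateway down/up cycle would mean no such error lines were logged.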

jimp

Finally locked in a fix for this. New snapshots building now should be OK. It'll be a few hours before they upload.

athurdent

@jimp:

Finally locked in a fix for this. New snapshots building now should be OK. It'll be a few hours before they upload.

Great thanks, no more error messages.

2 questions for things I ran into while testing failover with that new version:

Are states still flushed when a gateway goes down and the option "State Killing on Gateway Failure" is not checked? I have default gateway switching on.
I started a ping to an external host, pulled my primary gateway and the ping stopped, never recovered. Shouldn't it recover and automatically use my second gateway?

Is there a configuration option to switch off those
"MONITOR: <gateway> is down, removing from routing group <group>" messages? They keep on coming and filling my mailbox as long as one gateway is down, but one message per gateway group would really be enough. Even worse, one never gets a mail when the gateway has recovered, so this whole feature is a little useless (at least for me) and I have to rely on my monitoring system anyway.

athurdent

Regarding state flushing, it seems that at some point all the states are flushed, not just those of the gateway going down.
I would suggest modifying the anti-lockout rule like this, using the "no state" feature of pf:

# make sure the user cannot lock himself out of the webConfigurator or SSH
pass in  quick on {$lanif} proto tcp from any to ({$lanif}) port { {$alports} } no state label "anti-lockout rule"
pass out quick on {$lanif} proto tcp from ({$lanif}) port { {$alports} } to any no state label "anti-lockout rule"

That way you can keep using your SSH/GUI session even if all states get flushed. The GUI "Reset States" feature would also benefit from that, and you could get rid of:
"NOTE: If you reset the firewall state table, the browser session may appear to be hung after clicking "Reset". Simply refresh the page to continue."

doktornotor

@jimp:

Finally locked in a fix for this. New snapshots building now should be OK. It'll be a few hours before they upload.

Confirmed fixed, finally… Yay! 8)

phil.davis

2.1.1-PRERELEASE (i386)
built on Sun Feb 2 12:42:30 EST 2014
FreeBSD 8.3-RELEASE-p14

All is well. I tried pulling out the phone line on the ADSL (default gateway): the dynamic DNS name, OpenVPN road warrior server and 2 OpenVPN site-2-site clients all switched to using OPT1. Plugged in the phone line again, ADSL negotiated, apinger detected the WAN online again, and everything failed back.
Pulled the cable on OPT1 (which had general internet traffic directed to it as tier1 of a gateway group). General browsing failed over to WAN. OpenVPN server and clients remained running untouched (as they should because they were already on WAN - so no need to restart them).
During messing about, ADSL WAN went down by itself - had some minutes with both WAN and OPT1 down, and things recovered fine from that as links became available again. Always good to have the ISP give you a real test ;)
Gateways log tab has nice clean entries like this:

Feb 3 10:24:32 	apinger: Starting Alarm Pinger, apinger(28478)
Feb 3 10:30:25 	apinger: SIGHUP received, reloading configuration.
Feb 3 10:52:28 	apinger: ALARM: WANGW(8.8.8.8) *** down ***
Feb 3 11:00:52 	apinger: alarm canceled: WANGW(8.8.8.8) *** down ***
Feb 3 11:08:57 	apinger: ALARM: WANGW(8.8.8.8) *** down ***
Feb 3 11:09:10 	apinger: ALARM: OPT1GW(8.8.4.4) *** OPT1GWdown ***
Feb 3 11:10:49 	apinger: SIGHUP received, reloading configuration.
Feb 3 11:12:59 	apinger: alarm canceled: WANGW(8.8.8.8) *** down ***
Feb 3 11:20:12 	apinger: SIGHUP received, reloading configuration.
Feb 3 11:20:23 	apinger: alarm canceled: OPT1GW(8.8.4.4) *** OPT1GWdown ***
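For anyone who wants outage lengths out of a gateways log like this, the ALARM / "alarm canceled" pairs can be matched up per gateway. A minimal sketch, assuming the line format shown above (no hostname field), English month names, and a year supplied by the caller since the log omits it; the function name outages is made up for illustration. The times are alarm durations as apinger logged them, not exact link-down times.

```python
import re
from datetime import datetime

# Captures timestamp, event type and gateway name from lines such as:
#   Feb 3 10:52:28  apinger: ALARM: WANGW(8.8.8.8) *** down ***
EVENT = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]+)\s+apinger: (ALARM|alarm canceled): (\w+)\("
)

def outages(log_text, year=2014):
    """Return (gateway, seconds_down) for each completed ALARM/cancel pair, in log order."""
    down_since = {}
    result = []
    for line in log_text.splitlines():
        m = EVENT.match(line.strip())
        if not m:
            continue  # e.g. SIGHUP or startup lines
        stamp = " ".join(m.group(1).split())  # normalise spacing in the date
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        gateway = m.group(3)
        if m.group(2) == "ALARM":
            down_since.setdefault(gateway, when)
        elif gateway in down_since:
            delta = when - down_since.pop(gateway)
            result.append((gateway, int(delta.total_seconds())))
    return result
```

Run over the entries above, this gives two WANGW alarms of 504 and 242 seconds and one OPT1GW alarm of 673 seconds.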