Upgrade success (Edit: not quite)

cmb

does your gateway status change under Status>Gateways? What is apinger logging?

athurdent

Thanks for looking into this, cmb!

My IPv6 tunnel (gif0 tunnel to SixXs) also depends on my DHCP cable connection, so that one goes down to, which is expected, just fyi.
Both Gateways show Loss 100% and Offline.

Apinger Log:

Jan 18 17:14:24 vpn apinger: ALARM: WAN1_DHCP(5.xxx.xxx.1)  *** down ***
Jan 18 17:14:24 vpn apinger: ALARM: WAN_SixXS(2001:xxxx:xxxx:xxxx::1)  *** down ***
Jan 18 17:14:34 vpn apinger: Error while starting command form alarm(down) on target(5.xxx.xxx.1-WAN1_DHCP)
Jan 18 17:14:34 vpn apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns WAN1_DHCP' -c 'service reload ipsecdns' -c 'service reload openvpn WAN1_DHCP' -c 'filter reload' ) exited with status: 255
Jan 18 17:14:34 vpn apinger: Error while starting command form alarm(down) on target(2001:xxxx:xxxx:xxxx::1-WAN_SixXS)
Jan 18 17:14:34 vpn apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns WAN_SixXS' -c 'service reload ipsecdns' -c 'service reload openvpn WAN_SixXS' -c 'filter reload' ) exited with status: 255

phil.davis

I have setup 2 WANs:
WAN (vr1) - gateway 10.49.175.250
OPT1 (vr2) - gateway 10.49.174.250
NcellPriority gateway group - OPT1GW tier 1, WANGW tier 2
Firewall rule to policy-route general traffic into NcellPriority gateway group.

Both gateways up, traffic routes through OPT1, correct.
/tmp/rules.debug has the correct line directing to OPT1:

GWNcellPriority = "  route-to { ( vr2 10.49.174.250 )  }  "

Pull the cable from OPT1 (vr2)

System Log:

Jan 18 21:53:47 	kernel: vr2: link state changed to DOWN

Gateway log tab:

Jan 18 21:54:15 	apinger: ALARM: OPT1GW(8.8.4.4) *** OPT1GWdown ***
Jan 18 21:54:25 	apinger: Error while starting command form alarm(OPT1GWdown) on target(8.8.4.4-OPT1GW)
Jan 18 21:54:25 	apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns OPT1GW' -c 'service reload ipsecdns' -c 'service reload openvpn OPT1GW' -c 'filter reload' ) exited with status: 255

Look in /tmp/rules.debug right up to 21:59 and it still has:

GWNcellPriority = "  route-to { ( vr2 10.49.174.250 )  }  "

A couple of seconds after 22:00, the line changes to:

GWNcellPriority = "  route-to { ( vr1 10.49.175.250 )  }  "

and now traffic is routed out WAN (vr1), and the general traffic failover is effective.
The regular cron job that runs every 15 minutes has done a recalculation of the rules and put the currently-correct thing in place.

Now I put the cable back in to OPT1 at 22:08
System log:

Jan 18 22:08:49 	kernel: vr2: link state changed to UP

Gateway log tab:

Jan 18 22:08:51 	apinger: alarm canceled: OPT1GW(8.8.4.4) *** OPT1GWdown ***
Jan 18 22:09:01 	apinger: Error while starting command form alarm(OPT1GWdown) on target(8.8.4.4-OPT1GW)
Jan 18 22:09:01 	apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns OPT1GW' -c 'service reload ipsecdns' -c 'service reload openvpn OPT1GW' -c 'filter reload' ) exited with status: 255

But /tmp/rules.debug up to 10:14 still has:

GWNcellPriority = "  route-to { ( vr1 10.49.175.250 )  }  "

then after 10:15 it changes to:

GWNcellPriority = "  route-to { ( vr2 10.49.174.250 )  }  "

and the failback/recovery to OPT1 is effective at the 15-minute filter reload.

filter_configure_sync does not seem to be happening at the time of the gateway status change, only at the 15-minute cron job interval.

Now to switch slices back to 2.1 and see if it was always like this…

phil.davis

With pfSense 2.1-RELEASE and exactly the same config as the post above, the results are good and log messages are quite different:
At the start, with all links up, /tmp/rules.debug has:

GWNcellPriority = "  route-to { ( vr2 10.49.174.250 )  }  "

Pull OPT1 cable:
System Log:

Jan 18 22:50:59 	kernel: vr2: link state changed to DOWN
Jan 18 22:51:03 	php: rc.linkup: Hotplug event detected for OPT1(opt1) but ignoring since interface is configured with static IP (10.49.174.1 )
Jan 18 22:51:38 	check_reload_status: updating dyndns OPT1GW
Jan 18 22:51:38 	check_reload_status: Restarting ipsec tunnels
Jan 18 22:51:38 	check_reload_status: Restarting OpenVPN tunnels/interfaces
Jan 18 22:51:38 	check_reload_status: Reloading filter

Gateways Log:

Jan 18 22:51:28 	apinger: ALARM: OPT1GW(8.8.4.4) *** OPT1GWdown ***

/tmp/rules.debug changes within 1 minute to have:

GWNcellPriority = "  route-to { ( vr1 10.49.175.250 )  }  "

checkip has changed to the WAN public IP - good
Internet browsing has failed over to WAN inside 1 minute - good

Replace OPT1 cable:
System Log:

Jan 18 22:54:19 	check_reload_status: rc.newwanip starting vr2
Jan 18 22:54:25 	php: rc.newwanip: rc.newwanip: Informational is starting vr2.
Jan 18 22:54:25 	php: rc.newwanip: rc.newwanip: on (IP address: 10.49.174.1) (interface: opt1) (real interface: vr2).
Jan 18 22:54:25 	php: rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through 10.49.175.250
Jan 18 22:54:25 	php: rc.newwanip: Removing static route for monitor 8.8.4.4 and adding a new route through 10.49.174.250
Jan 18 22:54:29 	check_reload_status: updating dyndns OPT1GW
Jan 18 22:54:29 	check_reload_status: Restarting ipsec tunnels
Jan 18 22:54:29 	check_reload_status: Restarting OpenVPN tunnels/interfaces
Jan 18 22:54:29 	check_reload_status: Reloading filter
Jan 18 22:54:30 	php: rc.newwanip: Resyncing OpenVPN instances for interface OPT1.
Jan 18 22:54:30 	php: rc.newwanip: Creating rrd update script
Jan 18 22:54:32 	php: rc.newwanip: pfSense package system has detected an ip change 0.0.0.0 -> 10.49.174.1 ... Restarting packages.
Jan 18 22:54:33 	check_reload_status: Starting packages
Jan 18 22:54:40 	php: rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use OPT1GW.
Jan 18 22:54:45 	php: rc.start_packages: Restarting/Starting all packages.
Jan 18 22:54:53 	check_reload_status: Reloading filter

Gateway Log:

Jan 18 22:54:19 	apinger: alarm canceled: OPT1GW(8.8.4.4) *** OPT1GWdown ***
Jan 18 22:54:25 	apinger: SIGHUP received, reloading configuration.

/tmp/rules.debug has quickly changed back to:

GWNcellPriority = "  route-to { ( vr2 10.49.174.250 )  }  "

checkip has changed back to OPT1 public IP - good
Internet browsing has failed back to OPT1 inside 1 minute - good

There are log meesages in 2.1 from rc.linkup and rc.newwanip that do good stuff for failover - none of that appears in the 2.1.1 logs. 2.1.1 has the apinger exit status 255 thing.

What happened in 2.1.1? Is it something in pfsense-tools that was changed recently and intended for 2.2?

phil.davis

There have been some changes to check_reload_status in pfsense-tools. Maybe they will help the failover processing?
I don't have a suitable multi-WAN test environment at the moment (I am traveling). But On Sunday I will be able to do real testing. So please let me know if there is any change worth testing and I can get the latest snapshot on Saturday night or Sunday and test it.

jimp

Yes, the snapshots from Thursday afternoon or later should have the check_reload_status fixes which hopefully will address the behavior some are seeing in this and other similar threads. If someone can upgrade and test again it would be appreciated.

athurdent

My test KVM still has that error when a gateway goes down, so I didn't try on my "production" machine because of that:

Jan 25 07:59:00 pfsense-kvm.local-lan apinger: Error while starting command form alarm(down) on target(192.168.xxx.254-WANGW)
Jan 25 07:59:00 pfsense-kvm.local-lan apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns WANGW' -c 'service reload ipsecdns' -c 'service reload openvpn WANGW' -c 'filter reload' ) exited with status: 255

athurdent

Tried the latest snapshot ( Sat Jan 25 10:43:17 EST 2014 ) and the problem is still there.

phil.davis

2.1.1-PRERELEASE (i386)
built on Sat Jan 25 10:00:56 EST 2014
FreeBSD 8.3-RELEASE-p14

Test Alix 2D13 with its WAN connected to my home LAN. The test WAN gets DHCP from home LAN. WAN gateway monitor IP is set to an external IP on the real internet. Things are running nicely.
I pull the home WAN as a test - as expected the test WAN gateway goes to "pending" and eventually "offline" as apinger fails to get ping response from the monitor IP. So apinger takes some action, but this still appears in the gateway log:

Jan 26 22:30:47 	apinger: ALARM: WAN_DHCP(216.146.35.35) *** WAN_DHCPdown ***
Jan 26 22:30:57 	apinger: Error while starting command form alarm(WAN_DHCPdown) on target(216.146.35.35-WAN_DHCP)
Jan 26 22:30:57 	apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns WAN_DHCP' -c 'service reload ipsecdns' -c 'service reload openvpn WAN_DHCP' -c 'filter reload' ) exited with status: 255

Then I connect home WAN again (so I can post this :) ). apinger on test pfSense starts getting ping response from the monitor IP - good. It attempts to take action when it decides the WAN gateway is up, and this in the gateway log:

Jan 26 22:34:00 	apinger: alarm canceled: WAN_DHCP(216.146.35.35) *** WAN_DHCPdown ***
Jan 26 22:34:10 	apinger: Error while starting command form alarm(WAN_DHCPdown) on target(216.146.35.35-WAN_DHCP)
Jan 26 22:34:10 	apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns WAN_DHCP' -c 'service reload ipsecdns' -c 'service reload openvpn WAN_DHCP' -c 'filter reload' ) exited with status: 255

So the "exited with status: 255" thing happens for all gateway transitions - no need for complicated multi-WAN failover scenarios to see this error message.

jimp

Finally locked in a fix for this. New snapshots building now should be OK. It'll be a few hours before they upload.

athurdent

@jimp:

Finally locked in a fix for this. New snapshots building now should be OK. It'll be a few hours before they upload.

Great thanks, no more error messages.

2 questions for things I ran into while testing failover with that new version:

Are states still flushed when a gateway goes down and the option "State Killing on Gateway Failure" is not checked? I have default gateway switching on.
I started a ping to an external host, pulled my primary gateway and the ping stopped, never recovered. Shouldn't it recover and automatically use my second gateway?

Is there a configure option to switch of those
MONITOR: <gateway>is down, removing from routing group <group>messages? They keep on coming and filling my mailbox as long as one gateway is down, but one message per gateway group would really be enough. And even worse is that one never gets a mail when the gateway has recovered, so this whole feature is a little useless (at least for me) and I have to rely on my monitoring system anyway.</group></gateway>

athurdent

Regarding state flushing, it seems that at some point all the states are flushed, not just those of the gateway going down.
I would suggest to modify the anti-lockout rule like this, using the "no state" feature of pf:

# make sure the user cannot lock himself out of the webConfigurator or SSH
pass in  quick on {$lanif} proto tcp from any to ({$lanif}) port { {$alports} } no state label "anti-lockout rule"
pass out quick on {$lanif} proto tcp from ({$lanif}) port { {$alports} } to any no state label "anti-lockout rule"

You can keep on using your ssh/gui session that way even if all states get flushed. Also the gui feature "Reset States" would profit from that. You could get rid of:
"NOTE: If you reset the firewall state table, the browser session may appear to be hung after clicking "Reset". Simply refresh the page to continue."

doktornotor

@jimp:

Finally locked in a fix for this. New snapshots building now should be OK. It'll be a few hours before they upload.

Confirmed fixed, finally… Yay! 8)

phil.davis

2.1.1-PRERELEASE (i386)
built on Sun Feb 2 12:42:30 EST 2014
FreeBSD 8.3-RELEASE-p14

All is well. I tried pulling out the phone line on the ADSL (default gateway), the dynamic DNS name, OpenVPN road warrior server and 2 OpenVPN site-2-site clients all switched to using OPT1. Plugged in the phone line again, ADSL negotiated, apinger detected the WAN online again, everything failed back.
Pulled the cable on OPT1 (which had general internet traffic directed to it as tier1 of a gateway group). General browsing failed over to WAN. OpenVPN server and clients remained running untouched (as they should because they were already on WAN - so no need to restart them).
During messing about, ADSL WAN went down by itself - had some minutes with both WAN and OPT1 down, and things recovered fine from that as links became available again. Always good to have the ISP give you a real test ;)
Gateways log tab has nice clean entries like this:

Feb 3 10:24:32 	apinger: Starting Alarm Pinger, apinger(28478)
Feb 3 10:30:25 	apinger: SIGHUP received, reloading configuration.
Feb 3 10:52:28 	apinger: ALARM: WANGW(8.8.8.8) *** down ***
Feb 3 11:00:52 	apinger: alarm canceled: WANGW(8.8.8.8) *** down ***
Feb 3 11:08:57 	apinger: ALARM: WANGW(8.8.8.8) *** down ***
Feb 3 11:09:10 	apinger: ALARM: OPT1GW(8.8.4.4) *** OPT1GWdown ***
Feb 3 11:10:49 	apinger: SIGHUP received, reloading configuration.
Feb 3 11:12:59 	apinger: alarm canceled: WANGW(8.8.8.8) *** down ***
Feb 3 11:20:12 	apinger: SIGHUP received, reloading configuration.
Feb 3 11:20:23 	apinger: alarm canceled: OPT1GW(8.8.4.4) *** OPT1GWdown ***