WAN Not Recovering with Multiple Gateways
-
I have an SG-2440 running 2.3.3-p1 which is not automatically recovering the WAN link when upstream goes down periodically. The pfSense appliance is set up with en0 as WAN, en2-3 as LACP to an SG300 switch in L3 Mode, and en1 is unused. I run multiple VLANs on the SG300, and use a transport VLAN for moving data between pfSense and the switch. As such, I have a gateway IP for the switch and static routes to each VLAN (each is associated with a private /24 address block) set up within pfSense to route traffic appropriately. All VLANs are managed on the switch, not pfSense.
It appears that when the WAN link goes down upstream, pfSense does not try to recover the link because it sees another gateway still available (the SG300 switch). The fastest recovery methods I have are to ask someone onsite to pull the en0,2,3 connections for a few seconds or reboot pfSense (if they have admin access).
For self-recovery to work as expected, I'm thinking I need to tell pfSense to not treat the SG300 gateway IP as a WAN-facing link, but I'm not sure what the appropriate setting is. The Gateway Monitoring / Gateway Action options don't appear to do what I'm looking for, assuming I'm on the right track here.
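As an illustration of the routing layout described in this post, here is a minimal Python sketch of a longest-prefix route lookup. The switch address and the two VLAN subnets are hypothetical, since the post does not list the real ones. It only shows how per-VLAN /24 static routes and the WAN default route would be selected; it is not a model of how pfSense handles gateway failure.

```python
import ipaddress

# Hypothetical values: the post does not give the real VLAN subnets
# or the switch's transport-VLAN address.
SWITCH_GW = "10.0.0.2"   # assumed SG300 IP on the transport VLAN
WAN_GW = "WAN_DHCP"      # default gateway learned via DHCP on WAN

ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), WAN_GW),            # default route
    (ipaddress.ip_network("192.168.10.0/24"), SWITCH_GW),   # assumed VLAN
    (ipaddress.ip_network("192.168.20.0/24"), SWITCH_GW),   # assumed VLAN
]

def next_hop(dest):
    """Return the gateway of the most specific route that contains dest."""
    addr = ipaddress.ip_address(dest)
    matches = [(net, gw) for net, gw in ROUTES if addr in net]
    # The longest prefix wins; the default route (prefix length 0) is the fallback.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(next_hop("192.168.10.25"))  # an address inside an assumed VLAN
print(next_hop("8.8.8.8"))        # an Internet address
```

In this sketch only destinations inside the listed /24 blocks resolve to the switch gateway; everything else falls through to the WAN default route.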
-
Tried working on this issue some more, but haven't yet found a way to improve recovery time after the ISP connection comes back up. I'm no longer sure this has to do with having two gateways set up. Right now I have gateway monitoring disabled on the IP that points to the SG300 switch. I had also set the WAN interface (igb0) to reject leases from the modem's IP address of 192.168.100.1. Neither adjustment seemed to change the time to recover after the last two ISP outages. I still had to wait an hour or more after the modem indicated link recovery before pfSense was able to pass traffic to it again.
During the outage prior to those, using the same pfSense config, removing and reconnecting the WAN network cable after the modem indicated link recovery restored connectivity immediately. No unplug cycle on the switch-facing links was necessary. My guess is that, when I don't intervene after an ISP outage, the extended delay is the DHCP lease timer counting down to its renewal point, at which pfSense renews the lease and recovers connectivity on its own. What I get from logs during these outages is:
Apr 20 07:13:17 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:18 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:18 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:19 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:19 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:20 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:20 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:21 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:21 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:22 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:22 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:23 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:23 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:24 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:24 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:25 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:25 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:26 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:26 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:27 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:27 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:28 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:28 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:29 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:29 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:30 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:30 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:31 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:31 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:32 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:32 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:33 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:33 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:34 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:34 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:35 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:35 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:36 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:36 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:37 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:37 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:38 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:38 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:39 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:39 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:40 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:40 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:41 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:43 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 184.88.32.1 bind_addr 184.88.44.86 identifier "WAN_DHCP "
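For reading the log above: pfSense is FreeBSD-based, and on FreeBSD errno 64 is EHOSTDOWN ("Host is down") and 65 is EHOSTUNREACH ("No route to host"). The following is a minimal Python sketch for tallying these errors from saved log text. The regular expression is an assumption based only on the line format shown here, and the errno table is hardcoded because Python's errno module reflects whatever OS the script runs on.

```python
import re
from collections import Counter

# FreeBSD errno values for the two codes seen in the log. These are
# hardcoded because the errno module on Linux uses different numbers.
FREEBSD_ERRNO = {
    64: "EHOSTDOWN (Host is down)",
    65: "EHOSTUNREACH (No route to host)",
}

# Assumed line format, taken from the log excerpt in this post.
LINE_RE = re.compile(r"dpinger (\S+) (\S+): sendto error: (\d+)")

def summarize(log_text):
    """Count sendto errors per (gateway name, errno) in dpinger log text."""
    counts = Counter()
    for gateway, _addr, code in LINE_RE.findall(log_text):
        counts[(gateway, int(code))] += 1
    return counts

sample = """\
Apr 20 07:13:31 dpinger WAN_DHCP 184.88.32.1: sendto error: 64
Apr 20 07:13:32 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
Apr 20 07:13:32 dpinger WAN_DHCP 184.88.32.1: sendto error: 65
"""

for (gateway, code), n in sorted(summarize(sample).items()):
    print(gateway, code, FREEBSD_ERRNO.get(code, "unknown"), n)
```

The switch from 64 to 65 at 07:13:32 means the kernel went from reporting the gateway host as down to reporting no route to it at all.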
Assuming dpinger should be the agent triggering recovery actions, if it doesn't know how to handle this kind of outage on its own, I might end up just implementing a less-than-ideal cron script to check a few IPs periodically and cycle the interface if none reply. Not a good solution, but it's all I can think to do at the moment.
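A rough sketch of what such a watchdog could look like in Python 3 follows. Several things here are assumptions rather than anything confirmed in this thread: that Python 3 is available on the appliance, the probe targets, the FreeBSD-style ping flags (-t is a timeout in seconds on FreeBSD), and above all that an ifconfig down/up on the WAN interface has the same effect as reseating the cable.

```python
import subprocess
import sys
import time

# Assumptions, not taken from the original post: the probe targets and
# the idea that cycling the NIC with ifconfig mimics a cable reseat.
TARGETS = ["8.8.8.8", "1.1.1.1"]
WAN_IF = "igb0"

def is_reachable(host):
    """Send one ICMP echo with a 2 second timeout (FreeBSD ping flags)."""
    try:
        rc = subprocess.call(["ping", "-c", "1", "-t", "2", host],
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
    except OSError:
        return False  # ping binary missing or not executable
    return rc == 0

def wan_is_down(targets, probe=is_reachable):
    """Treat the WAN as down only if every target fails to reply."""
    return not any(probe(t) for t in targets)

def cycle_interface(iface, run=subprocess.call, pause=5):
    """Bring the interface down, wait, then bring it back up."""
    run(["ifconfig", iface, "down"])
    time.sleep(pause)
    run(["ifconfig", iface, "up"])

# Nothing is probed or changed unless --apply is passed explicitly.
if __name__ == "__main__" and "--apply" in sys.argv:
    if wan_is_down(TARGETS):
        cycle_interface(WAN_IF)
```

Run from cron every few minutes with --apply, this only cycles the interface when none of the targets answer, so a single unreachable host does not trigger a reset. The probe and run parameters exist so the decision logic can be exercised without touching the network.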