BUG - C2758 10GBE during WAN failover erroneously prefers LAN 10GBE gateway
-
I have a customer running pfSense 2.3 on 2 x C2758 firewalls with 10 GBE and WAN failover.
The observed problem
Last night, during failover testing, the master firewall's web GUI was very slow to respond and I observed the system log reporting LAN_L3_SW selected as default gateway! The firewall could not ping anything on the internet despite WAN2GW being alive and well.
The IPsec VPNs bound to the WANgroup interface and using DDNS didn't come back up.
To trigger this, I unplugged the WANGW device (not the actual WAN interface on the firewall). Thus no CARP failover occurred, but WANGW was quickly and correctly determined to be offline.
Interfaces
- LAN is interface clx1 (10 GBE)
- WAN is interface igb0 (1 GBE) and a member of gateway group WANgroup with priority Tier 1
- WAN2 is interface igb1 (1 GBE) and a member of gateway group WANgroup with priority Tier 2
Gateways configured
- WANGW (correctly set up on interface WAN as the IPv4 upstream gateway) and thus reachable via interface igb0
- WAN2GW (correctly set up on interface WAN2 as the IPv4 upstream gateway) and thus reachable via interface igb1
- LAN_L3_SW: a LAN Layer 3 switch, used for static routes to a bunch of RFC1918 subnets that the switch manages and routes, and thus reachable via clx1 (the 10 GBE interface). The LAN interface does NOT have an "IPv4 Upstream gateway" selected in the interface setup screen.
The suspected bug
It would appear that the routing table favours 10 GBE interfaces and thus erroneously selects LAN_L3_SW as the default gateway for the entire firewall.
-
-
Do you have 'enable default gateway switching' checked under Advanced/Misc? I would leave this unchecked and specify one of the system DNS servers to use WAN2.
-
Yes, gateway switching is UNCHECKED, with 2 x DNS servers set for WAN and 2 x DNS servers set for WAN2.
I really think the routing table has a serious bug and prefers gateways on 10 GBE interfaces over those on 1 GBE interfaces.
-
If I'm right, this may well be a bug that has existed for a while, and it's just my scenario that makes it show up.
Consider:
- LAN using 10 GBE and WAN using 10 GBE: no problem with WAN failover, as all WAN gateways are on the higher-preference 10 GBE anyway
- LAN using 1 GBE and WAN using 10 GBE: no problem with WAN failover, as all WAN gateways are on the higher-preference 10 GBE anyway
- LAN using 10 GBE and WAN using 1 GBE: problem! Gateway selection prefers to use a gateway on a 10 GBE interface
Thus many installations with 10 GBE would not notice this unless their WAN interfaces are using 1 GBE.
-
-
Failover has no consideration at all for the speed of the interface. It strictly goes according to your defined tiers. Traffic matching a rule specifying a gateway group will never use a gateway that isn't chosen in that group. From the sounds of it, this is some kind of config issue, probably traffic not actually matching a rule with the gateway group.
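To make the tier rule concrete, here is a minimal Python sketch of how a gateway group could pick its active members strictly by tier. It is a simplified model for illustration, not pfSense's actual code; the `Member` class and `active_members` function are hypothetical names, and the only inputs taken from this thread are the gateway names and their tiers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Member:
    name: str
    tier: int     # 1 is the most preferred tier
    online: bool  # result of gateway monitoring (assumed input)


def active_members(group: List[Member]) -> List[str]:
    """Return the online members of the lowest-numbered tier that has any.

    Interface speed is not an input at all: only tier and online status
    are considered, and gateways outside the group are never candidates.
    """
    online = [m for m in group if m.online]
    if not online:
        return []
    best_tier = min(m.tier for m in online)
    return [m.name for m in online if m.tier == best_tier]


# WANgroup from this thread, with WANGW unplugged
wangroup = [Member("WANGW", 1, False), Member("WAN2GW", 2, True)]
print(active_members(wangroup))  # ['WAN2GW']
```

Note that LAN_L3_SW cannot appear in the result because it is not a member of the group, which is why policy-routed traffic should never land on it.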
-
Yes, you might well think that, but, I'm fairly sure it's not a config issue.
The firewall selected LAN_L3_SW as the default gateway, as stated in the firewall system log, which is nuts, because that gateway is attached to the LAN, and the LAN is correctly set as NOT having an IPv4 upstream gateway.
Furthermore, the firewall could not ping anything on the internet despite WAN2GW being up.
This should be fairly easy for someone else to replicate and confirm.
Another site for the same customer uses clustered C2758s with WAN and WAN2, but there the optional interface in the C2758 is populated with 4 x 1 GBE interfaces (igbx's). That site works flawlessly, so I can look at its config and check screen by screen that the outbound NAT policy etc. is correct.
-
That is also to say, after unplugging the WANGW device, the firewall could not ping anything on the internet from the console.
-
Traffic from the host itself doesn't follow your gateway groups. If you have default gateway switching enabled, it switches the default to the next gateway in the list, in top to bottom order.
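As a rough illustration of that "top to bottom" rule, the Python sketch below picks the first online gateway in list order. It is only a model of the behaviour described in this post, not pfSense's implementation; `pick_default` is a hypothetical name, and the list order shown is an assumption chosen to show how a LAN-side gateway could end up as the default.

```python
from typing import List, Optional, Tuple


def pick_default(gateways: List[Tuple[str, bool]]) -> Optional[str]:
    """Return the first online gateway in list order, or None if all are down.

    Each entry is (name, online). Nothing here knows whether a gateway
    is internet facing; list position is the only criterion.
    """
    for name, online in gateways:
        if online:
            return name
    return None


# ASSUMED list order, not taken from the original poster's config
gateways = [("WANGW", False), ("LAN_L3_SW", True), ("WAN2GW", True)]
print(pick_default(gateways))  # LAN_L3_SW
```

With that assumed ordering, the firewall's own traffic would be pointed at the Layer 3 switch even though WAN2GW is up, which would match the symptoms reported above.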
-
@cmb:
Traffic from the host itself doesn't follow your gateway groups. If you have default gateway switching enabled, it switches the default to the next gateway in the list, in top to bottom order.
When you say "in top to bottom order" are you talking alphabetically or what? Because AFAICT there is no way to re-order them in the list.
I've had this problem/question myself for a while. Seems there should be a checkbox on the GW config page that says "Skip this gateway during failover events" or something to that effect. Because there is nothing in the GUI that defines whether a particular GW is "internet facing" or not. And we have seen that when pfSense itself has no internet connectivity, the GUI can become extremely slow or unresponsive.
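To show what I mean by that checkbox, here is a small Python sketch. The `skip_on_failover` flag is purely hypothetical (it is the option I am proposing, not something that exists in the GUI), and the list order is an assumption for the sake of the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Gateway:
    name: str
    online: bool
    skip_on_failover: bool = False  # hypothetical flag, not a real pfSense setting


def failover_default(gateways: Iterable[Gateway]) -> Optional[str]:
    """Return the first online gateway that is not flagged to be skipped."""
    for gw in gateways:
        if gw.online and not gw.skip_on_failover:
            return gw.name
    return None


# Assumed list order; LAN_L3_SW is marked as not internet facing
gateways = [
    Gateway("WANGW", online=False),
    Gateway("LAN_L3_SW", online=True, skip_on_failover=True),
    Gateway("WAN2GW", online=True),
]
print(failover_default(gateways))  # WAN2GW
```

With the flag set on LAN_L3_SW, the selection falls through to WAN2GW instead of the internal switch.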