After CARP failover, packets go out the wrong WAN
-
Hello,
This question has been asked before, e.g. here, but I could not find any definitive answers. Does the combination of CARP (on the LAN), Multi-WAN, and policy routing to non-default WAN gateways really work?
In my situation, I have two identical firewalls running 2.5.2. All interfaces are configured identically with regard to their order, pfSense name, OS name, CARP vhid, appearance on Status/Interfaces, and everything else I can think of to compare. There are ~10 LANs (VLANs on a lagg) and two WANs (base interface and one VLAN on the same interface) going to different ISPs via a switch. SYNC is a dedicated interface on each firewall.
The LANs use CARP to present a VIP that is the gateway used by the systems on these networks; the WANs use CARP to have a common address that is used in the outgoing NAT rules.
CARP works correctly on all interfaces, pfsync works correctly, and so does config (XMLRPC) sync.
There are two gateway groups that include the two WANs in either order (group 1 is WAN1 as tier 1 and WAN2 as tier 2, group 2 is the opposite) so I can select which WAN is used by preference and let it fail over to the other one if either goes down. Default gateway on both firewalls is group 1.
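To make the intended behavior of the two groups concrete, here is a minimal Python sketch of tier-based selection. It is only a model of what I expect a gateway group to do, not pfSense code; `Member` and `pick_gateway` are made-up names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    gateway: str  # gateway name, e.g. "WAN1"
    tier: int     # lower tier number = higher preference


def pick_gateway(group, up):
    """Return the preferred live gateway of a group, or None if all are down."""
    live = [m for m in group if m.gateway in up]
    if not live:
        return None
    # The member with the lowest tier number among the live ones wins.
    return min(live, key=lambda m: m.tier).gateway


# Group 1 prefers WAN1, group 2 prefers WAN2 (as described above).
GROUP1 = [Member("WAN1", 1), Member("WAN2", 2)]
GROUP2 = [Member("WAN2", 1), Member("WAN1", 2)]

print(pick_gateway(GROUP2, {"WAN1", "WAN2"}))  # both up: WAN2
print(pick_gateway(GROUP2, {"WAN1"}))          # WAN2 down: falls back to WAN1
```

With both WANs up, a rule using group 2 should therefore always leave via WAN2, no matter which firewall is CARP master.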
For simplicity, I will consider only one filter rule for outgoing ping that has its gateway set to group 2, i.e. it sends via WAN2 by preference. This rule works correctly while the primary firewall is CARP master on the LAN. As soon as the standby takes over (after disabling CARP on the primary), however, it starts sending out packets on WAN1 with the source address NATed to the CARP address for WAN2. This obviously does not work. Clearing states on the former standby lets the pinging continue, but breaks all other connections.
If I change the rule to use gateway group 1, where the policy route points at WAN1 (which is the default gateway anyway), I lose a ping or two across the failover, but then everything continues to work.
To me this looks like the standby uses the established state to pass my pings (as it should) but is not aware of the policy routing. I have found the following statement:
"With Multi-WAN a firewall rule must be in place to pass traffic to local networks using the default gateway. Otherwise, when traffic attempts to reach the CARP address or from LAN to DMZ it will instead go out a WAN connection."
However, I'm not sure that it applies in this situation, nor even what it means. My LANs are separate for a reason; I cannot just put in a rule that passes everything everywhere, and I'm not clear on whether multiple rules passing some packets will work the same as a single rule passing everything.
I have the suspicion that this (policy routing to non-default WAN gateways after CARP failover) actually cannot work because pfsync does not sync the route-to destination. There is an open FreeBSD bug on this issue with no activity since 2019. Looking through the diff between pfSense's and FreeBSD's kernel I also do not see a change related to this. Of course, I may be wrong.
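A toy Python model of what I suspect is happening, in case it helps the discussion. This is not how pf or pfsync are implemented; the `State` fields, `sync_without_route_to`, and `egress` are all hypothetical and only encode my assumption that the NAT translation reaches the standby while the route-to target does not.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_ROUTE = "WAN1"  # default gateway on both firewalls (group 1 prefers WAN1)


@dataclass(frozen=True)
class State:
    nat_source: str          # translated source address, e.g. the WAN2 CARP VIP
    route_to: Optional[str]  # policy-routing target; None if unknown


def sync_without_route_to(state):
    """Assumed sync behavior: NAT info is copied, the route-to target is lost."""
    return State(nat_source=state.nat_source, route_to=None)


def egress(state):
    """Return (outgoing WAN, source address) for a packet matching the state."""
    return (state.route_to or DEFAULT_ROUTE, state.nat_source)


primary = State(nat_source="WAN2-VIP", route_to="WAN2")
standby = sync_without_route_to(primary)

print(egress(primary))  # ('WAN2', 'WAN2-VIP')
print(egress(standby))  # ('WAN1', 'WAN2-VIP')
```

The second result is exactly the symptom: WAN2's CARP address as source, sent out via WAN1. It also fits the observation that nothing breaks when the policy route matches the default route, because then falling back to the default changes nothing.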
Thanks for any help. If more information is needed, I will do my best to provide it.
-
This is getting ever more mysterious. I replaced the 2.5.2 firewalls with 2.6.0 because I saw that 2.6.0 now has a gateway field in its states (visible in the output of pfctl -vvss). As expected, the filter rule I mentioned above now creates states with the gateway set to the ISP router on my WAN2.
When the CARP failover happens, the new CARP master still behaves the same as 2.5.2 did: It NATs the source address to the WAN2 CARP VIP, then sends the packet via its default route, the WAN1 interface to the WAN1 ISP router. The result of changing the policy route so it matches the default route is also still the same: It keeps working.
I can see the new field in the state table entries, and it is probably there for a reason, so there must be something wrong with my setup.
I have rechecked that the interfaces are configured identically, and they are. The differences between the output of ifconfig on both firewalls are limited to the MAC addresses, IPv6 link-local addresses, non-CARP addresses, and the CARP state and advskew values. In config.xml, the only differences in the <interfaces> section are in <ipaddr>s. Otherwise, every line is the same, and in the same order.
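For anyone who wants to repeat this kind of comparison, a rough Python sketch of the idea: mask the fields that are expected to differ, then list whatever still differs. The sample text below is invented and only loosely shaped like ifconfig output; the patterns are my assumptions and would need adjusting for real output.

```python
import difflib
import re

# Fields that are expected to differ between the two firewalls.
MASKS = [
    (re.compile(r"inet6 fe80::\S+"), "inet6 <link-local>"),
    (re.compile(r"\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b", re.I), "<mac>"),
    (re.compile(r"\b(?:MASTER|BACKUP)\b"), "<carp-state>"),
    (re.compile(r"advskew \d+"), "advskew <n>"),
]


def normalize(text):
    """Strip indentation and replace expected-to-differ fields with placeholders."""
    lines = []
    for line in text.splitlines():
        for pattern, placeholder in MASKS:
            line = pattern.sub(placeholder, line)
        lines.append(line.strip())
    return lines


def remaining_diff(a, b):
    """Return the lines that still differ after masking, prefixed with - or +."""
    left, right = normalize(a), normalize(b)
    out = []
    matcher = difflib.SequenceMatcher(None, left, right)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            out += ["-" + l for l in left[i1:i2]]
            out += ["+" + l for l in right[j1:j2]]
    return out


# Invented sample data for illustration only.
PRIMARY = """em0: flags=8943<UP,BROADCAST,RUNNING> metric 0 mtu 1500
\tether 00:0c:29:aa:bb:01
\tinet6 fe80::20c:29ff:feaa:bb01%em0 prefixlen 64 scopeid 0x1
\tinet 192.0.2.2 netmask 0xffffff00 broadcast 192.0.2.255
\tinet 192.0.2.1 netmask 0xffffff00 broadcast 192.0.2.255 vhid 1
\tcarp: MASTER vhid 1 advbase 1 advskew 0"""

STANDBY = """em0: flags=8943<UP,BROADCAST,RUNNING> metric 0 mtu 1500
\tether 00:0c:29:cc:dd:02
\tinet6 fe80::20c:29ff:fecc:dd02%em0 prefixlen 64 scopeid 0x1
\tinet 192.0.2.3 netmask 0xffffff00 broadcast 192.0.2.255
\tinet 192.0.2.1 netmask 0xffffff00 broadcast 192.0.2.255 vhid 1
\tcarp: BACKUP vhid 1 advbase 1 advskew 100"""

for line in remaining_diff(PRIMARY, STANDBY):
    print(line)
```

In this sample only the non-CARP address is left over, which is the kind of difference I would expect between two members of a CARP pair.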
The two WAN links used to be on the base interface (untagged) and a VLAN on the same physical interface. To ensure that the VLAN is not interfering (and to avoid possibly misleading output from tcpdump) I moved it to a separate physical interface. No change.
I also tried a TCP connection in addition to pinging, and it also behaves the same.
-
OK, I give up.
I have set up a minimal (virtual) test lab with one "local" host, two local router/firewalls, two ISP routers and one "Internet" host, with CARP on the local network and the two WANs. Everything runs pfSense 2.6.0, and the results are exactly the same as in my production setup.
My conclusion is that the combination of multiple firewalls, multiple WANs, and policy routing to something that is not the default gateway on the firewalls does not work.
I'll be happy to provide details if someone would like to prove me wrong.
-
Since you can reproduce it, I suggest opening a report at redmine.pfsense.org (assuming there isn't one already) and then referencing this thread for discussion. I haven't used multi-WAN and CARP together, sorry.
-
https://redmine.pfsense.org/issues/10513 (from 2020) looks like it is about the same behavior, plus another problem that I have not seen, but have not looked for either.
-
@chrullrich Any new info since your last post? I'm following this topic closely.
-
@luckman212 'fraid not. As far as I can tell the functionality is missing from FreeBSD and there is little I can do about that.
-
@chrullrich I replaced the pfSense 2.6 "local router/firewall"s in my test setup with OPNsense 22.1 (this is FreeBSD 13.0 instead of pfSense 2.6's 12.3) to get a second opinion. The behavior is the same: As soon as the CARP failover happens, everything sent towards the "Internet" goes out the default route with the NATed source address appropriate for the policy route.
When I tried it the first time today I thought I saw ping (and only ping) work correctly, but now I cannot reproduce it. I probably just saw what I wanted to see.