Change pfsense to silently drop resets for traffic destined to known VPN dest
nzkiwi68 last edited by
Consider two pairs of clustered firewalls: the site1 location has 2 firewalls, site1a and site1b, and the remote site2 has 2 firewalls, site2a and site2b.
All firewalls have IPSEC bound to a WANgroup with WAN and WAN2 for WAN failover. Imagine the subnet x.x.x.x/24 is the LAN subnet at site2, reachable from site1 via an IPSEC tunnel.
The situation is this: if a failover of virtually any description occurs at the site1 end (WAN fails over to WAN2 at site1, firewall site1a fails and firewall site1b takes over, etc.), RESETS and, just as bad, "destination host unreachable" messages get sent to the site1 clients by the firewall in control of the site1 LAN CARP address whilst the IPSEC tunnel is being established.
Let's say for example, that at site1, the main firewall site1a suffers a horrible hardware failure and the clustered HA firewall site1b must take over.
The problem is that during the time that CARP takes over and IPSEC comes up on the HA firewall site1b, and the tunnels are re-established, any traffic bound for tunnel endpoints gets sent back a RESET from firewall site1b.
This causes lots of problems:
TELNET sessions from clients at site1 to servers at site2 get disconnected and reset during the brief outage if they send any traffic at all to the remote site2 server, because firewall site1b sends RESETS back to the site1 client.
Firewall 2 at site1, firewall site1b, seems to keep failures in its local state table. For example, you are pinging a remote site2 host x.x.x.1. As the failover starts, you get back "Request timed out" (normal, of course). Sometimes you get back "Destination host unreachable", which is bad. When that happens, even once the IPSEC tunnel comes up on firewall site1b, you still cannot ping x.x.x.1. You KNOW the tunnel is up, because you can see it connected in the GUI status section on firewall site1b. Further, you can actually ping a different IP address down the tunnel, x.x.x.2, from the same client, but you cannot ping x.x.x.1 for up to 3 minutes, until, I assume, some cached state in firewall site1b expires.
Sometimes, even though the client at site1 only got "Request timed out", you still cannot ping x.x.x.1 once the IPSEC tunnel comes up on the firewall. It can take anywhere from a few seconds to several minutes for x.x.x.1 to start answering pings.
Essentially it ruins the lovely HA shared state table, because the RESETS get back to the client and the firewall taking over caches failures, so the failover becomes far from seamless.
The faster the takeover, the less this is a problem. 2 missed pings during failover seems to result in 0 or only 1 or 2 telnet sessions that disconnect because they were unlucky enough to be trying to send data over that brief period. A longer outage of, say, 10 pings on a busy VPN is carnage: most TCP state is lost and a fair portion of the telnet sessions can't reconnect for many minutes even though the tunnel has come up on the second firewall, I assume because of the failure state caching.
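For anyone wanting to put numbers on a failover test, here is a minimal Python sketch that tallies captured ping output into replies, timeouts and unreachables. It assumes Windows-style ping wording, as quoted above; the function name and the sample addresses are made up for illustration.

```python
from collections import Counter

def tally_ping_lines(lines):
    """Count replies, timeouts and unreachables in Windows-style ping output."""
    counts = Counter()
    for line in lines:
        text = line.strip().lower()
        # Check "unreachable" first: those lines also start with "Reply from".
        if "destination host unreachable" in text:
            counts["unreachable"] += 1
        elif "request timed out" in text:
            counts["timeout"] += 1
        elif text.startswith("reply from"):
            counts["reply"] += 1
    return counts

# Hypothetical capture taken during a failover (addresses are placeholders).
sample = [
    "Reply from 10.2.0.1: bytes=32 time=12ms TTL=63",
    "Request timed out.",
    "Reply from 192.168.1.1: Destination host unreachable.",
    "Request timed out.",
]
print(dict(tally_ping_lines(sample)))
```

A run with only timeouts would be the "normal" case described above, while any "unreachable" count above zero marks the case that seems to leave x.x.x.1 unpingable afterwards.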
I've used a lot of other commercial firewall products, and the ones I'm thinking of don't seem to have this issue. With one firewall product in particular, I've never seen a "Destination Host Unreachable" reply for traffic from site1 destined for site2 over the VPN whilst the VPN was down. It seems that they cleverly never send back RESETS but simply silently drop the traffic if it was destined for a VPN endpoint.
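To make that behaviour concrete, here is a rough Python sketch of the decision as I understand it. This is only a model of the policy, not how any particular firewall or pfsense implements it; the function name, the return values and the 10.2.0.0/24 network (standing in for x.x.x.x/24) are all made up for illustration.

```python
from ipaddress import ip_address, ip_network

# Hypothetical phase 2 remote networks; 10.2.0.0/24 stands in for the
# site2 LAN subnet x.x.x.x/24 from the post.
PHASE2_REMOTE_NETS = [ip_network("10.2.0.0/24")]

def action_for(dst, tunnel_up, remote_nets=PHASE2_REMOTE_NETS):
    """Return 'tunnel', 'drop' or 'default' for a packet leaving the site1 LAN."""
    addr = ip_address(dst)
    is_vpn_dest = any(addr in net for net in remote_nets)
    if not is_vpn_dest:
        return "default"  # not VPN traffic: the normal rules apply
    if tunnel_up:
        return "tunnel"   # tunnel is up: forward over IPSEC as usual
    # Tunnel is down: discard silently, never answer with a RESET or
    # "destination host unreachable".
    return "drop"

print(action_for("10.2.0.1", tunnel_up=False))
print(action_for("10.2.0.1", tunnel_up=True))
print(action_for("192.0.2.9", tunnel_up=False))
```

The point of the sketch is that the client never hears anything back while the tunnel is down, so its TCP sessions just retransmit until the tunnel returns.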
Possibly part of the problem?
Ensure that HA firewalls never state sync failed states for phase 2 source network to phase 2 destination network
Feature request - the Solution
I believe pfsense should:
On tunnel up, flush any failed states to the known VPN tunnel phase 2 endpoints
Ideally, NEVER send back RESETS for traffic that is destined to a VPN endpoint, ever, regardless of the tunnel up or down
If those 2 things were implemented, failover would become a lot more seamless and TCP sessions would not be lost.
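As a rough illustration of the first request (flushing on tunnel up), here is a small Python model. It assumes the firewall really does cache some kind of failure entry per destination, which is only my guess from the symptoms; the `Entry` class, its `failed` flag and the addresses are hypothetical and not pfsense internals.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

@dataclass
class Entry:
    src: str
    dst: str
    failed: bool  # True if this entry records a failure (assumed to exist)

def flush_failed_on_tunnel_up(table, remote_net):
    """Return the table without failed entries whose destination is in remote_net."""
    net = ip_network(remote_net)
    return [e for e in table if not (e.failed and ip_address(e.dst) in net)]

# Hypothetical table on site1b just after the tunnel comes up.
table = [
    Entry("192.168.1.10", "10.2.0.1", failed=True),    # cached failure to site2
    Entry("192.168.1.10", "10.2.0.2", failed=False),   # healthy state to site2
    Entry("192.168.1.10", "203.0.113.5", failed=True), # failure unrelated to the VPN
]
remaining = flush_failed_on_tunnel_up(table, "10.2.0.0/24")
print([e.dst for e in remaining])
```

Only the failed entry towards the phase 2 destination network is removed; healthy states and anything unrelated to the VPN are left alone.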
nzkiwi68 last edited by
No comments at all?
NOBODY has ever met this issue, seeing TCP resets and lost state during a failover?