MultiWAN failover is dropping unrelated connections

lieutdan13

We have a relatively simple network setup.

WAN1 - Primary ISP, used primarily for connections initiating internally
WAN2 - Secondary ISP, VPN to a co-location, used mostly for publicly available, hosted services (mail, web, etc.)
LAN1 - Servers
LAN2 - workstations

WAN1 is configured to fail over to WAN2 using a gateway group, with WAN1 being Tier 1 and WAN2 being Tier 2.

The issue I'm having, and have not been able to resolve, is that whenever WAN1 has an outage, the failover itself seems to work correctly, but all connections are dropped on all interfaces. The VPN connection on WAN2 goes down (the WAN2 link itself stays up the entire time), and connections between LAN1 and LAN2 are also dropped (Samba shares, network printing, etc.).

I don't know if this is caused by a CPU spike and the firewall not being able to hold the connections, or if this is done by design. I am willing to debug and troubleshoot as much as possible to resolve this. Please let me know if any additional information is needed to help me troubleshoot.

Thanks in advance.

EDIT:
Running 2.1.4-RELEASE (amd64)

EDIT2:
Just to clarify, the connections between LAN1 and LAN2 are dropped, but can be re-established after the failover occurs. Also, once WAN1 comes back up, Tier 1 is used again and the connections between LAN1 and LAN2 are dropped a second time.

lieutdan13

After doing some research, it looks like this happens because the entire state table is cleared after a change in any WAN interface in a failover, regardless of which interfaces the connections in the state table actually use. I understand that this is necessary in some (most?) circumstances, but I feel that this behavior is not very flexible and produces undesired results.

Can anyone shed some light as to why the state table cannot keep track of the interfaces involved in the state and only clear those states when the interface status changes?

jimp

System > Advanced, Misc tab, check the box to disable state killing on gateway failure.

The problem isn't that the states can't be tracked by interface, it's more subtle than that. We had tried killing per interface on 2.0.x but it misbehaved in other ways.

The issue is that there are two states for each connection - one as it enters the firewall, one as it leaves.

e.g. In on LAN, x.x.x.x to y.y.y.y, and out on WAN1, x.x.x.x NAT to z.z.z.z, to y.y.y.y

If you kill the WAN1 states, the LAN state is still there. If you kill the LAN and WAN1 states, it's still disruptive. The LAN state doesn't know the ultimate destination of the traffic, so it can't be targeted for clearing accurately in an efficient way. (Sure you could try to map all of the connections from the state table output but that would require lots of extra manual processing and does not scale well…)
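To make the two-states-per-connection point concrete, here is a toy model in Python. It is only a sketch: the `State` class, the `kill_by_interface` helper, and all of the addresses are made up for illustration and do not reflect pf's actual data structures or how pfSense implements state killing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class State:
    """Hypothetical, simplified state entry (not pf's real structure)."""
    iface: str                      # interface the state was created on
    direction: str                  # "in" or "out" relative to the firewall
    src: str
    dst: str
    nat_src: Optional[str] = None   # translated source, if NAT applied


# Placeholder addresses from documentation ranges.
table = [
    # One Internet-bound connection = two states.
    State("LAN1", "in", "10.0.1.10", "203.0.113.5"),
    State("WAN1", "out", "10.0.1.10", "203.0.113.5", nat_src="198.51.100.2"),
    # One LAN-to-LAN connection (e.g. a Samba share) = two states.
    State("LAN2", "in", "10.0.2.20", "10.0.1.10"),
    State("LAN1", "out", "10.0.2.20", "10.0.1.10"),
]


def kill_by_interface(states, iface):
    """Drop only the states that were created on the given interface."""
    return [s for s in states if s.iface != iface]


after = kill_by_interface(table, "WAN1")

# The inbound LAN1 half of the Internet-bound connection survives, and
# nothing in that entry says the traffic was leaving via WAN1.
leftover = [s for s in after if s.direction == "in" and s.dst == "203.0.113.5"]

# Finding it means pairing every inbound state with an outbound state by
# (src, dst), which is the extra mapping work described above.
wan1_pairs = {(s.src, s.dst) for s in table if s.iface == "WAN1"}
needs_kill = [s for s in after if (s.src, s.dst) in wan1_pairs]
```

In this sketch `leftover` and `needs_kill` both end up holding the single LAN1 inbound entry, while the two LAN-to-LAN states are untouched.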

We still have a ticket open somewhere to deal with that properly in the future, but until we get to that stage, the disruption is the only way to deal with it.

lieutdan13

jimp, thank you for your response and for the clarification. The current solution of dropping all states makes complete sense given the current pfSense limitations.

If I check the "disable state killing on gateway failure" box, what are the implications?
How will this affect the states of the existing connections of the WAN that has gone down?
Will the current connections drop and states clear after a certain amount of time?
Hypothetically (perhaps oversimplified), if I have made a request to site A and then WAN1 goes down, my browser will eventually time out. If I refresh the page, will my computer establish a new connection through the failed-over WAN2, or will the stale state be reused?

jimp

New connections would flow via the newly activated WAN, old connections would be left alone.
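As a rough illustration of that behavior, here is a toy Python model. The names `pick_gateway` and `open_connection` and the connection labels are hypothetical; this is a sketch of the idea, not how pfSense selects gateways internally.

```python
# Gateway group from the original post: lower tier number is preferred.
tiers = {"WAN1": 1, "WAN2": 2}
status = {"WAN1": True, "WAN2": True}   # True means the gateway is up

# Maps a connection label to the gateway chosen when it was opened.
connections = {}


def pick_gateway(tiers, status):
    """Return the up gateway with the lowest tier number, or None."""
    for gw in sorted(tiers, key=tiers.get):
        if status[gw]:
            return gw
    return None


def open_connection(name):
    """Bind a new connection to whichever gateway is preferred right now."""
    connections[name] = pick_gateway(tiers, status)


open_connection("request-to-site-A")   # opened while WAN1 is up

# WAN1 fails and, with state killing disabled, nothing is flushed.
status["WAN1"] = False

open_connection("page-refresh")        # a brand new connection
```

In this model the old entry stays bound to WAN1 (and would simply stall until it times out), while the refresh is a new connection and is assigned to WAN2.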