HA state failover issue on 24.11
-
Hello, we have 2 Netgate 8300 firewalls. This is a fresh install and configuration. I've set up HA without issues on our previous firewalls, but I hadn't upgraded them to the 24.x branch before they were decommissioned and replaced with the current 8300s.
The issue is that when entering CARP maintenance mode on the primary firewall to fail over to the secondary firewall, all current firewall states out the WAN interface are no longer usable. New states out the WAN interface work fine. All states going from LAN to any of the OPT interfaces continue to work fine. We have publicly routable IPs on the WAN interface in the same /29 subnet, with a CARP IP in that subnet as well, and an outbound NAT rule that forces all LAN traffic to use that WAN CARP IP.
When leaving CARP maintenance mode on the primary, the same issue happens. All firewall states that were working on the secondary are then not usable on the primary, but new states work fine.
There is a dedicated sync interface between the two with pfsync configured to point to each other, and the primary syncs its config with XMLRPC to the secondary.
LAN and WAN both have a valid CARP IP within their respective subnets, and LAN clients are using the LAN CARP IP to reach the internet.
I've verified that the same CARP public IP address is seen by remote servers on the internet regardless of whether traffic is passing through the primary or the secondary firewall.
I've verified that all states exist in the Diagnostics State page on both firewalls.
I've done a traceroute through both primary and secondary, and the output looks identical.
All interfaces are set up exactly the same way on both: same interface, same name, etc. But just in case, I've changed the Global Firewall State Policy from Interface Bound States to Floating States, and that had no effect on the issue.
The only relevant package installed is Suricata, and I've verified that the problem still exists when disabling Suricata.
When pinging a remote server on the internet from the primary, the state table looks like the following on both firewalls:
LAN icmp 10.0.5.111:53 -> 8.8.8.8:8 0:0 5 / 5 300 B / 300 B
WAN icmp x.x.x.x:53 (10.0.5.111:53) -> 8.8.8.8:8 0:0 5 / 5 300 B / 300 B
And when failing over to the secondary, after maybe 30 seconds, I'll see the WAN entry expire and disappear, leaving only the LAN state. Again, new traffic to the internet will work normally, and existing traffic from LAN to an OPT interface (no outbound NAT) will work normally.
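In case it helps anyone reproduce the check, here is a rough Python sketch of how the orphaned LAN states could be spotted in a state table dump. It assumes the lines are pasted from the Diagnostics State page in the layout shown above (interface, protocol, source, optional pre-NAT source in parentheses, destination). The names `parse_state` and `lan_states_missing_wan` are made up for illustration and are not part of pfSense.

```python
import re
from typing import List, NamedTuple, Optional


class State(NamedTuple):
    iface: str
    proto: str
    src: str                 # source as seen on this interface (post-NAT on WAN)
    nat_src: Optional[str]   # pre-NAT source, if the line shows one in parentheses
    dst: str


# Assumed layout: "<iface> <proto> <src> [(<pre-NAT src>)] -> <dst> ..."
STATE_RE = re.compile(
    r"^(?P<iface>\S+)\s+(?P<proto>\S+)\s+"
    r"(?P<src>\S+)(?:\s+\((?P<nat_src>[^)]+)\))?\s+->\s+(?P<dst>\S+)"
)


def parse_state(line: str) -> Optional[State]:
    """Parse one state line; return None if it does not match the assumed layout."""
    m = STATE_RE.match(line.strip())
    if m is None:
        return None
    return State(m["iface"], m["proto"], m["src"], m["nat_src"], m["dst"])


def lan_states_missing_wan(lines: List[str], lan: str = "LAN", wan: str = "WAN") -> List[State]:
    """Return LAN states that have no WAN state carrying the same pre-NAT source."""
    states = [s for s in map(parse_state, lines) if s is not None]
    wan_keys = {
        (s.proto, s.nat_src, s.dst)
        for s in states
        if s.iface == wan and s.nat_src is not None
    }
    return [
        s for s in states
        if s.iface == lan and (s.proto, s.src, s.dst) not in wan_keys
    ]


if __name__ == "__main__":
    # Dump taken after the WAN entry has expired: only the LAN line remains.
    after_failover = [
        "LAN icmp 10.0.5.111:53 -> 8.8.8.8:8 0:0 5 / 5 300 B / 300 B",
    ]
    for state in lan_states_missing_wan(after_failover):
        print("orphaned LAN state:", state)
```

Running this against a dump from each firewall before and after the failover would show at what point the WAN half of the pair goes missing. It only matches on the text of the GUI output, so it will not catch states whose layout differs from the two lines above.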
I'm running out of ideas.