Feature request: on startup make CARP start initially for 60s in temporary maintenance mode



  • The problem:
    With HA failover clustered pfSense:

    1. if the primary firewall reboots
    2. the backup firewall takes over quickly (good)
    3. But when the primary comes back, CARP starts up early in the startup sequence and usurps the backup firewall (bad)
    4. The fail back takes quite a while, because the primary still has to start packages like FRR and they take a while to come up, much longer than if the firewall was already running

    Suggested Feature
    Allow us to set a timer option, something like:

    Always Startup CARP in temporary maintenance mode for xxx seconds

    How would this work?
    It would work great: if only the master or backup firewall was running and no other firewall was present, then even though CARP would come up in temporary maintenance mode during a reboot, the node would still instantly become CARP master, because no other CARP advertiser was present.

    If the backup was present and running, the primary then has xxx seconds (I think that would be set at 60, or even longer: 300/600/900 seconds) to come up, stabilize, and then take over CARP.
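FreeBSD's CARP already has a knob that could implement this: the `net.inet.carp.demotion` sysctl, which adjusts the demotion counter so every VIP on the node advertises as a worse candidate without stopping CARP. A rough boot-time sketch of the idea; the 240 adjustment and the hold period are illustrative assumptions, not existing pfSense settings:

```shell
#!/bin/sh
# Sketch only: hold this node back from preempting CARP for a grace
# period after boot. net.inet.carp.demotion is a real FreeBSD sysctl;
# writing a value to it *adds* that value to the demotion counter.
# HOLD_SECONDS and the 240 adjustment are illustrative assumptions.

HOLD_SECONDS=60

# Demote: every CARP VIP on this box now advertises as a worse
# candidate. If no peer is present, this box still becomes MASTER,
# because it is the only advertiser on the wire.
sysctl net.inet.carp.demotion=240

sleep "$HOLD_SECONDS"

# Restore normal priority; with its lower advskew this node may now
# preempt the backup and fail back.
sysctl net.inet.carp.demotion=-240
```

The demotion counter starts at 0 on every boot, so a crash mid-hold leaves nothing stuck; a production version would also want to end the hold early once packages like FRR report they have converged.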

    Notes
    I know: if it's a "structured reboot", you put the primary into persistent CARP maintenance mode, then restart; once it's up and all stable, you leave persistent CARP maintenance mode and fail back cleanly and quite fast.

    In a way, this feature mostly automates that.

    It protects against a number of scenarios that I have encountered:

    • Rebooting the primary without putting it into persistent CARP maintenance mode (oops)
    • HA primary unintended reboot (hardware or power failure)
    • Unstable firewall hardware fault: it reboots, runs for 2 minutes, crashes, reboots again (a longer timer will catch an unstable HA firewall and not allow it to take over; you could set that timer quite high, at 600 or 900 seconds)

    Advanced Idea
    You could also have a fail back time option:

    Exit CARP temporary maintenance mode between hours xxxx - yyyy

    That way, no interruption until the middle of the night. But if the backup failed, since no other CARP would be present, it would instantly take over.
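The window check itself is easy to sketch. Assuming a hypothetical helper `in_window` (not an existing pfSense function) that also handles windows wrapping past midnight:

```shell
#!/bin/sh
# Hypothetical sketch of the "only fail back inside a time window" idea.
# in_window HOUR START END succeeds when HOUR falls inside [START, END),
# including windows that wrap past midnight (e.g. 23 to 5).
in_window() {
    hour=$1; start=$2; end=$3
    hour=${hour#0}   # "08" -> "8", avoids any octal ambiguity
    if [ "$start" -le "$end" ]; then
        [ "$hour" -ge "$start" ] && [ "$hour" -lt "$end" ]
    else
        [ "$hour" -ge "$start" ] || [ "$hour" -lt "$end" ]
    fi
}

# Example: only exit temporary maintenance mode between 02:00 and 05:00.
if in_window "$(date +%H)" 2 5; then
    : # restore normal CARP priority here
fi
```

If the backup fails outside the window, the primary still takes over instantly, for the same reason as above: maintenance mode only demotes, and with no other advertiser present the demoted node still wins.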


  • LAYER 8 Moderator

    @nzkiwi68 said in Feature request: on startup make CARP start initially for 60s in temporary maintenance mode:

    The fail back takes quite a while because the firewall still has to start packages like FRR and they take a while to come up, much longer than if the firewall was already running

    I don't exactly know what your problem is with 3 and 4. Do you have problems when the master takes back its services? We are running a big cluster setup with a multitude of VPN tunnels and OVPN servers for remote access, as well as Radius etc., and have never had a problem with node1 (master) taking over after rebooting. What's your scenario that makes a timed takeover so much faster/better?



  • If your problem is with package behavior (FRR?), perhaps it is better addressed as a feature request for FRR. Like JeGr, I haven't seen any problem with multi-WAN clusters running standard functions such as IPSec and OpenVPN.



  • @JeGr

    Scenario 1 - primary and backup already running
    Reboot primary - backup takes over quickly

    Scenario 2 - primary starting up and backup already running
    Primary comes up, CARP takes over from backup firewall, then takes as long as 1-2 minutes for primary to take over FRR routing, VPNs etc.

    Perhaps it is an FRR type issue...


  • LAYER 8 Moderator

    @nzkiwi68 said in Feature request: on startup make CARP start initially for 60s in temporary maintenance mode:

    Primary comes up, CARP takes over from backup firewall, then takes as long as 1-2 minutes for primary to take over FRR routing, VPNs etc.

    Nope, nothing of the sort here, and we run multiple packages on the nodes - but no FRR. OpenVPN, IPSEC, FreeRadius etc. have no problem whatsoever with the primary coming back and taking over; seconds later the first VPN connections authenticating via FR are already connected again, so I think that could very well point to FRR. As FRR (OSPF?) can take a bit to sort out any other peers, exchange routes etc., that could probably be the culprit - or something slowing the process down.



  • I still got this issue; now I can replicate it easily at 2 completely separate sites, both on 2.4.4_p3 and both using:

    • FRR and OSPF

    • HA pair

    • IPSEC VTI tunnels bound to a CARP IP address

    • FRR set to follow the LAN CARP address (so FRR is off on the backup firewall)

    Here's a continuous ping across the VPN from site A to site B.

    Reply from 10.10.40.1: bytes=32 time=4ms TTL=253
    Reply from 10.10.40.1: bytes=32 time=7ms TTL=253
    Request timed out.
    Request timed out.
    Request timed out.
    Request timed out.
    Reply from 10.10.40.1: bytes=32 time=4ms TTL=253
    Reply from 10.10.40.1: bytes=32 time=3ms TTL=253
    Reply from 10.10.40.1: bytes=32 time=4ms TTL=253
    Reply from 10.10.40.1: bytes=32 time=3ms TTL=253

    The first timeouts: that's the primary firewall being rebooted; 4 pings lost and the backup completely takes over. Very acceptable. Excellent.

    Now the slow bit... The primary comes up, CARP takes over and takes ages for things to settle and go online.

    Reply from 10.10.40.1: bytes=32 time=3ms TTL=253
    Reply from 10.10.40.1: bytes=32 time=17ms TTL=253
    Request timed out.
    [... "Request timed out." repeated 71 times in total, roughly 70 seconds of outage ...]
    Reply from 10.10.40.1: bytes=32 time=3ms TTL=253
    Reply from 10.10.40.1: bytes=32 time=4ms TTL=253

    After digging, I think the cause is the VPN, IPSEC: it's just not getting released from the backup firewall in a timely manner; it seems to hold on and on and keeps running the IPSEC VPN tunnels. I can speed up the fail back by logging onto the backup firewall and stopping the IPSEC tunnels from the IPSEC status page.

    I wonder if the issue is because my IPSEC tunnels are using a CARP IP address?
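One way to watch this during a failover test is to poll the CARP state on both nodes from the shell and see when each VIP actually flips. A small hedged helper that parses `ifconfig`-style output (the sample input below is illustrative, not captured from these firewalls):

```shell
#!/bin/sh
# Print "vhid state" for every CARP VIP found in ifconfig-style text
# on stdin. ifconfig on FreeBSD/pfSense prints lines like:
#   carp: MASTER vhid 1 advbase 1 advskew 0
carp_states() {
    awk '/carp:/ { print $4, $2 }'
}

# On a live box: ifconfig | carp_states
# Illustrative sample input:
printf 'carp: MASTER vhid 1 advbase 1 advskew 0\ncarp: BACKUP vhid 3 advbase 1 advskew 100\n' | carp_states
# prints:
# 1 MASTER
# 3 BACKUP
```

If the backup still reports MASTER on the VIP the IPSEC tunnels are bound to long after the primary is back up, that matches the symptom of the tunnels not being released.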

