IPsec doesn't automatically establish on the member switched to after CARP failover
-
2.4.5 p1 -> 2.6.0 (edit: changed from 2.4.3 p1 to 2.4.5 p1)
IPsec would always re-establish on 2.4.5 p1, but after the upgrade to 2.6.0 (fresh install importing the old config.xml), if I force a failover between members by entering persistent CARP maintenance mode, the member that becomes MASTER doesn't establish IPsec. Status just shows the tunnels disconnected on the secondary firewall that was switched to, while the primary member still shows the tunnels as established. I left it that way for a few minutes each time I tested and it never re-established on the new master. This worked within seconds on the previous version. If I restart IPsec on the secondary member, it re-establishes IPsec without issue, but that doesn't happen on CARP failover.
Member 1 (put in persistent CARP maintenance mode):
Apr 12 08:27:04 charon 54131 08[KNL] interface ovpns1 deactivated
Apr 12 08:27:04 charon 54131 08[KNL] 10.0.x.129 disappeared from ovpns1
Apr 12 08:27:03 charon 54131 14[KNL] 15.x.x.130 disappeared from bge0
Apr 12 08:27:03 charon 54131 14[KNL] 10.1.x.1 disappeared from bge1

Member 2 (switched to this member):
Apr 12 08:27:03 charon 98764 08[KNL] 10.1.x.1 appeared on bge1
Apr 12 08:27:03 charon 98764 08[KNL] 15.x.x.130 appeared on bge0
Apr 12 08:27:05 charon 98764 10[KNL] interface ovpns1 activated
Apr 12 08:27:05 charon 98764 08[KNL] 10.0.y.129 disappeared from ovpns1
Apr 12 08:27:05 charon 98764 08[KNL] interface ovpns1 deactivated
Apr 12 08:27:05 charon 98764 01[CFG] ipseckey plugin is disabled
Apr 12 08:27:05 charon 98764 01[CFG] loaded 0 entries for attr plugin configuration
Apr 12 08:27:05 charon 98764 01[CFG] loaded 0 RADIUS server configurations
Apr 12 08:27:05 charon 98764 05[CFG] loaded IKE shared key with id 'ike-0' for: '%any', '16.x.x.1'
Apr 12 08:27:05 charon 98764 05[CFG] loaded IKE shared key with id 'ike-1' for: '%any', '124.x.x.1'
Apr 12 08:27:05 charon 98764 05[CFG] loaded IKE shared key with id 'ike-2' for: '%any', '110.x.x.1'
Apr 12 08:27:05 charon 98764 10[CFG] updated vici connection: bypass
Apr 12 08:27:05 charon 98764 06[CFG] updated vici connection: con1
Apr 12 08:27:05 charon 98764 06[CFG] installing 'con1_1'
Apr 12 08:27:05 charon 98764 06[CFG] installing 'con1_3'
Apr 12 08:27:05 charon 98764 06[CFG] installing 'con1_6'
Apr 12 08:27:05 charon 98764 06[CFG] installing 'con1_8'
Apr 12 08:27:05 charon 98764 06[CFG] installing 'con1_11'
Apr 12 08:27:05 charon 98764 06[CFG] updated vici connection: con3
Apr 12 08:27:05 charon 98764 06[CFG] installing 'con3_9'
Apr 12 08:27:05 charon 98764 06[CFG] installing 'con3_10'
Apr 12 08:27:05 charon 98764 06[CFG] updated vici connection: con2
Apr 12 08:27:05 charon 98764 06[CFG] installing 'con2_2'
Apr 12 08:27:05 charon 98764 06[CFG] installing 'con2_4'
Apr 12 08:27:05 charon 98764 08[KNL] interface ovpns1 activated
Apr 12 08:27:05 charon 98764 08[KNL] 10.0.y.129 appeared on ovpns1

If I don't restart IPsec on the secondary member and just switch back to the previous member by taking the primary out of persistent maintenance mode, the IPsec tunnel immediately starts working again on the primary. It is as if IPsec completely ignores the CARP failover.
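In case it helps anyone reproduce this from the shell, strongSwan's swanctl should show the same picture as the GUI (a rough sketch using the connection names from the log above; I normally restart via the GUI, which does more than this):

  # on the member that just became MASTER
  swanctl --list-conns            # con1/con2/con3 are loaded, matching the vici lines above
  swanctl --list-sas              # no IKE_SAs/CHILD_SAs listed, i.e. nothing established
  swanctl --initiate --ike con1   # one way to kick a single tunnel instead of restarting the whole service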
-
I forgot to mention that the remote ends of the tunnels are Check Point firewalls.
-
More info...
Not using dead peer detection. Not using dual WAN, just a single ISP WAN link. WAN, LAN, and Sync interfaces are in use on the firewall. The primary is set to sync all changes to the secondary, and that works; changes do get synced. I am using IKEv1 main mode with a pre-shared key and PFS on phase 2.
Should the SAD be synced to the secondary all the time to prepare for takeover? It is always empty on the secondary when looking at the Status > IPsec GUI. After the switch, the SAD/SPD are still empty on the secondary once it has taken over. Restarting IPsec on the secondary does make it work, as mentioned before, and then the SAD/SPD are populated in the GUI. Tunnels work fine after that.
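In case it's useful, the kernel SAD/SPD can also be dumped directly with FreeBSD's setkey instead of going through the GUI (a quick check, assuming shell access on both members):

  setkey -D     # dump the security association database (SAD); empty on the secondary here
  setkey -DP    # dump the security policy database (SPD)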
-
I finally had a chance to do more testing on this with debug set to Diag for IKE SA, IKE Child SA, and Configuration Backend. It failed over just fine and quickly. Failing back to the other firewall worked too.
I will have to do more testing another day to verify, but whatever caused it no longer seems to be an issue for some reason. The only things I changed were reconfiguring OpenVPN for a different use and enabling debug on IPsec, so maybe it is just a coincidence.
-
Hey there,
I'm currently seeing the same issue on my (fresh) HA setup. Not trying to hijack your thread, but maybe I can contribute something so either of us can figure it out.
One thing I noticed: during failback, the auxiliary node keeps its connection established for quite some time.
As a temporary workaround, I changed the Connection Mode for the SAs from "Always" to "On Demand".
Now, when failing back, the tunnel stops working as long as the auxiliary node keeps its connection. Or rather it looks that way; the tunnel is actually still working, read on. If I then reboot the backup node, the connection on the master is re-established.
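For context, if I understand the strongSwan side correctly, "Always" vs "On Demand" roughly maps to the child SA's start_action in swanctl.conf; a sketch with placeholder connection/child names (echoing the con1/con1_1 naming from the logs above, not my actual generated config):

  connections {
      con1 {
          children {
              con1_1 {
                  # "Always":    start_action = start  (charon initiates and keeps the CHILD_SA up)
                  # "On Demand": start_action = trap   (a trap policy triggers the SA on first matching traffic)
                  start_action = trap
              }
          }
      }
  }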
So, on failOVER it works right away: the second node notices demand for the connection and (re)establishes the whole tunnel.
On failBACK the second node keeps its connection, but of course it isn't usable, because the second node is no longer in possession of the LAN CARP address.
Meaning: on failBACK, the backup node stays connected to the IPsec tunnel, while the HA cluster fails back the LAN CARP IP only.
I confirmed from the far end of the tunnel that the connection is still up and that the node holding the backup role (again) still answers ping requests sent to the respective IP through the tunnel.
That led me to the "second" workaround: I have an external watchdog running, and whenever I notice the failback situation, I force a reboot on the node holding the backup role, so the master can re-establish the IPsec connection.
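The watchdog check itself is simple; roughly something like this, run periodically on each node (a sketch only: bge0 stands in for the WAN interface carrying the CARP VHID, and restarting the IPsec service instead of rebooting might already be enough):

  #!/bin/sh
  # if this node is CARP BACKUP on the WAN but still holds established IPsec SAs,
  # we are in the failback situation described above
  if ifconfig bge0 | grep -q "carp: BACKUP" && \
     swanctl --list-sas | grep -q ESTABLISHED; then
      logger "CARP BACKUP but IPsec still established - rebooting so the master can reconnect"
      shutdown -r now
  fi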
I'm just not sure if this is a bug or a misconfiguration in some tiny detail.
-
Just did another test:
While being in this situation:
- failback of the LAN CARP IP to node1 has happened
- node2 still has the IPsec connection established
I can continue using the tunnel if I manually change my gateway from the LAN CARP IP to the second node's IP address.
So, overall, the master node does not re-establish a connection because the connection is healthy; it is just no longer reachable for LAN clients.
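For anyone who wants to reproduce that test, on a Linux client it is just a temporary default-route swap (the addresses below are placeholders for my LAN, not taken from this thread):

  # 192.168.1.1 = LAN CARP IP, 192.168.1.3 = node2's own LAN address, 10.99.0.10 = a host behind the far end
  ip route replace default via 192.168.1.3   # point the client at node2 directly
  ping 10.99.0.10                            # still answered through the tunnel held by node2
  ip route replace default via 192.168.1.1   # switch back to the CARP IP afterwards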
However, the roles themselves claim that failBACK has also happened for the WAN CARP IP, so it might be an issue on the WAN side, where packets belonging to the tunnel are still sent to the backup node even though it no longer owns the WAN CARP IP. This would lead to the cluster assuming that the tunnel is healthy and that no reconnect is required.
But beyond that observation I can only guess, because I'm not familiar with how the whole CARP thing works. If it uses a spoofed/shared MAC, there shouldn't be any misrouted packets. If each node uses its own MAC address with the WAN CARP IP, it might be the router's MAC address table / ARP cache that keeps sending packets to the MAC of the node in the backup role, keeping that tunnel alive and "healthy", which ultimately suppresses the reconnect of the master role, the one that is reachable via the LAN CARP IP.
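As far as I know, CARP advertises a shared virtual MAC of the form 00:00:5e:00:01:<vhid>, so the ARP-cache theory should be easy to check (a sketch; bge0 and the address are placeholders, not values from this thread):

  # on the firewall: the CARP VIP, its vhid and state on the WAN interface
  ifconfig bge0 | grep -E "vhid|carp:"
  # on the upstream router or any other host in the WAN segment:
  # which MAC does the WAN CARP IP currently resolve to?
  arp -n 198.51.100.10    # placeholder for the WAN CARP IP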