CARP Primary/Backup with IPsec VPN failover
I believe I've identified a bug in the current implementation of CARP Primary/Backup with IPsec VPN failover.
I first suspected something was wrong when one of our P2 (phase 2) tunnels stopped passing traffic after running for over a week. Other P2 tunnels under the same P1 (phase 1) tunnel could still transmit and receive data, so I initially suspected a routing issue. However, when the P2 timer expired on the affected tunnel (1 hour later), the problem resolved itself. The following week the same thing happened under the same P1 tunnel, but this time a different P2 tunnel was affected. Unfortunately this second occurrence was particularly service-affecting because it happened during working hours. We dropped the P2 tunnel and it automatically re-established, which resolved the problem.
After further investigation I found that, on both occasions, the backup pfSense firewall had been attempting to establish the same P1 IPsec VPN tunnel at around the time the P2 tunnel stopped passing data. To confirm whether a device inside the LAN was using the backup pfSense firewall's IP address to reach the remote P2 subnets, I placed a block rule with logging on the backup firewall's LAN interface, and found that this was the case. The other side of the VPN would have been receiving negotiation attempts from both the primary and backup pfSense firewalls, which I believe resulted in the P2 tunnel being negotiated incorrectly, leaving it in an up state but passing no traffic.
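For anyone trying to do the same hunt, here is a minimal sketch of how the logged packets could be summarised per source. It assumes the blocked packets have already been exported as (source, destination) pairs; the subnets, addresses, and the function name `offending_sources` are made up for illustration and are not anything pfSense provides.

```python
import ipaddress
from collections import Counter

# Hypothetical remote P2 subnets, for illustration only; substitute your own.
REMOTE_P2_SUBNETS = [
    ipaddress.ip_network("10.20.0.0/16"),
    ipaddress.ip_network("192.168.50.0/24"),
]


def offending_sources(packets, remote_subnets):
    """Count packets per LAN source whose destination lies in a remote P2 subnet."""
    counts = Counter()
    for src, dst in packets:
        dst_ip = ipaddress.ip_address(dst)
        if any(dst_ip in net for net in remote_subnets):
            counts[src] += 1
    return counts


# Hypothetical (source, destination) pairs taken from the backup's block-rule log.
sample = [
    ("192.168.1.40", "10.20.3.7"),
    ("192.168.1.40", "10.20.3.8"),
    ("192.168.1.55", "8.8.8.8"),
    ("192.168.1.77", "192.168.50.10"),
]

# List the busiest offenders first.
for src, hits in offending_sources(sample, REMOTE_P2_SUBNETS).most_common():
    print(src, hits)
```

Any source that shows up here is sending remote-subnet traffic to the backup and is worth checking for a stale route or gateway setting.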
I appreciate that we shouldn't use the backup firewall's IP address as a route/gateway, but how can we prevent pfSense from attempting to establish an IPsec VPN when the tunnel uses a virtual CARP IP as its interface? CARP is in backup mode on the backup and master mode on the primary; surely it shouldn't attempt to establish the VPN unless CARP is in master mode?
It's not a bug. It's working as expected in this case, doing exactly what it was told to do. The pinger process on the firewall is smart enough not to transmit the keep-alives if it's a CARP backup, but if it receives traffic in the way you describe, it'll try to establish the tunnel.
The only other way around that would be to keep the IPsec daemon shut down while in a CARP backup state and that's a bit harsh.
We managed to track down the devices which were routing to the backup; they had old static routes left over from before we used pfSense.
It would be good if there was a more practical solution. In large environments with lots of site-to-site traffic, problems like this can be very detrimental, and it's not always straightforward to find the source if you have a rogue device.
We previously used Cisco 7200s with HSRP. When an IPsec tunnel is configured with HSRP, it will never attempt to establish if the HSRP group is in standby mode. It would be good to have an option not to establish an IPsec tunnel when it's assigned to a CARP interface which is in backup mode. Is there any chance of this being added as a feature in the future?
I've done some digging, and it looks like a similar problem was reported roughly 5 years ago, which was apparently resolved 2 years ago in version 2.2.
Although our IPsec tunnel isn't initiating all the time as described here, it is initiating whenever a device on the LAN sends the backup a packet destined for the remote subnet. If you have hundreds or thousands of devices, some of which you have no management over, this kind of problem can soon become a serious issue.
My understanding of CARP is that the primary and failover devices persistently communicate with each other (in a similar fashion to HSRP) to determine which will be the master and which the backup. So it would be nice to have a process that checks the CARP status before initiating the IPsec tunnel, should it be assigned to a CARP interface.
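To illustrate the kind of check I mean, here is a rough sketch that reads the CARP state from `ifconfig` text and only allows initiation when the VHID is MASTER. It assumes the FreeBSD-style status line `carp: BACKUP vhid 1 advbase 1 advskew 100`; the function names and the sample output are hypothetical, and this is not how pfSense itself makes the decision.

```python
import re

# Assumed ifconfig status line, e.g. "carp: BACKUP vhid 1 advbase 1 advskew 100".
CARP_LINE = re.compile(r"carp:\s+(MASTER|BACKUP|INIT)\s+vhid\s+(\d+)")


def carp_states(ifconfig_output):
    """Return {vhid: state} for every CARP status line found in the text."""
    return {int(vhid): state for state, vhid in CARP_LINE.findall(ifconfig_output)}


def may_initiate(ifconfig_output, vhid):
    """Allow tunnel initiation only when the given VHID is currently MASTER."""
    return carp_states(ifconfig_output).get(vhid) == "MASTER"


# Hypothetical output captured on the backup firewall.
sample_backup = """em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tinet 203.0.113.3 netmask 0xffffff00 broadcast 203.0.113.255
\tinet 203.0.113.1 netmask 0xffffff00 broadcast 203.0.113.255 vhid 1
\tcarp: BACKUP vhid 1 advbase 1 advskew 100
"""

print(may_initiate(sample_backup, 1))  # prints False: the backup should stay quiet
```

On a live box the text could come from `subprocess.run(["ifconfig"], capture_output=True, text=True).stdout`. A VHID that is missing from the output is treated the same as BACKUP, so the check fails closed.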
That ticket was referring to the keep-alive pinger process, which is what I already mentioned.
The two systems check heartbeats but that's at a completely different level than IPsec.
For 99.9% of people it works fine as-is. For someone with a misconfigured network it'll cause a problem, like you had, but there is so little benefit to "solving" this corner case that it's just not worth doing. It could negatively impact cases that are working fine now.