CARP didn't fail over after WAN went offline



  • Hi,

    Does anyone know why my setup didn't fail over to the other node in the HA cluster when the WAN went offline?

    I failed it over manually and everything went fine. I then restarted the primary node and the gateway came back online.
    I thought there was some algorithm that would fail the VIPs over to another node if the gateway was down?

    Any ideas?

    [Attached image: WAN-DOWN.png]


  • Netgate

    CARP/HA does not fail over on a gateway failure like that. Multi-WAN does.

    HA would fail over on a down interface.

    You should have the same WANs on both cluster nodes so a failure on one should also be a failure on the other. If that is not the case you need to figure out why the WAN is failing on that node and fix it.



  • Hi,

    Thank you for your reply. It turns out I had put a /32 on the CARP VIP rather than a /24 (which is what the network is), which is why it didn't fail over to the second node. (I've tested this in a lab and can confirm the same symptoms in that scenario.)
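
    The mask mismatch described here is easy to check for offline. The following is a minimal sketch using Python's standard `ipaddress` module; the function name `carp_vip_mask_ok` and the 192.0.2.0/24 documentation addresses are illustrative assumptions, not values from this setup.

```python
import ipaddress


def carp_vip_mask_ok(vip_cidr: str, iface_cidr: str) -> bool:
    """Return True when the VIP is in the interface's subnet with the same prefix length.

    Both arguments are address/prefix strings, e.g. "192.0.2.1/24".
    """
    vip = ipaddress.ip_interface(vip_cidr)
    iface = ipaddress.ip_interface(iface_cidr)
    # Comparing the derived networks checks both the subnet and the mask at once:
    # a /32 VIP yields a single-host network that never equals a /24.
    return vip.network == iface.network


if __name__ == "__main__":
    # Hypothetical addresses from the documentation range.
    print(carp_vip_mask_ok("192.0.2.1/32", "192.0.2.2/24"))  # False
    print(carp_vip_mask_ok("192.0.2.1/24", "192.0.2.2/24"))  # True
```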

    My only concern is this: should one of the interfaces on the vSwitch stop forwarding traffic (so technically not a link-down state), is there a way the process could check whether the other node in the cluster can ping the same gateway and, if it can, trigger a complete failover to the secondary node, or even just fail over blindly?

    On physical hardware I don't think this would be an issue unless you hit something like a UDLD issue on fiber. On virtual instances, however, there's a vSwitch to contend with before the physical switch.

    Is there any way to cause a full failover if the gateway goes offline on the primary, rather than just the WAN interface failing over to the secondary and leaving the LAN on the primary?
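
    As the reply above says, CARP itself does not act on a gateway failure, so a check like this would have to live in an external watchdog. The following is a rough sketch of the decision logic only, under stated assumptions: `GatewayWatch`, `ping_once`, and the threshold of 3 are hypothetical names and values, and the demote/restore actions are left to the caller. FreeBSD's carp(4) documents a `net.inet.carp.demotion` sysctl that such a hook might adjust, but wiring that up on pfSense is an assumption that would need testing.

```python
import subprocess
from typing import Callable


def ping_once(host: str) -> bool:
    """Send one ICMP echo via the system ping; True if it was answered."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=5,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


class GatewayWatch:
    """Demote this node after `threshold` consecutive failed probes, restore on success."""

    def __init__(self, probe: Callable[[], bool], on_fail: Callable[[], None],
                 on_recover: Callable[[], None], threshold: int = 3) -> None:
        self.probe = probe            # returns True when the gateway answers
        self.on_fail = on_fail        # hypothetical hook, e.g. raise CARP demotion
        self.on_recover = on_recover  # hypothetical hook, e.g. lower it again
        self.threshold = threshold
        self.failures = 0
        self.demoted = False

    def tick(self) -> bool:
        """Run one probe and return True while the node is demoted."""
        if self.probe():
            self.failures = 0
            if self.demoted:
                self.on_recover()
                self.demoted = False
        else:
            self.failures += 1
            if self.failures >= self.threshold and not self.demoted:
                self.on_fail()
                self.demoted = True
        return self.demoted
```

    A caller would construct it with something like `GatewayWatch(lambda: ping_once("192.0.2.254"), demote, restore)` and call `tick()` on a timer. Note that this fails over blindly: it does not confirm that the secondary can reach the gateway, so an upstream outage would demote the primary for no benefit.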

    * You may be asking why a link would stop forwarding traffic. On our setup I think there's an issue with the drivers it's using: it was set up to use the Realtek drivers, but it's on a QEMU/KVM host. We could only get about 3 days of uptime before traffic would stop flowing. I changed one interface to VirtIO and the issue hasn't come back on it since, so I've now changed all the interfaces to VirtIO. Fingers crossed we can get more than 3 days of uptime.


  • Netgate

    pfSense HA/CARP is a Layer 3 failover system designed to fail over in most router failure scenarios. Your vSwitch not passing traffic while still showing link up would be a Layer 2 failure. You would need to build in redundancy at Layer 2 for that.