pfSense Failover Drops/Interrupts Connections on Restart of Primary



  • Hello,

    I have two pfSense boxes operating in a failover type setup – identically connected, syncing rules and connection states and sharing CARP IPs.

    I have tested failover before successfully by rebooting the primary pfsense box.

    But recently, I rebooted my primary firewall box. As expected, all states were synced to the backup, and everything continued without a hiccup: SSH connections didn't drop and users didn't experience any interruptions. The CARP VIPs flipped over and the backup picked right up where the primary left off.

    However, when the primary restarted, I lost connections to some things. Some long running connections stayed up, but many of the more active ones went down.

    I'm guessing that the CARP state reverts to the primary before pfsync has finished syncing connection states back to it.

    I have checked the configuration on both boxes: state sync is enabled on both, the CARP VIPs look correct, and outbound NAT is using a CARP VIP. I can't find anything that would cause this behaviour, especially since failover has worked seamlessly before.

    Has anyone else experienced this, and is there a way to correct it?

    Thank you


  • Netgate Administrator

    It should not behave like that. If you have state sync configured correctly it syncs states both ways so that states created on the secondary when it was master should be on the primary when it fails back. It looks like that is working to some extent otherwise no states would exist on the primary after it was rebooted.
    Check the interface order in the config files is identical.
    Check for any errors on the sync interface.

    Steve



  • Hi @stephenw10, I have double-checked that state sync is enabled on both firewalls and that each unit has its opposite number's IP address on the sync interface.

    I have checked the interface order and it is exactly the same on both units; I also checked the XML config downloaded from each unit.

    The sync interfaces show no errors. I went to Status -> Interfaces, and below is a screenshot:

    87cfaffb-0228-4a01-900c-367d4d318e9b-image.png

    Please let me know if there are any other checks I can perform.

    Thank you


  • Netgate Administrator

    Does it show approximately the same number of states on both nodes?

    An interesting test, if you're able to do it, would be to set the primary in maintenance mode and then reboot it. It should still be in maintenance mode after that. Bring it back to normal mode after a few minutes and see if it does the same thing. I'm wondering if the CARP is switching back to the Primary before the states have sync'ed somehow.

    Steve
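To answer the "approximately the same number of states" question with something less subjective than eyeballing two dashboards, the counts can be compared with a tolerance. This sketch assumes the text passed in is the output of `pfctl -si` captured on each node, where the State Table section contains a `current entries` line; the 5% tolerance and the function names are illustrative assumptions.

```python
import re


def current_states(pfctl_si_output: str) -> int:
    """Extract the first 'current entries' value (the State Table one) from `pfctl -si` text."""
    match = re.search(r"^\s*current entries\s+(\d+)", pfctl_si_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'current entries' line found")
    return int(match.group(1))


def counts_roughly_equal(a: int, b: int, tolerance: float = 0.05) -> bool:
    """True when the two counts differ by at most `tolerance` of the larger one."""
    larger = max(a, b)
    if larger == 0:
        return True  # both tables are empty
    return abs(a - b) / larger <= tolerance
```

Because states are created and expire constantly, an exact match is not expected; a large gap shortly after the primary boots would be consistent with the failback happening before the bulk sync completes.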



  • @stephenw10 Hi, sorry for the delayed response; I was waiting on a maintenance window to try your suggestion. I can confirm the state counts are the same on both main and standby, or at least within a certain range, as they keep changing.

    I put the main firewall in maintenance mode and rebooted it, and we didn't experience any interruptions. It definitely looks like the main unit becomes the master before all the states are synced from the backup after the main unit restarts.

    Please advise on the next steps.

    Thanks a lot for the help.

    Regards


  • Netgate Administrator

    So to be clear, you were able to bring the Primary out of maintenance mode some time after booting and it failed back without losing connections?

    Steve



  • @stephenw10 Hi Stephen, I confirm that is correct.

    Thank you

    Regards


  • Netgate Administrator

    Hmmm, I've never seen that. How many states would you have open typically when you do this?



  • @stephenw10 Hello Stephen, about 36000 states.

    Regards


  • Netgate Administrator

    Hmm, that's not a huge number. I'll see if I can find anything about this.


  • Netgate Administrator

    Hmm, looks like it's this: https://redmine.pfsense.org/issues/2218

    Clearly almost nobody hits that, I've never seen it and that ticket is 7 years old!

    Do you have packages installed that might be delaying the state sync as it mentions there?

    Thanks,
    Steve



  • @stephenw10 Wow!! I've been scratching my head over this for a while. Good to know it's a known issue; I was contemplating rebuilding the units from scratch.

    I have only Snort and HAProxy enabled. I saw a mention of HAProxy in the link you provided.

    I do have plans to enable other services like OpenVPN in the near future.

    I hope a solution can be found, especially for unplanned shutdowns (due to power loss, for example), as I have auto-restart on power restoration enabled for the units.

    Thanks for all the help and support. I truly appreciate it.


  • Netgate Administrator

    Mmm, hard to see what we can do here without patching something quite low level.

    Ideally we would want it to remain in CARP maintenance until the states have synced. That would probably need to be selectable though, as some people will not be syncing states.

    We could probably force the Primary to boot into maintenance mode at every boot requiring manual intervention to failback. It would still failback automatically if the secondary went off-line entirely. Would that be in any way practical for you?

    Steve
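The idea of holding the primary in CARP maintenance until the states have synced can be sketched as a simple gate. This is not an existing pfSense feature or API: `local_count` and `peer_count` are hypothetical callables that return each node's current state count (how they obtain it is left open), and the tolerance and timeout values are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_for_state_sync(
    local_count: Callable[[], int],
    peer_count: Callable[[], int],
    tolerance: float = 0.05,
    timeout_s: float = 120.0,
    poll_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the local state table has caught up with the peer, or time out.

    Returns True when it looks safe to leave maintenance mode, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        local, peer = local_count(), peer_count()
        # Caught up when the peer has no states or we hold nearly as many as it does.
        if peer == 0 or local >= peer * (1 - tolerance):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A boot script built on this would leave maintenance mode only when the function returns True, and would need a separate policy for the timeout case, for example staying in maintenance and requiring manual failback.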

