    pfSense Failover Drops/Interrupts Connections on Restart of Primary

    • A
      AcaaliK last edited by

      Hello,

      I have two pfSense boxes operating in a failover type setup – identically connected, syncing rules and connection states and sharing CARP IPs.

      I have tested failover before successfully by rebooting the primary pfsense box.

      But recently, I rebooted my primary firewall box. As expected, all states were synced to the backup and everything continued without a hiccup: SSH connections didn’t drop and users didn't experience any interruptions. The CARP VIPs flipped over and the backup picked right up where the primary left off.

      However, when the primary came back up, I lost connections to some things. Some long-running connections stayed up, but many of the more active ones went down.

      I'm guessing that CARP fails back to the primary before pfsync has finished syncing the connection states.

      I have checked the configuration on both boxes: state sync is enabled on both, the CARP VIPs look correct, and outbound NAT is using a CARP VIP. I can’t find anything that would cause this behaviour, especially since failover has worked seamlessly before.

      Has anyone else experienced this, and is there a way to correct it?

      Thank you

      • stephenw10
        stephenw10 Netgate Administrator last edited by

        It should not behave like that. If you have state sync configured correctly, it syncs states both ways, so states created on the secondary while it was master should be on the primary when it fails back. It looks like that is working to some extent, otherwise no states would exist on the primary after it was rebooted.
        Check the interface order in the config files is identical.
        Check for any errors on the sync interface.

        Steve

        • A
          AcaaliK last edited by

          Hi @stephenw10, I have double-checked that state sync is enabled on both firewalls and that each unit has its opposite number's IP address on the sync interface.

          I have checked the interface order and it is exactly the same on both units; I also compared the XML config downloaded from each unit.

          The sync interface shows no errors. I checked under Status -> Interfaces; screenshot below:

          [screenshot: 87cfaffb-0228-4a01-900c-367d4d318e9b-image.png]

          Please let me know if there are any other checks I can perform.

          Thank you

          • stephenw10
            stephenw10 Netgate Administrator last edited by stephenw10

            Does it show approximately the same number of states on both nodes?
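
One rough way to make that comparison repeatable is sketched below. It assumes you have captured the text of `pfctl -si` from each node (for example via Diagnostics -> Command Prompt) and only reads the "current entries" line. The function names, the 5% tolerance, and the sample text are illustrative assumptions, not anything pfSense ships.

```python
import re


def current_states(pfctl_si_output: str) -> int:
    """Return the 'current entries' count found in `pfctl -si` text."""
    match = re.search(r"current entries\s+(\d+)", pfctl_si_output)
    if match is None:
        raise ValueError("no 'current entries' line found")
    return int(match.group(1))


def states_roughly_equal(primary: int, secondary: int,
                         tolerance: float = 0.05) -> bool:
    """True when the counts differ by at most `tolerance` of the larger one.

    The 5% default is an arbitrary assumption; counts drift constantly
    on a live pair, so an exact match is not expected.
    """
    larger = max(primary, secondary)
    if larger == 0:
        return True
    return abs(primary - secondary) / larger <= tolerance


# Made-up sample text in the general shape of the State Table section.
PRIMARY_SAMPLE = """State Table                          Total             Rate
  current entries                    35810
"""
SECONDARY_SAMPLE = """State Table                          Total             Rate
  current entries                    36020
"""

if __name__ == "__main__":
    p = current_states(PRIMARY_SAMPLE)
    s = current_states(SECONDARY_SAMPLE)
    print(p, s, states_roughly_equal(p, s))
```

If the counts are far apart shortly after the primary boots but converge later, that would be consistent with CARP failing back before the state sync has caught up.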

            An interesting test, if you're able to do it, would be to set the primary in maintenance mode and then reboot it. It should still be in maintenance mode after that. Bring it back to normal mode after a few minutes and see if it does the same thing. I'm wondering if the CARP is switching back to the Primary before the states have sync'ed somehow.

            Steve

            • A
              AcaaliK @stephenw10 last edited by

              @stephenw10 Hi, sorry for the delayed response; I was waiting on a maintenance window to try your suggestion. I can confirm the state counts are the same on both main and standby, or at least within a narrow range, as they keep changing.

              I put the main firewall in maintenance mode and rebooted it, and we didn’t experience any interruptions. It definitely looks like the main unit becomes master before all the states have been synced from the backup after the main unit restarts.

              Please advise on the next steps.

              Thanks a lot for the help.

              Regards

              • stephenw10
                stephenw10 Netgate Administrator last edited by

                So to be clear, you were able to bring the Primary out of maintenance mode some time after booting and it failed back without losing connections?

                Steve

                • A
                  AcaaliK @stephenw10 last edited by

                  @stephenw10 Hi Stephen, I confirm that is correct.

                  Thank you

                  Regards

                  • stephenw10
                    stephenw10 Netgate Administrator last edited by

                    Hmmm, I've never seen that. How many states would you have open typically when you do this?

                    • A
                      AcaaliK @stephenw10 last edited by

                      @stephenw10 Hello Stephen, about 36000 states.

                      Regards

                      • stephenw10
                        stephenw10 Netgate Administrator last edited by stephenw10

                        Hmm, that's not a huge number. I'll see if I can find anything about this.

                        • stephenw10
                          stephenw10 Netgate Administrator last edited by

                          Hmm, looks like it's this: https://redmine.pfsense.org/issues/2218

                          Clearly almost nobody hits that, I've never seen it and that ticket is 7 years old!

                          Do you have packages installed that might be delaying the state sync as it mentions there?

                          Thanks,
                          Steve

                          • A
                            AcaaliK @stephenw10 last edited by

                            @stephenw10 Wow! I've been scratching my head over this for a while. Good to know it’s a known issue; I was contemplating rebuilding the units from scratch.

                            I only have Snort and HAProxy enabled. I saw a mention of HAProxy in the link you provided.

                            I do have plans to enable other services such as OpenVPN in the near future.

                            I hope a solution can be found, especially for unplanned shutdowns (due to power loss, for example), as I have auto-restart on power restoration enabled on the units.

                            Thanks for all the help and support. I truly appreciate it.

                            • stephenw10
                              stephenw10 Netgate Administrator last edited by

                              Mmm, hard to see what we can do here without patching something quite low level.

                              Ideally we would want it to remain in CARP maintenance mode until the states have synced. That would probably need to be selectable though, as some people will not be syncing states.
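
Purely to illustrate that idea, here is a minimal sketch of the gating logic: keep polling until the local state count has caught up with the peer's, or give up after a timeout. It is not how pfSense implements anything. The function name, the two count callbacks, and the tolerance and timing defaults are all hypothetical, and how you obtain the counts or leave maintenance mode is left out.

```python
import time
from typing import Callable


def wait_until_synced(local_count: Callable[[], int],
                      peer_count: Callable[[], int],
                      tolerance: float = 0.05,
                      timeout_s: float = 300.0,
                      poll_s: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until local states reach (1 - tolerance) of the peer's count.

    Returns True once caught up, False if `timeout_s` elapses first.
    `sleep` and `clock` are injectable so the loop can be tested
    without real delays.
    """
    deadline = clock() + timeout_s
    while True:
        local, peer = local_count(), peer_count()
        if local >= peer * (1.0 - tolerance):
            return True   # caught up: the caller could now leave maintenance mode
        if clock() >= deadline:
            return False  # gave up: the caller should stay in maintenance mode
        sleep(poll_s)


if __name__ == "__main__":
    # Simulated readings of the rebooted primary catching up to ~36000 states.
    readings = iter([100, 20000, 35900])
    ok = wait_until_synced(lambda: next(readings), lambda: 36000,
                           sleep=lambda _s: None)
    print("safe to fail back:", ok)
```

The timeout matters for the case where the secondary is unreachable: the loop should not block forever waiting for states that will never arrive.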

                              We could probably force the Primary to boot into maintenance mode at every boot, requiring manual intervention to fail back. It would still fail back automatically if the secondary went offline entirely. Would that be in any way practical for you?

                              Steve
