Netgate Discussion Forum

    Spontaneous Failover?

    HA/CARP/VIPs

    • mcampbell

      Under what circumstances would a CARP VIP suddenly fail over to the other node and stay there?

      I have a two-node CARP cluster, and it just switched over to the backup node even though the master was up, running, and still accessible; I could even get into its web interface just fine. The master was showing as BACKUP on the CARP status page, with no reason for it as far as I could tell. The only way I could find to switch it back was to reboot the master node, even though everything else seemed fine.

      Any ideas?

      • mcampbell

        Since no one seemed interested in answering that question, I'll pose another one:

        pfsense01 & pfsense02 are respectively in a master-backup configuration, with pfsync, xmlsync, CARP VIPs for LAN, WAN, & WAN2 (with respective configurations listed below), & a dedicated NIC for CARP traffic.  Assuming that both are functioning properly, is pfsense01 ALWAYS going to be the master of the CARP VIPs, or can it switch to the other one without it necessarily meaning that there's something wrong with the two nodes?

        | | LAN |
        | Settings | pfsense01 | pfsense02 |
        | IP Address | 10.1.1.1/23 | 10.1.1.1/23 |
        | VHID Group | 2 | 2 |
        | Advertising Frequency Base | 1 | 1 |
        | Advertising Frequency Skew | 0 | 100 |
        | | WAN |
        | Settings | pfsense01 | pfsense02 |
        | IP Address | 208.x.x.170/29 | 208.x.x.170/29 |
        | VHID Group | 3 | 3 |
        | Advertising Frequency Base | 1 | 1 |
        | Advertising Frequency Skew | 0 | 100 |
        | | WAN2 |
        | Settings | pfsense01 | pfsense02 |
        | IP Address | 71.x.x.18/29 | 71.x.x.18/29 |
        | VHID Group | 4 | 4 |
        | Advertising Frequency Base | 1 | 1 |
        | Advertising Frequency Skew | 0 | 100 |
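For reference, a minimal sketch of the arithmetic behind the base/skew settings in the table above. It assumes the usual CARP rule that a node advertises roughly every advbase + advskew/256 seconds and that the node advertising most frequently holds MASTER. The function names are made up for illustration, and the sketch ignores preemption settings, demotion, and lost advertisements, which are exactly the things that can move MASTER to the other node in practice.

```python
def advert_interval(advbase: int, advskew: int) -> float:
    """Seconds between CARP advertisements: advbase + advskew/256."""
    return advbase + advskew / 256


def expected_master(nodes: dict) -> str:
    """Return the node name with the shortest advertisement interval."""
    return min(nodes, key=lambda name: advert_interval(*nodes[name]))


# (advbase, advskew) per node, taken from the table above.
nodes = {"pfsense01": (1, 0), "pfsense02": (1, 100)}

for name, (base, skew) in nodes.items():
    print(f"{name}: advertises every {advert_interval(base, skew):.3f} s")
print("expected master:", expected_master(nodes))
```

With these settings pfsense01 advertises every 1.000 s and pfsense02 every 1.391 s, so pfsense01 is the expected master as long as pfsense02 keeps hearing its advertisements.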

        • jasonlitka

          Did you ever figure this out?  I'm having the same issue on my systems at the office.  Every few days I'll notice that everything is running from the backup box.  Disabling and enabling CARP on the first box, or simply rebooting the first box, will shift everything back over.

          I can break anything.

          • mattb253

            what are the actual interface IPs?

            • mcampbell

              I never did figure this out, and my workaround is the same as yours, Jason. But it's not a very good workaround.

              Matt, my apologies, I didn't realize I put this up without the actual interface IPs.  Here's a revised chart:

              | | LAN |
              | Settings | pfsense01 | pfsense02 |
              | CARP IP Address | 10.1.1.1/23 | 10.1.1.1/23 |
              | IP Address | 10.1.1.2/23 | 10.1.1.3/23 |
              | VHID Group | 2 | 2 |
              | Advertising Frequency Base | 1 | 1 |
              | Advertising Frequency Skew | 0 | 100 |
              | | WAN |
              | Settings | pfsense01 | pfsense02 |
              | CARP IP Address | 208.x.x.170/29 | 208.x.x.170/29 |
              | IP Address | 208.x.x.171/29 | 208.x.x.172/29 |
              | VHID Group | 3 | 3 |
              | Advertising Frequency Base | 1 | 1 |
              | Advertising Frequency Skew | 0 | 100 |
              | | WAN2 |
              | Settings | pfsense01 | pfsense02 |
              | CARP IP Address | 71.x.x.18/29 | 71.x.x.18/29 |
              | IP Address | 71.x.x.19/29 | 71.x.x.20/29 |
              | VHID Group | 4 | 4 |
              | Advertising Frequency Base | 1 | 1 |
              | Advertising Frequency Skew | 0 | 100 |

              • jasonlitka

                I'm running about 25 IPs with CARP. All are set to base/skew 1/0 on the master and 1/100 on the backup, just as yours are.

                I'm not showing any downtime in OpManager from either the pfSense boxes or the switches (a stacked pair of Dell 6248s), so I'm really not sure what is causing it. If there had been a momentary glitch, I'd have thought the IPs would switch back to the primary as soon as it was over.

                • mcampbell

                  I monitor my setup with Nagios, and it hasn't reported any glitches either. I've looked in the logs as well, but I've been unable to find anything (though it's possible I'm looking in the wrong spot, as there isn't a log tab specifically for CARP).

                  I also would have thought it would switch back, but that's definitely not the observed behavior.

                  • jasonlitka

                    I've spent the last hour or so digging through my syslog server (Kiwi sucks, BTW) and I found that last Friday morning (05:08:04) I had a ton of:

                    192.168.1.252 - kernel: wan_vip18: link state changed to DOWN
                    192.168.1.252 - kernel: wan_vip18: MASTER -> BACKUP (more frequent advertisement received)
                    192.168.1.253 - kernel: wan_vip18: link state changed to UP
                    192.168.1.253 - kernel: wan_vip18: BACKUP -> MASTER (preempting a slower master)

                    This happened within seconds of a cron job (which may be unrelated) that ran /usr/local/bin/vnstat -u at 05:08:00. Incidentally, that command doesn't actually exist. I'm going to remove it from cron to see if the issue goes away, but my hopes aren't high.
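Digging these transitions out of a large syslog archive by hand is tedious, so here is a minimal sketch of how it could be automated. It assumes lines shaped like the four quoted above ("host - kernel: vip: OLD -> NEW (reason)"); a real Kiwi export will probably carry timestamps and other prefixes, which is why the pattern is searched for anywhere in the line. The names `TRANSITION`, `Transition`, and `find_transitions` are made up for illustration.

```python
import re
from typing import NamedTuple

# Assumed line shape, based only on the log excerpt quoted above.
TRANSITION = re.compile(
    r"(?P<host>\S+) - kernel: (?P<vip>\S+): "
    r"(?P<old>MASTER|BACKUP|INIT) -> (?P<new>MASTER|BACKUP|INIT)"
    r"(?: \((?P<reason>[^)]*)\))?"
)


class Transition(NamedTuple):
    host: str
    vip: str
    old: str
    new: str
    reason: str


def find_transitions(lines):
    """Return every CARP state change found in the given log lines."""
    found = []
    for line in lines:
        m = TRANSITION.search(line)
        if m:
            found.append(
                Transition(m["host"], m["vip"], m["old"], m["new"], m["reason"] or "")
            )
    return found


sample = [
    "192.168.1.252 - kernel: wan_vip18: link state changed to DOWN",
    "192.168.1.252 - kernel: wan_vip18: MASTER -> BACKUP (more frequent advertisement received)",
    "192.168.1.253 - kernel: wan_vip18: link state changed to UP",
    "192.168.1.253 - kernel: wan_vip18: BACKUP -> MASTER (preempting a slower master)",
]

for t in find_transitions(sample):
    print(f"{t.host} {t.vip}: {t.old} -> {t.new} ({t.reason})")
```

Feeding it a whole day's log and sorting the result by VIP makes it easier to see which interface demoted first.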

                    • mcampbell

                      Good luck, Jason. How often does it happen to you? It's been really sporadic for me (1-2 times a month, but seemingly always at the worst possible time), so I don't even have any recent logs to look at right now.

                      I looked in /etc/crontab on pfsense01, and didn't see anything like that entry in mine.  This is the entirety of my crontab:

                      0       *       *       *       *       root    /usr/bin/nice -n20 newsyslog
                      1,31    0-5     *       *       *       root    /usr/bin/nice -n20 adjkerntz -a
                      1       3       1       *       *       root    /usr/bin/nice -n20 /etc/rc.update_bogons.sh
                      */60    *       *       *       *       root    /usr/bin/nice -n20 /usr/local/sbin/expiretable -v -t 3600 sshlockout
                      1       1       *       *       *       root    /usr/bin/nice -n20 /etc/rc.dyndns.update
                      */60    *       *       *       *       root    /usr/bin/nice -n20 /usr/local/sbin/expiretable -v -t 3600 virusprot
                      30      12      *       *       *       root    /usr/bin/nice -n20 /etc/rc.update_urltables
                      0       */24    *       *       *       root    /etc/rc.backup_rrd.sh
                      0       */24    *       *       *       root    /etc/rc.backup_dhcpleases.sh
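Since a cron entry pointing at a nonexistent binary came up, here is a rough sketch of how a system crontab like the one above could be checked for commands that are not on disk. It assumes the /etc/crontab layout shown (five time fields, a user, then the command) and only looks at tokens that are absolute paths, so a path passed as an argument would be checked too. The function names are made up, and the schedule on the vnstat line is hypothetical because the thread only gives the time the job ran.

```python
import os


def absolute_commands(crontab_line):
    """Return absolute paths named in the command part of a system crontab line."""
    line = crontab_line.strip()
    if not line or line.startswith("#"):
        return []
    fields = line.split(None, 6)  # min hour dom mon dow user command
    if len(fields) < 7:
        return []  # e.g. environment assignments such as SHELL=/bin/sh
    return [tok for tok in fields[6].split() if tok.startswith("/")]


def missing_commands(lines, exists=os.path.exists):
    """Return absolute paths referenced by the crontab that do not exist."""
    missing = []
    for line in lines:
        for path in absolute_commands(line):
            if not exists(path) and path not in missing:
                missing.append(path)
    return missing


crontab = [
    "0       *       *       *       *       root    /usr/bin/nice -n20 newsyslog",
    "1       1       *       *       *       root    /usr/bin/nice -n20 /etc/rc.dyndns.update",
    # Hypothetical schedule; only the command itself comes from the thread.
    "8       5       *       *       *       root    /usr/local/bin/vnstat -u",
]

# Simulated filesystem for the demo: pretend only vnstat is absent.
print(missing_commands(crontab, exists=lambda p: p != "/usr/local/bin/vnstat"))
```

On the firewall itself, the `exists` argument would be left at its default so the real filesystem is consulted.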
                      
                      
                      • jasonlitka

                        About the same. A couple of times per month. I typically don't notice until someone comes to me and says something isn't working. Usually it's an IPsec tunnel: when you change something on the master it replicates to the backup, but the config isn't applied there until the services are restarted.

                        • mcampbell

                          Heh, that is precisely how I find out too :) In my case, the first time it happened I did see problems with our IPsec site-to-site VPN connection, but it has worked fine in subsequent failovers. Since then the problem has been PPTP: a VPN user connects to the office and can reach anything they want inside the office, but can't connect to anything outside of it. Everything else works fine, and people inside the office don't notice any problems, just VPN users.

                          I have been working on integrating a Nagios NRPE plugin I found that will tell me whether or not pfsense01 is the master in the CARP layout, hopefully giving me the edge in tracking this mug down.
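This is not the plugin mentioned above, just a minimal sketch of what such a check could look like. It assumes `ifconfig` prints a line of the form "carp: MASTER vhid 2 advbase 1 advskew 0" for each VIP, which may differ between pfSense and FreeBSD versions, and it uses the standard Nagios exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The function names and the sample output are made up for illustration.

```python
import re

OK, CRITICAL, UNKNOWN = 0, 2, 3

# Assumed ifconfig line format; verify against the actual firewall output.
CARP_LINE = re.compile(r"carp: (MASTER|BACKUP|INIT) vhid (\d+)")


def carp_states(ifconfig_output):
    """Map VHID -> state for every 'carp:' line in the ifconfig output."""
    return {int(vhid): state for state, vhid in CARP_LINE.findall(ifconfig_output)}


def check_master(ifconfig_output, expected_vhids):
    """Return (exit_code, message) in the style of a Nagios plugin."""
    states = carp_states(ifconfig_output)
    missing = [v for v in expected_vhids if v not in states]
    if missing:
        return UNKNOWN, f"UNKNOWN - no CARP state found for vhid(s) {missing}"
    not_master = {v: states[v] for v in expected_vhids if states[v] != "MASTER"}
    if not_master:
        return CRITICAL, f"CRITICAL - not MASTER: {not_master}"
    return OK, "OK - MASTER for vhid(s) " + ", ".join(map(str, expected_vhids))


# Made-up sample output in the assumed format: vhid 3 has failed over.
sample = """\
carp: MASTER vhid 2 advbase 1 advskew 0
carp: BACKUP vhid 3 advbase 1 advskew 0
carp: MASTER vhid 4 advbase 1 advskew 0
"""

code, message = check_master(sample, [2, 3, 4])
print(code, message)
```

In a real plugin the text would come from something like `subprocess.run(["ifconfig"], capture_output=True, text=True).stdout`, and the script would finish with `sys.exit(code)` so that NRPE can pick up the status.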

                          • jasonlitka

                            How many CARP IPs do you have set up? Just the 3 in your original post?

                            Under the assumption that maybe the volume of CARP traffic was causing issues, I trimmed down my config from 58 CARP IPs to 6 CARP IPs plus 52 IP Aliases. I'm not sure if this will make a difference, but it sure speeds up the failover from one node to the other.
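As a rough back-of-the-envelope check on that traffic-volume theory, the sketch below estimates how many advertisements per second the master sends before and after the change. It assumes every CARP IP has its own VHID with its own advertisement stream, that IP Aliases stacked on a CARP IP add none, and that the master uses base 1 / skew 0 as described earlier in the thread. The function name is made up.

```python
def adverts_per_second(vhids: int, advbase: int = 1, advskew: int = 0) -> float:
    """Advertisements per second sent by a master holding `vhids` CARP IPs."""
    return vhids / (advbase + advskew / 256)


before = adverts_per_second(58)  # 58 CARP IPs
after = adverts_per_second(6)    # 6 CARP IPs + 52 IP Aliases

print(f"before: ~{before:.0f} advertisements/s")
print(f"after:  ~{after:.0f} advertisements/s")
```

Under these assumptions the master goes from about 58 advertisements per second to about 6, and a failover only has to move 6 state machines instead of 58.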

                            • mcampbell

                              Yeah, just the 3 (+ the dedicated CARP interface that I forgot to mention).

                              58 CARP IPs? Wow, it's good to know that pfSense can handle that many. I can see why trimming them would speed things up; I've gathered that a comparatively large amount of work takes place in the handoff between nodes, and doing it 58 times no doubt takes a while. With just my 3, I've not seen any noticeable delay. Heck, were it not for the issues I mentioned with the PPTP server, I might never have noticed it had switched (there wasn't any noticeable downtime when it switched for no apparent reason).

                              • mcampbell

                                Is there anyone else who might have had any luck or insight into this issue?  Months later, and I'm still not much closer to solving the issue than when I first posted the topic.  I've ended up rebooting pfsense01 anywhere from once every couple of weeks to a couple of times a week.  I'd really be interested in anyone's theories on what the problem might be….

                                • jasonlitka

                                  My problem went away when I switched the bulk of the IPs over from CARP to IP Aliases on a single CARP per interface.  The failover, when necessary, happens much faster as well.

                                  • mcampbell

                                    Interesting. I'm glad you got your problem solved. But in my case, I've only got one IP assigned per CARP interface, with the CARP IPs already virtualized by necessity, so my setup is essentially the same as yours is now (minus 50 IP addresses).

                                    It seems like we had all the same symptoms, but a different (but probably related) cause.  I guess you never did figure out what caused it?

                                    • jasonlitka

                                      Have you tried different cables, NICs, and/or switches?

                                      • mcampbell

                                        It's a fair point, but which one do I change? When it fails over, all the VIPs fail over together, even though they're on different cables, NICs, and switches. (For the record, this is a production system, so I can't just go mucking about with trial-and-error steps.) The dedicated CARP interface is only a crossover cable going between the two interfaces, so there's no switch to worry about.

                                        The NICs are far more challenging to change out though, as each node is a Mini-ITX setup with 5 onboard NICs.

                                        • jasonlitka

                                          You can probably swap out the cables without anyone noticing.  Do the backup box first, then disable CARP on the primary and change those too.

                                          If your NICs are all built in then I'd probably go to the switch next.  You may just have to declare a maintenance window on that one.
