Netgate Discussion Forum

    CARP spontaneous failover

    HA/CARP/VIPs · 11 Posts · 3 Posters · 7.3k Views
      sullrich:

      What do you mean by tuning recommendations?  Do you have polling turned on?  If so, try turning it back off and rerun the same CARP debugging steps.

        lamont:

        By tuning I mean that I followed the polling/tuning instructions here:

        http://wiki.pfsense.com/wikka.php?wakka=Tuning

        I ran for a while without polling and the poor Soekris boxes (Soekris net4801) were unable to deal with the load.  If I disable polling, I believe the boxes will be unable to deal with more than about 4 Mbit/s of traffic.  We had the same failover problem without polling enabled, though.

        Somewhere else on the wiki I believe I saw that the WRAP/Soekris boards were capable of dealing with 20 Mbit/s of traffic, so I didn't think our 8 Mbit/s or so would stress them too much.
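For context, device polling on the FreeBSD 6.x base that pfSense 1.x uses is controlled by sysctl knobs plus a per-interface flag. A minimal sketch, assuming a sis(4) interface named sis0 (the interface name and values here are illustrative, not taken from this setup):

```shell
# Hedged sketch: enable device polling on FreeBSD 6.x (run as root).
# The knob names match the sysctl output posted later in this thread;
# sis0 is an assumed interface name.
sysctl kern.polling.enable=1      # turn polling on globally
sysctl kern.polling.user_frac=5   # % of each tick reserved for userland
ifconfig sis0 polling             # per-interface polling (driver must support it)
```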

          sullrich:

          If the boxes are underpowered, they are underpowered.  My guess is that the tuning you have done has somehow interfered with CARP advertisements.

            cmb:

            I don't believe a 4801 should have any problem with < 10 Mbit/s. They do seem to be slightly slower than WRAP boards for some reason, but not drastically.

            lamont: what happens when you think the boxes are overloaded?

            If they actually are overloaded, it's possible, given the way they react under heavy load, that a traffic spike could cause CARP to not send advertisements for 4 seconds.
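To put a number on that window: a CARP master advertises roughly every advbase + advskew/256 seconds, and a backup typically declares the master dead after about three base intervals plus its skew. A hedged back-of-envelope with the default advbase of 1 second and an assumed backup advskew of 100 (both assumptions, not values from this setup):

```shell
# Back-of-envelope CARP failover window (assumed values, not measured):
# advbase=1 s, backup advskew=100.
advbase=1
advskew=100
awk -v b="$advbase" -v s="$advskew" \
    'BEGIN { printf "master presumed dead after %.2f s of silence\n", 3*b + s/256 }'
```

Under those assumptions the detection window is about 3.4 seconds, so a burst that stalls advertisements for 4 seconds would be enough to trigger exactly this kind of spontaneous failover.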

              lamont:

              During periods of high load I see packet loss and jitter.  Right now, for example, my ping times to the internal CARP interface range from 0.5 ms to 80 ms, but when the box becomes highly loaded the pings can take 800 ms or drop around 5% of packets.

              Pinging through the box from an internal device to the external gateway I see:

              64 bytes from 208.96.47.161: icmp_seq=8 ttl=254 time=3.134 ms
              64 bytes from 208.96.47.161: icmp_seq=9 ttl=254 time=9.953 ms
              64 bytes from 208.96.47.161: icmp_seq=10 ttl=254 time=15.280 ms
              64 bytes from 208.96.47.161: icmp_seq=11 ttl=254 time=1029.436 ms
              64 bytes from 208.96.47.161: icmp_seq=16 ttl=254 time=6.252 ms
              64 bytes from 208.96.47.161: icmp_seq=17 ttl=254 time=10.174 ms
              64 bytes from 208.96.47.161: icmp_seq=18 ttl=254 time=7.520 ms
              64 bytes from 208.96.47.161: icmp_seq=19 ttl=254 time=3.830 ms
              64 bytes from 208.96.47.161: icmp_seq=20 ttl=254 time=5.219 ms

              That's just under high load.  But even under normal load, I still see the CARP failover happen, and it takes a while to fail back.

                sullrich:

                Either way, my suggestion is to beef up the primary CARP box a bit.

                  cmb:

                  What's your CPU and network usage when you start seeing loss and high jitter? Check CPU usage with polling disabled; when polling is enabled, the CPU figure is meaningless.

                  Under Status > Interfaces, do you see any errors or anything out of the ordinary?

                    lamont:

                    I'm afraid to disable polling, as the box goes crazy with interrupts if I do.  The interfaces themselves look fine, with no collisions, errors, or anything else unusual.  netstat -m also seems reasonable, with no mbuf allocation failures or other signs of buffer pressure.

                    The only thing funny is the number of suspect or lost polls in the polling statistics available via sysctl:

                    sysctl -a | grep polling

                    kern.polling.idlepoll_sleeping: 0
                    kern.polling.stalled: 5262
                    kern.polling.suspect: 15177118
                    kern.polling.phase: 0
                    kern.polling.enable: 1
                    kern.polling.handlers: 5
                    kern.polling.residual_burst: 0
                    kern.polling.pending_polls: 0
                    kern.polling.lost_polls: 66215948
                    kern.polling.short_ticks: 1
                    kern.polling.reg_frac: 50
                    kern.polling.user_frac: 5
                    kern.polling.idle_poll: 1
                    kern.polling.each_burst: 240
                    kern.polling.burst_max: 1000
                    kern.polling.burst: 26

                    According to a post from the polling code's developer (the only information I could find outside of the source itself), lost_polls are nothing to worry about.  Suspect polls, though, mark situations that would previously have resulted in a deadlock: the developer added a check that detects the condition, works around it, and increments the suspect counter.  I have no idea what it is really indicative of.  Since these are Soekris 4801s, I'm using the sis driver.  Of possible interest: each board has 5 sis interfaces (3 onboard plus 2 hanging off a special-built PCI card).

                    One issue we're looking at is the 10 Mbit/s WAN uplink.  It's possible that the link is overloaded, causing retransmissions and buffers to fill up on the pfSense box itself.  We're going to upgrade that to a 100 Mbit/s link and see if that helps with some of our packet loss issues.  (We see 5-minute average usage as high as 8 Mbit/s, and it stands to reason that if the 5-minute average was 8 Mbit/s, then we attempted to exceed 10 Mbit/s at some points during that period.)
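That reasoning can be made concrete with a rough peak-to-mean estimate. Assuming, purely as an illustration and not a measurement, that bursty traffic peaks at about 1.5x its 5-minute average:

```shell
# Hedged back-of-envelope: a 5-minute average of 8 Mbit/s on a 10 Mbit/s
# link leaves no headroom if bursts run ~1.5x the average (assumed ratio).
awk 'BEGIN {
    avg = 8; link = 10; ratio = 1.5   # Mbit/s, Mbit/s, assumed burstiness
    peak = avg * ratio
    printf "estimated burst %.0f Mbit/s vs %d Mbit/s link\n", peak, link
}'
```

Anything over the line rate gets queued or dropped, which would show up as exactly the kind of loss and jitter in the ping output earlier in the thread.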

                      cmb:

                      Well, from the sounds of it, it's probably pegged at 100% CPU if you turn off polling.

                      We haven't really done extensive testing on 4801s because we don't have many of them. Soekris hasn't contributed any hardware, but PC Engines has quite a bit, so we use a lot of WRAP hardware. You seem to know what you're talking about; this seems to be some sort of legitimate scalability issue on 4801 hardware. Or maybe using CARP slows things down that much.

                      Do you have the latest BIOS on the 4801s?

                      Other than that, I can't think of anything but upgrading the hardware.

                      What do your interface graphs look like? That should tell you your link utilization.

                        lamont:

                        It's possible that there is a Soekris issue.  This pair is in production, but I have another 4801 at home running m0n0wall that I'll upgrade to pfSense 1.0.1 and test with iperf to see if I can generate similar issues with polling enabled and disabled.  Then I'll upgrade to the 1.2 snapshot and see if the upgrade of the base OS from FreeBSD 6.1 to 6.2 fixes any polling/performance issues.
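A throughput test like the one planned above might look roughly like this (iperf 2.x syntax; the address is a placeholder, not from this setup):

```shell
# Hedged sketch of the iperf throughput test (iperf 2.x).
# 192.0.2.10 is a placeholder for the host behind the firewall under test.
# On the host behind the pfSense box:
iperf -s                          # TCP server on port 5001
# On a host on the other side of the firewall:
iperf -c 192.0.2.10 -t 30 -i 5    # push traffic for 30 s, report every 5 s
# Repeat with polling on and off to compare sustained throughput.
```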

                        I'm running the latest BIOS that I know of, as these boxes were only purchased about 2 months ago.

                        Thanks for your attention in this matter.  I'll report back if I can find anything useful.

                        Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.