CARP spontaneous failover
-
What do you mean by tuning recommendations? Do you have polling turned on? If so, try turning it back off and run the same CARP debug methods.
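If it helps, a rough way to check both at once from a shell (the interface name sis0 here is just an example, and 112 is CARP's IP protocol number) is:
sysctl kern.polling.enable
tcpdump -npi sis0 ip proto 112
With tcpdump running on the box you expect to be master, you should see an advertisement roughly every second; multi-second gaps would line up with the backup promoting itself.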
-
By tuning I mean that I followed the polling/tuning instructions here:
http://wiki.pfsense.com/wikka.php?wakka=Tuning
I ran for a while without polling and the poor Soekris boxes (Soekris 4801s) were unable to deal with the load. If I disable polling, I believe the boxes will be unable to handle more than about 4 mbit/s of traffic. We had the same failover problem without polling enabled, though.
Somewhere else on the wiki I believe I saw that the WRAP/Soekris boards were capable of handling 20 mbit/s of traffic, so I didn't think our 8 mbit/s or so would stress them too much.
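For reference, the kinds of knobs that page deals with on a FreeBSD 6.x box look roughly like this (the values here are only illustrative, not the wiki's exact recommendations):
# /boot/loader.conf - clock tick rate that polling runs off of
kern.hz="1000"
# runtime polling sysctls (the current values show up later in this thread)
sysctl kern.polling.enable=1
sysctl kern.polling.each_burst=80
sysctl kern.polling.burst_max=300
sysctl kern.polling.user_frac=50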
-
If the boxes are underpowered they are underpowered. My guess is that the tuning that you have done has somehow interfered with CARP advertising.
-
I don't believe a 4801 should have any problem with < 10 Mb. They do seem to be slightly slower than WRAP boards for some reason, but not drastically.
lamont: what happens when you think the boxes are overloaded?
If they actually are overloaded, then given the way these boxes react under heavy load, it's possible that a traffic spike could keep CARP from sending advertisements for 4 seconds.
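A quick way to sanity-check the advertisement timing (the interface name carp0 is just an example) is:
ifconfig carp0
# look for a line like: carp: MASTER vhid 1 advbase 1 advskew 0
With the default advbase of 1, the backup only has to miss roughly three advertisement intervals before it promotes itself, so a few seconds of livelock on the master is enough to trigger a failover.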
-
During periods of high load I see packet loss and jitter. Right now, for example, my ping times to the internal CARP interface range from 0.5 ms to 80 ms, but when the box becomes highly loaded pings can take 800 ms, or around 5% of packets get dropped.
Pinging through the box from an internal device to the external gateway I see:
64 bytes from 208.96.47.161: icmp_seq=8 ttl=254 time=3.134 ms
64 bytes from 208.96.47.161: icmp_seq=9 ttl=254 time=9.953 ms
64 bytes from 208.96.47.161: icmp_seq=10 ttl=254 time=15.280 ms
64 bytes from 208.96.47.161: icmp_seq=11 ttl=254 time=1029.436 ms
64 bytes from 208.96.47.161: icmp_seq=16 ttl=254 time=6.252 ms
64 bytes from 208.96.47.161: icmp_seq=17 ttl=254 time=10.174 ms
64 bytes from 208.96.47.161: icmp_seq=18 ttl=254 time=7.520 ms
64 bytes from 208.96.47.161: icmp_seq=19 ttl=254 time=3.830 ms
64 bytes from 208.96.47.161: icmp_seq=20 ttl=254 time=5.219 ms
That's just under high load. But even under normal load, I still see the CARP failover happen, and it takes a while to fail back.
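I can pull the exact failover/failback times out of the circular system log (something like clog /var/log/system.log | grep -i carp, assuming the CARP state changes get logged there) if timestamps would help line them up with load.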
-
Either way, my suggestion is to beef up the primary CARP box a bit.
-
What's your CPU and network usage when you start seeing loss and high jitter? Check CPU usage with polling disabled; when polling is enabled, the CPU figure is meaningless.
Under Status > Interfaces, do any errors or anything else look out of the ordinary?
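From a shell on the box, something along these lines will show it (the CPU numbers only mean anything with polling off):
top -S            # CPU, including time spent in interrupts
vmstat -i         # interrupt rates per device
systat -ifstat 1  # live per-interface throughput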
-
I'm afraid to disable polling, as the box goes crazy with interrupts if I do. The interfaces themselves look fine, with no collisions, errors or anything else. netstat -m seems reasonable, with no memory errors or other buffer problems.
The only thing that looks funny is the number of suspect or lost polls in the polling results available from sysctl:
sysctl -a |grep polling
kern.polling.idlepoll_sleeping: 0
kern.polling.stalled: 5262
kern.polling.suspect: 15177118
kern.polling.phase: 0
kern.polling.enable: 1
kern.polling.handlers: 5
kern.polling.residual_burst: 0
kern.polling.pending_polls: 0
kern.polling.lost_polls: 66215948
kern.polling.short_ticks: 1
kern.polling.reg_frac: 50
kern.polling.user_frac: 5
kern.polling.idle_poll: 1
kern.polling.each_burst: 240
kern.polling.burst_max: 1000
kern.polling.burst: 26
According to a post from the developer (the only information I could find outside of the source itself), lost_polls are nothing to worry about. Suspect polls are cases that would have resulted in a deadlock in earlier versions; the developer added a check that marks where a deadlock would have occurred, fixes it up, and increments the suspect counter. I have no idea what it is really indicative of. Obviously, as I'm using the Soekris 4801, I'm using the sis driver. Of possible interest: this is the five-sis Soekris board (3 onboard plus 2 hanging off a specially built PCI card).
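One way to tell whether those counters are actually climbing during the bad periods, rather than being leftovers from some earlier event, is a quick loop like this:
while true; do
    sysctl kern.polling.suspect kern.polling.lost_polls
    sleep 5
done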
One issue we're looking at is the 10 mbit/s WAN uplink. It's possible that the link is overloaded, causing retransmissions and letting buffers fill up on the pfSense box itself. We're going to upgrade it to a 100 mbit/s link and see if that helps with some of our packet loss issues. (We see 5-minute average usage as high as 8 mbit/s, and it stands to reason that if the 5-minute average was 8 mbit/s, then we attempted to exceed 10 mbit/s at points during that period.)
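To check that at finer granularity than 5-minute averages (assuming sis0 is the WAN interface here), per-second byte counts from the box itself should show whether we are actually bumping into the ceiling:
netstat -I sis0 -w 1 -b
# roughly 1,250,000 bytes/sec in either direction means a saturated 10 mbit/s link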
-
Well, from the sounds of it, it's probably pegged at 100% CPU if you turn off polling.
We haven't really done extensive testing on 4801s because we don't have many of them. Soekris hasn't contributed any hardware, but PC Engines has contributed quite a bit, so we use a lot of WRAP hardware. You seem to know what you're talking about, so this looks like some sort of legitimate scalability issue on 4801 hardware. Or maybe using CARP slows things down that much.
Do you have the latest BIOS on the 4801s?
Other than that, I can't think of anything but upgrading the hardware.
What do your interface graphs look like? That should tell you your link utilization.
-
It's possible that there is a Soekris issue. This pair is in production, but I have another 4801 at home running m0n0wall that I'll upgrade to pfSense 1.0.1 and test with iperf to see whether I can generate similar issues with and without polling. Then I'll upgrade to the 1.2 snapshot and see whether moving the base OS from FreeBSD 6.1 to 6.2 fixes any polling/performance issues.
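The test itself can be something simple, roughly like this (the addresses are placeholders), run once with polling off and once with it on:
# on a host behind the 4801
iperf -s
# on a host on the other side, pushing traffic through the firewall
iperf -c 192.168.1.10 -t 60 -i 5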
I'm running the latest BIOS that I know of, as these boxes were only purchased about 2 months ago.
Thanks for your attention in this matter. I'll report back if I can find anything useful.