CARP spontaneous failover

lamont

Hi guys. I'm running pfSense 1.0.1 on two Soekris/WRAP boxen in production. They've been doing great, and after I took the tuning recommendations our performance issues seem to have gone away and we can do about the full 10 mbit/s that our ISP provides to our cabinet. (the pfSense boxes are doing firewall but no NAT for about 30 hosts on two networks)

The issue that I'm running into is spontaneous failover from the MASTER device to the BACKUP device. I increased the net.inet.carp.log to 2 so that I could see the events on the master, and also started monitoring the LAN interface for CARP announcements. What I'm seeing is that the master machine normally sends out an announcement once a second, but occasionally seems to either freeze up or not send out traffic for 4 seconds or so, then the backup takes over. Then the backup server runs for at least half an hour if not a few hours, and the master takes back over.

One thing I've noticed is that carp2 seems to fail first. Would that possibly indicate something funny was happening on that network/interface and causing everything else to fail over? I also see buffer space exhaustion from bsnmpd, which might indicate other problems.

Here's some sysctl info:

sysctl -a |grep carp

net.inet.ip.same_prefix_carp_only: 0
net.inet.carp.allow: 1
net.inet.carp.preempt: 1
net.inet.carp.log: 2
net.inet.carp.arpbalance: 0
net.inet.carp.suppress_preempt: 0

Some stuff from the logs:

Apr 16 13:00:12 right snmpd[710]: sendto: No buffer space available
Apr 16 13:00:27 right last message repeated 15 times
Apr 16 22:19:48 right kernel: carp2: MASTER -> BACKUP (more frequent advertisement received)
Apr 16 22:19:48 right kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
Apr 16 22:19:49 right kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Apr 16 22:29:51 right kernel: arp_rtrequest: bad gateway XXX.YYY.51.1 (!AF_LINK)
Apr 16 22:29:51 right kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Apr 16 22:45:03 right kernel: arp_rtrequest: bad gateway XXX.YYY.51.1 (!AF_LINK)
Apr 16 22:45:04 right kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Apr 16 22:59:52 right kernel: arp_rtrequest: bad gateway XXX.YYY.51.1 (!AF_LINK)
Apr 16 22:59:52 right kernel: arp_rtrequest: bad gateway XXX.YYY.51.65 (!AF_LINK)
Apr 16 22:59:52 right kernel: carp1: BACKUP -> MASTER (preempting a slower master)
Apr 16 22:59:52 right kernel: arp_rtrequest: bad gateway XXX.YYY.47.164 (!AF_LINK)
Apr 17 10:15:05 right snmpd[710]: sendto: No buffer space available
Apr 17 10:15:13 right last message repeated 7 times

and finally, some useful info from tcpdump. You can see the 4 second+ pause at 18:41:24:

65757 18:41:22.930387 XXX.YYY.51.2 -> 224.0.0.18 VRRP Announcement
(v2)
65758 18:41:23.959768 XXX.YYY.51.2 -> 224.0.0.18 VRRP Announcement
(v2)
65759 18:41:24.986277 XXX.YYY.51.2 -> 224.0.0.18 VRRP Announcement
(v2)
65760 18:41:28.383362 XXX.YYY.51.3 -> 224.0.0.18 VRRP Announcement
(v2)
65761 18:41:29.776296 XXX.YYY.51.3 -> 224.0.0.18 VRRP Announcement
(v2)
65762 18:41:31.171580 XXX.YYY.51.3 -> 224.0.0.18 VRRP Announcement
(v2)

Is there any additional debugging that I can do or any suggestions someone can offer?

Thank you very much.

sullrich

What do you mean by tuning recommendations? Do you have polling turned on? If so, try turning it back off and run the same carp debug methods.

lamont

By tuning I mean that I followed the polling/tuning instructions here:

http://wiki.pfsense.com/wikka.php?wakka=Tuning

I ran for a while without polling and the poor sokeris boxes (soekris 4801) were unable to deal with the load. If I disable polling, I believe the boxes will be unable to deal with more than about 4 mbit/s of traffic. We had the same failover problem without polling enabled though.

Somewhere else on the wiki I believe I saw that the wrap/soekris boards were capable of dealing with 20 mbit/s of traffic, so I didn't think our 8 or so would stress them too much.

sullrich

If the boxes are underpowered they are underpowered. My guess is that the tuning that you have done has somehow interfered with CARP advertising.

cmb

I don't believe a 4801 should have any problem with < 10 Mb. They do seem to be slightly slower than WRAP boards for some reason, but not drastically.

lamont: what happens when you think the boxes are overloaded?

If they actually are overloaded, it's possible with the way they react when under heavy load that a traffic spike could cause CARP to not send advertisements for 4 seconds.

lamont

During periods of high load I see packet loss and jitter. Right now, for example, my ping times to the internal CARP interface take from 0.5 ms to 80 ms, but when the box becomes highly loaded the pings can take 800 ms or drop around 5% of packets.

Pinging through the box from an internal device to the external gateway I see:

64 bytes from 208.96.47.161: icmp_seq=8 ttl=254 time=3.134 ms
64 bytes from 208.96.47.161: icmp_seq=9 ttl=254 time=9.953 ms
64 bytes from 208.96.47.161: icmp_seq=10 ttl=254 time=15.280 ms
64 bytes from 208.96.47.161: icmp_seq=11 ttl=254 time=1029.436 ms
64 bytes from 208.96.47.161: icmp_seq=16 ttl=254 time=6.252 ms
64 bytes from 208.96.47.161: icmp_seq=17 ttl=254 time=10.174 ms
64 bytes from 208.96.47.161: icmp_seq=18 ttl=254 time=7.520 ms
64 bytes from 208.96.47.161: icmp_seq=19 ttl=254 time=3.830 ms
64 bytes from 208.96.47.161: icmp_seq=20 ttl=254 time=5.219 ms

That's just under high load. but even under normal load, I still see the carp failover happen, and it takes a while to fail back.

sullrich

Either way my suggestion is to beef up the primary carp box a bit.

cmb

what's your CPU and network usage when you start seeing loss and high jitter? CPU usage when polling is disabled, when it's enabled it's meaningless.

On Status, Interfaces, any errors or anything look out of the ordinary?

lamont

I'm afraid to disable polling as the box goes crazy with interrupts if I do. The interfaces themselves look fine with no collisions, errors or otherwise. netstat -m seems reasonable with no memory errors or other buffer considerations.

The only thing funny is the number of suspect or lost polls in the polling results avail from sysctl:

sysctl -a |grep polling

kern.polling.idlepoll_sleeping: 0
kern.polling.stalled: 5262
kern.polling.suspect: 15177118
kern.polling.phase: 0
kern.polling.enable: 1
kern.polling.handlers: 5
kern.polling.residual_burst: 0
kern.polling.pending_polls: 0
kern.polling.lost_polls: 66215948
kern.polling.short_ticks: 1
kern.polling.reg_frac: 50
kern.polling.user_frac: 5
kern.polling.idle_poll: 1
kern.polling.each_burst: 240
kern.polling.burst_max: 1000
kern.polling.burst: 26

According to a post from the developer (the only information I could find outside of the source itself) lost_polls are nothing to worry about, but suspect polls are issues that would have resulted in a deadlock earlier, but the developer put a check in to mark where a deadlock would have occurred, fixes it, and increments the suspect_polls counter. I have no idea what it is really indicative of. Obviously, as I'm using the soekris 4801 I'm using the SIS driver. Of interest might be that I have the 5 SIS soekris board (3 onboard plus 2 hanging off a special built PCI card)

One issue we're looking at is the 10 mbit/s WAN uplink. It's possible that the link is overloaded, causing retransmissions and buffers to fill up on the pfSense box itself. We're going to upgrade that to a 100 mbit/s link and see if that helps with some of our packet loss issues. (we see 5 minute average usage as high as 8 mbit/s, and it stands to reason that if the 5 minute average was 8 mbit/s, then we attempted to exceed 10 mbit/s during that period)

cmb

Well, from the sounds of it, then it's probably pegged at 100% CPU if you turn off polling.

We haven't really done extensive testing on 4801's because we don't have many of them. Soekris hasn't contributed any hardware, but PC Engines has quite a bit so we use a lot of WRAP hardware. You seem to know what you're talking about, this seems to be some sort of legit scalability issue on 4801 hardware. Or maybe using CARP slows things down that much.

You have the latest BIOS on the 4801's?

Other than that, I can't think of anything but upgrading the hardware.

What do your interface graphs look like? That should tell you your link utilization.

lamont

It's possible that there is a soekris issue. This pair are in production, but I have another 4801 at home running m0n0wall that I'll upgrade to pfSense 1.0.1 and test with iPerf to see if I can generate similar issues with polling and non. Then I'll upgrade to the 1.2 snapshot and see if the upgrade of the base OS from 6.1 to 6.2 fixes any polling/performance issues.

I'm running the latest bios that I know of, as these boxes were only purchased about 2 months ago.

Thanks for your attention in this matter. I'll report back if I can find anything useful.