Master stops sending CARP advertisements after HTTP-based load test
-
pfSense 2.0.3-RELEASE (i386)
built on Fri Apr 12 10:22:57 EDT 2013
FreeBSD 8.1-RELEASE-p13We have 2 pfSense boxes (both Dell 1750s) configured in an active/passive scenario with CARP used on all interfaces (except the pfsync interface). Each box has 3 physical interfaces:
em0: pfsync
bge0: WAN, 100Mb link connected to a fiber converter
bge1: VLAN trunk with 3 VLANs attached, 1, 10 and 20Ran into a situation today where it appears that the primary has stopped sending CARP advertisements on all interfaces. This happened about the same time we started a load test via BrowserMob to one of our HTTP servers behind a CARP-based 1:1 NAT on the WAN interface. At this time the load test has long since finished but our secondary pfSense box (pfsense02) is still showing as "MASTER" on all CARP interfaces while the primary (pfsense01) is showing as "BACKUP". I've verified the advskew is still correctly set on both boxes (set to 0 on primary, 100 on secondary). While doing a tcpdump on both pfSense boxes across all networks I now only see the secondary sending CARP advertisements:
(pfsense01 is 192.168.1.11, pfsense02 is 192.168.1.12):
[2.0.3-RELEASE][admin@pfsense01]/root(23): tcpdump -nlibge1_vlan1 vrrp tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on bge1_vlan1, link-type EN10MB (Ethernet), capture size 96 bytes 17:23:46.571651 IP 192.168.1.12 > 224.0.0.18: VRRPv2, Advertisement, vrid 2, prio 100, authtype none, intvl 1s, length 36 17:23:47.963499 IP 192.168.1.12 > 224.0.0.18: VRRPv2, Advertisement, vrid 2, prio 100, authtype none, intvl 1s, length 36 17:23:49.355248 IP 192.168.1.12 > 224.0.0.18: VRRPv2, Advertisement, vrid 2, prio 100, authtype none, intvl 1s, length 36
[2.0.3-RELEASE][admin@pfsense02]/root(6): tcpdump -nlibge1_vlan1 vrrp tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on bge1_vlan1, link-type EN10MB (Ethernet), capture size 96 bytes 17:24:00.358484 IP 192.168.1.12 > 224.0.0.18: VRRPv2, Advertisement, vrid 2, prio 100, authtype none, intvl 1s, length 36 17:24:01.750057 IP 192.168.1.12 > 224.0.0.18: VRRPv2, Advertisement, vrid 2, prio 100, authtype none, intvl 1s, length 36 17:24:03.141663 IP 192.168.1.12 > 224.0.0.18: VRRPv2, Advertisement, vrid 2, prio 100, authtype none, intvl 1s, length 36
Any idea why this may be happening and what I can do besides a reboot of the master to get it to advertise again? We didn't observe any network outage during the CARP failover (which is good) but I'm now concerned if I lose the backup we'll lose connectivity. Ideally I'd like to stop this from happening at all if we undergo another load-test.
-
Doing some more digging it appears a watchdog timeout may have been the initial culprit:
pfsense01:/var/log/system.log:
Jun 10 13:28:05 pfsense01 kernel: vip36: MASTER -> BACKUP (more frequent advertisement received) Jun 10 13:28:05 pfsense01 kernel: vip36: link state changed to DOWN Jun 10 13:28:05 pfsense01 kernel: vip4: MASTER -> BACKUP (more frequent advertisement received) Jun 10 13:28:05 pfsense01 kernel: vip4: link state changed to DOWN Jun 10 13:28:05 pfsense01 kernel: vip3: MASTER -> BACKUP (more frequent advertisement received) Jun 10 13:28:05 pfsense01 kernel: vip3: link state changed to DOWN Jun 10 13:28:05 pfsense01 kernel: vip11: MASTER -> BACKUP (more frequent advertisement received) Jun 10 13:28:05 pfsense01 kernel: vip11: link state changed to DOWN Jun 10 13:28:05 pfsense01 kernel: vip2: MASTER -> BACKUP (more frequent advertisement received) Jun 10 13:28:05 pfsense01 kernel: vip2: link state changed to DOWN Jun 10 13:28:08 pfsense01 kernel: bge0: watchdog timeout -- resetting Jun 10 13:28:08 pfsense01 check_reload_status: Linkup starting bge0 Jun 10 13:28:08 pfsense01 kernel: vip1: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip5: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip6: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip7: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip8: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip9: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip12: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip14: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip15: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip16: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip17: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip18: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip20: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip21: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip22: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip23: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip24: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip25: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip26: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip27: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip28: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip29: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip31: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip32: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip33: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip34: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip35: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip37: INIT -> BACKUP Jun 10 13:28:08 pfsense01 kernel: vip38: INIT -> BACKUP
Any ideas? I know watchdog timeouts are about as nebulous as ever on FreeBSD but I'm hoping there's a sysctl or something that I can tweak to help prevent this. I've already verified hw.bge.allow_asf is set to 0:
[2.0.3-RELEASE][admin@pfsense01]/var/log(38): sysctl hw.bge hw.bge.allow_asf: 0 [2.0.3-RELEASE][admin@pfsense01]/var/log(35): sysctl -a | grep bge <5>bge1: link state changed to UP <5>bge1_vlan1: link state changed to UP <5>bge1_vlan100: link state changed to UP <5>bge1_vlan20: link state changed to UP <5>bge1_vlan10: link state changed to UP bge0: watchdog timeout -- resetting bge0: 2 link states coalesced <5>bge0: link state changed to UP bge1: watchdog timeout -- resetting <5>bge1: link state changed to DOWN <5>bge1_vlan1: link state changed to DOWN <5>bge1_vlan100: link state changed to DOWN <5>bge1_vlan20: link state changed to DOWN <5>bge1_vlan10: link state changed to DOWN <5>bge1: link state changed to UP <5>bge1_vlan1: link state changed to UP <5>bge1_vlan100: link state changed to UP <5>bge1_vlan20: link state changed to UP <5>bge1_vlan10: link state changed to UP bge0: watchdog timeout -- resetting bge0: 2 link states coalesced <5>bge0: link state changed to UP bge1: watchdog timeout -- resetting <5>bge1: link state changed to DOWN <5>bge1_vlan1: link state changed to DOWN <5>bge1_vlan100: link state changed to DOWN <5>bge1_vlan20: link state changed to DOWN <5>bge1_vlan10: link state changed to DOWN <5>bge1: link state changed to UP <5>bge1_vlan1: link state changed to UP <5>bge1_vlan100: link state changed to UP <5>bge1_vlan20: link state changed to UP <5>bge1_vlan10: link state changed to UP bge1: watchdog timeout -- resetting <5>bge1: link state changed to DOWN <5>bge1_vlan1: link state changed to DOWN <5>bge1_vlan100: link state changed to DOWN <5>bge1_vlan20: link state changed to DOWN <5>bge1_vlan10: link state changed to DOWN <5>bge1: link state changed to UP <5>bge1_vlan1: link state changed to UP <5>bge1_vlan100: link state changed to UP <5>bge1_vlan20: link state changed to UP <5>bge1_vlan10: link state changed to UP bge1: watchdog timeout -- resetting <5>bge1: link state changed to DOWN <5>bge1_vlan1: link state changed to DOWN <5>bge1_vlan100: link state changed to DOWN <5>bge1_vlan20: link state changed to DOWN <5>bge1_vlan10: link state changed to DOWN <5>bge1: link state changed to UP <5>bge1_vlan1: link state changed to UP <5>bge1_vlan100: link state changed to UP <5>bge1_vlan20: link state changed to UP <5>bge1_vlan10: link state changed to UP bge1: watchdog timeout -- resetting <5>bge1: link state changed to DOWN <5>bge1_vlan1: link state changed to DOWN <5>bge1_vlan100: link state changed to DOWN <5>bge1_vlan20: link state changed to DOWN <5>bge1_vlan10: link state changed to DOWN <5>bge1: link state changed to UP <5>bge1_vlan1: link state changed to UP <5>bge1_vlan100: link state changed to UP <5>bge1_vlan20: link state changed to UP <5>bge1_vlan10: link state changed to UP <6>arp: 192.168.1.64 moved from 1c:e6:2b:5f:e2:dd to 20:16:d8:5d:c2:34 on bge1_vlan1 <6>arp: 192.168.1.64 moved from 20:16:d8:5d:c2:34 to 1c:e6:2b:5f:e2:dd on bge1_vlan1 <6>arp: 192.168.1.38 moved from 74:2f:68:e9:0b:ba to 54:04:a6:4a:38:b5 on bge1_vlan1 <6>arp: 192.168.1.78 moved from 8c:70:5a:7b:39:1c to c4:85:08:1b:3b:a4 on bge1_vlan1 <6>arp: 192.168.1.78 moved from c4:85:08:1b:3b:a4 to 8c:70:5a:7b:39:1c on bge1_vlan1 bge1: watchdog timeout -- resetting <5>bge1: link state changed to DOWN <5>bge1_vlan1: link state changed to DOWN <5>bge1_vlan100: link state changed to DOWN <5>bge1_vlan20: link state changed to DOWN <5>bge1_vlan10: link state changed to DOWN <5>bge1: link state changed to UP <5>bge1_vlan1: link state changed to UP <5>bge1_vlan100: link state changed to UP <5>bge1_vlan20: link state changed to UP <5>bge1_vlan10: link state changed to UP bge1: watchdog timeout -- resetting <5>bge1: link state changed to DOWN <5>bge1_vlan1: link state changed to DOWN <5>bge1_vlan100: link state changed to DOWN <5>bge1_vlan20: link state changed to DOWN <5>bge1_vlan10: link state changed to DOWN <5>bge1: link state changed to UP <5>bge1_vlan1: link state changed to UP <5>bge1_vlan100: link state changed to UP <5>bge1_vlan20: link state changed to UP <5>bge1_vlan10: link state changed to UP bge0: watchdog timeout -- resetting bge0: 2 link states coalesced <5>bge0: link state changed to UP <6>arp: 192.168.1.78 moved from 8c:70:5a:7b:39:1c to c4:85:08:1b:3b:a4 on bge1_vlan1 <6>arp: 192.168.1.78 moved from c4:85:08:1b:3b:a4 to 8c:70:5a:7b:39:1c on bge1_vlan1 bge0: watchdog timeout -- resetting bge0: 2 link states coalesced <5>bge0: link state changed to UP hw.bge.allow_asf: 0 dev.bge.0.%desc: Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x002002 dev.bge.0.%driver: bge dev.bge.0.%location: slot=0 function=0 dev.bge.0.%pnpinfo: vendor=0x14e4 device=0x1648 subvendor=0x1028 subdevice=0x014a class=0x020000 dev.bge.0.%parent: pci2 dev.bge.0.forced_collapse: 0 dev.bge.0.stats.FramesDroppedDueToFilters: 0 dev.bge.0.stats.DmaWriteQueueFull: 0 dev.bge.0.stats.DmaWriteHighPriQueueFull: 0 dev.bge.0.stats.NoMoreRxBDs: 0 dev.bge.0.stats.InputDiscards: 0 dev.bge.0.stats.InputErrors: 1 dev.bge.0.stats.RecvThresholdHit: 325594 dev.bge.0.stats.DmaReadQueueFull: 505 dev.bge.0.stats.DmaReadHighPriQueueFull: 0 dev.bge.0.stats.SendDataCompQueueFull: 428 dev.bge.0.stats.RingSetSendProdIndex: 250750 dev.bge.0.stats.RingStatusUpdate: 467063 dev.bge.0.stats.Interrupts: 467063 dev.bge.0.stats.AvoidedInterrupts: 0 dev.bge.0.stats.SendThresholdHit: 0 dev.bge.0.stats.rx.Octets: 77286738 dev.bge.0.stats.rx.Fragments: 0 dev.bge.0.stats.rx.UcastPkts: 250315 dev.bge.0.stats.rx.MulticastPkts: 324548 dev.bge.0.stats.rx.FCSErrors: 1 dev.bge.0.stats.rx.AlignmentErrors: 0 dev.bge.0.stats.rx.xonPauseFramesReceived: 0 dev.bge.0.stats.rx.xoffPauseFramesReceived: 0 dev.bge.0.stats.rx.ControlFramesReceived: 0 dev.bge.0.stats.rx.xoffStateEntered: 0 dev.bge.0.stats.rx.FramesTooLong: 0 dev.bge.0.stats.rx.Jabbers: 0 dev.bge.0.stats.rx.UndersizePkts: 0 dev.bge.0.stats.rx.inRangeLengthError: 0 dev.bge.0.stats.rx.outRangeLengthError: 0 dev.bge.0.stats.tx.Octets: 21715826 dev.bge.0.stats.tx.Collisions: 0 dev.bge.0.stats.tx.XonSent: 0 dev.bge.0.stats.tx.XoffSent: 0 dev.bge.0.stats.tx.flowControlDone: 0 dev.bge.0.stats.tx.InternalMacTransmitErrors: 0 dev.bge.0.stats.tx.SingleCollisionFrames: 0 dev.bge.0.stats.tx.MultipleCollisionFrames: 0 dev.bge.0.stats.tx.DeferredTransmissions: 0 dev.bge.0.stats.tx.ExcessiveCollisions: 0 dev.bge.0.stats.tx.LateCollisions: 0 dev.bge.0.stats.tx.UcastPkts: 251179 dev.bge.0.stats.tx.MulticastPkts: 88 dev.bge.0.stats.tx.BroadcastPkts: 14 dev.bge.0.stats.tx.CarrierSenseErrors: 0 dev.bge.0.stats.tx.Discards: 0 dev.bge.0.stats.tx.Errors: 0 dev.bge.1.%desc: Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x002002 dev.bge.1.%driver: bge dev.bge.1.%location: slot=0 function=1 dev.bge.1.%pnpinfo: vendor=0x14e4 device=0x1648 subvendor=0x1028 subdevice=0x014a class=0x020000 dev.bge.1.%parent: pci2 dev.bge.1.forced_collapse: 0 dev.bge.1.stats.FramesDroppedDueToFilters: 0 dev.bge.1.stats.DmaWriteQueueFull: 288263 dev.bge.1.stats.DmaWriteHighPriQueueFull: 0 dev.bge.1.stats.NoMoreRxBDs: 0 dev.bge.1.stats.InputDiscards: 5229 dev.bge.1.stats.InputErrors: 647 dev.bge.1.stats.RecvThresholdHit: 494125434 dev.bge.1.stats.DmaReadQueueFull: 37978 dev.bge.1.stats.DmaReadHighPriQueueFull: 2 dev.bge.1.stats.SendDataCompQueueFull: 21 dev.bge.1.stats.RingSetSendProdIndex: 860426035 dev.bge.1.stats.RingStatusUpdate: 783311166 dev.bge.1.stats.Interrupts: 783311166 dev.bge.1.stats.AvoidedInterrupts: 0 dev.bge.1.stats.SendThresholdHit: 0 dev.bge.1.stats.rx.Octets: 1583585081 dev.bge.1.stats.rx.Fragments: 73 dev.bge.1.stats.rx.UcastPkts: 786142776 dev.bge.1.stats.rx.MulticastPkts: 395883 dev.bge.1.stats.rx.FCSErrors: 577 dev.bge.1.stats.rx.AlignmentErrors: 0 dev.bge.1.stats.rx.xonPauseFramesReceived: 3438 dev.bge.1.stats.rx.xoffPauseFramesReceived: 147597 dev.bge.1.stats.rx.ControlFramesReceived: 0 dev.bge.1.stats.rx.xoffStateEntered: 0 dev.bge.1.stats.rx.FramesTooLong: 0 dev.bge.1.stats.rx.Jabbers: 0 dev.bge.1.stats.rx.UndersizePkts: 0 dev.bge.1.stats.rx.inRangeLengthError: 0 dev.bge.1.stats.rx.outRangeLengthError: 0 dev.bge.1.stats.tx.Octets: 181809985 dev.bge.1.stats.tx.Collisions: 0 dev.bge.1.stats.tx.XonSent: 0 dev.bge.1.stats.tx.XoffSent: 0 dev.bge.1.stats.tx.flowControlDone: 0 dev.bge.1.stats.tx.InternalMacTransmitErrors: 0 dev.bge.1.stats.tx.SingleCollisionFrames: 0 dev.bge.1.stats.tx.MultipleCollisionFrames: 0 dev.bge.1.stats.tx.DeferredTransmissions: 0 dev.bge.1.stats.tx.ExcessiveCollisions: 0 dev.bge.1.stats.tx.LateCollisions: 0 dev.bge.1.stats.tx.UcastPkts: 857839350 dev.bge.1.stats.tx.MulticastPkts: 2032515 dev.bge.1.stats.tx.BroadcastPkts: 21415 dev.bge.1.stats.tx.CarrierSenseErrors: 0 dev.bge.1.stats.tx.Discards: 0 dev.bge.1.stats.tx.Errors: 0 dev.miibus.0.%parent: bge0 dev.miibus.1.%parent: bge1