Master stops sending CARP advertisements after HTTP-based load test



  • pfSense  2.0.3-RELEASE (i386)
    built on Fri Apr 12 10:22:57 EDT 2013
    FreeBSD 8.1-RELEASE-p13

    We have 2 pfSense boxes (both Dell 1750s) configured in an active/passive scenario with CARP used on all interfaces (except the pfsync interface). Each box has 3 physical interfaces:

    em0: pfsync
    bge0: WAN, 100Mb link connected to a fiber converter
    bge1: VLAN trunk with 3 VLANs attached, 1, 10 and 20

    Ran into a situation today where it appears that the primary has stopped sending CARP advertisements on all interfaces. This happened about the same time we started a load test via BrowserMob to one of our HTTP servers behind a CARP-based 1:1 NAT on the WAN interface. At this time the load test has long since finished but our secondary pfSense box (pfsense02) is still showing as "MASTER" on all CARP interfaces while the primary (pfsense01) is showing as "BACKUP". I've verified the advskew is still correctly set on both boxes (set to 0 on primary, 100 on secondary). While doing a tcpdump on both pfSense boxes across all networks I now only see the secondary sending CARP advertisements:

    (pfsense01 is 192.168.1.11, pfsense02 is 192.168.1.12):

    [2.0.3-RELEASE][admin@pfsense01]/root(23): tcpdump -nlibge1_vlan1 vrrp
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on bge1_vlan1, link-type EN10MB (Ethernet), capture size 96 bytes
    17:23:46.571651 IP 192.168.1.12 > 224.0.0.18: VRRPv2, Advertisement, vrid 2, prio 100, authtype none, intvl 1s, length 36
    17:23:47.963499 IP 192.168.1.12 > 224.0.0.18: VRRPv2, Advertisement, vrid 2, prio 100, authtype none, intvl 1s, length 36
    17:23:49.355248 IP 192.168.1.12 > 224.0.0.18: VRRPv2, Advertisement, vrid 2, prio 100, authtype none, intvl 1s, length 36
    
    [2.0.3-RELEASE][admin@pfsense02]/root(6): tcpdump -nlibge1_vlan1 vrrp
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on bge1_vlan1, link-type EN10MB (Ethernet), capture size 96 bytes
    17:24:00.358484 IP 192.168.1.12 > 224.0.0.18: VRRPv2, Advertisement, vrid 2, prio 100, authtype none, intvl 1s, length 36
    17:24:01.750057 IP 192.168.1.12 > 224.0.0.18: VRRPv2, Advertisement, vrid 2, prio 100, authtype none, intvl 1s, length 36
    17:24:03.141663 IP 192.168.1.12 > 224.0.0.18: VRRPv2, Advertisement, vrid 2, prio 100, authtype none, intvl 1s, length 36
    
    

    Any idea why this may be happening and what I can do besides a reboot of the master to get it to advertise again? We didn't observe any network outage during the CARP failover (which is good) but I'm now concerned if I lose the backup we'll lose connectivity. Ideally I'd like to stop this from happening at all if we undergo another load-test.



  • Doing some more digging it appears a watchdog timeout may have been the initial culprit:

    pfsense01:/var/log/system.log:

    Jun 10 13:28:05 pfsense01 kernel: vip36: MASTER -> BACKUP (more frequent advertisement received)
    Jun 10 13:28:05 pfsense01 kernel: vip36: link state changed to DOWN
    Jun 10 13:28:05 pfsense01 kernel: vip4: MASTER -> BACKUP (more frequent advertisement received)
    Jun 10 13:28:05 pfsense01 kernel: vip4: link state changed to DOWN
    Jun 10 13:28:05 pfsense01 kernel: vip3: MASTER -> BACKUP (more frequent advertisement received)
    Jun 10 13:28:05 pfsense01 kernel: vip3: link state changed to DOWN
    Jun 10 13:28:05 pfsense01 kernel: vip11: MASTER -> BACKUP (more frequent advertisement received)
    Jun 10 13:28:05 pfsense01 kernel: vip11: link state changed to DOWN
    Jun 10 13:28:05 pfsense01 kernel: vip2: MASTER -> BACKUP (more frequent advertisement received)
    Jun 10 13:28:05 pfsense01 kernel: vip2: link state changed to DOWN
    Jun 10 13:28:08 pfsense01 kernel: bge0: watchdog timeout -- resetting
    Jun 10 13:28:08 pfsense01 check_reload_status: Linkup starting bge0
    Jun 10 13:28:08 pfsense01 kernel: vip1: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip5: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip6: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip7: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip8: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip9: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip12: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip14: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip15: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip16: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip17: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip18: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip20: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip21: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip22: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip23: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip24: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip25: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip26: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip27: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip28: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip29: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip31: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip32: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip33: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip34: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip35: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip37: INIT -> BACKUP
    Jun 10 13:28:08 pfsense01 kernel: vip38: INIT -> BACKUP
    
    

    Any ideas? I know watchdog timeouts are about as nebulous as ever on FreeBSD but I'm hoping there's a sysctl or something that I can tweak to help prevent this. I've already verified hw.bge.allow_asf is set to 0:

    [2.0.3-RELEASE][admin@pfsense01]/var/log(38): sysctl hw.bge
    hw.bge.allow_asf: 0
    [2.0.3-RELEASE][admin@pfsense01]/var/log(35): sysctl -a | grep bge
    <5>bge1: link state changed to UP
    <5>bge1_vlan1: link state changed to UP
    <5>bge1_vlan100: link state changed to UP
    <5>bge1_vlan20: link state changed to UP
    <5>bge1_vlan10: link state changed to UP
    bge0: watchdog timeout -- resetting
    bge0: 2 link states coalesced
    <5>bge0: link state changed to UP
    bge1: watchdog timeout -- resetting
    <5>bge1: link state changed to DOWN
    <5>bge1_vlan1: link state changed to DOWN
    <5>bge1_vlan100: link state changed to DOWN
    <5>bge1_vlan20: link state changed to DOWN
    <5>bge1_vlan10: link state changed to DOWN
    <5>bge1: link state changed to UP
    <5>bge1_vlan1: link state changed to UP
    <5>bge1_vlan100: link state changed to UP
    <5>bge1_vlan20: link state changed to UP
    <5>bge1_vlan10: link state changed to UP
    bge0: watchdog timeout -- resetting
    bge0: 2 link states coalesced
    <5>bge0: link state changed to UP
    bge1: watchdog timeout -- resetting
    <5>bge1: link state changed to DOWN
    <5>bge1_vlan1: link state changed to DOWN
    <5>bge1_vlan100: link state changed to DOWN
    <5>bge1_vlan20: link state changed to DOWN
    <5>bge1_vlan10: link state changed to DOWN
    <5>bge1: link state changed to UP
    <5>bge1_vlan1: link state changed to UP
    <5>bge1_vlan100: link state changed to UP
    <5>bge1_vlan20: link state changed to UP
    <5>bge1_vlan10: link state changed to UP
    bge1: watchdog timeout -- resetting
    <5>bge1: link state changed to DOWN
    <5>bge1_vlan1: link state changed to DOWN
    <5>bge1_vlan100: link state changed to DOWN
    <5>bge1_vlan20: link state changed to DOWN
    <5>bge1_vlan10: link state changed to DOWN
    <5>bge1: link state changed to UP
    <5>bge1_vlan1: link state changed to UP
    <5>bge1_vlan100: link state changed to UP
    <5>bge1_vlan20: link state changed to UP
    <5>bge1_vlan10: link state changed to UP
    bge1: watchdog timeout -- resetting
    <5>bge1: link state changed to DOWN
    <5>bge1_vlan1: link state changed to DOWN
    <5>bge1_vlan100: link state changed to DOWN
    <5>bge1_vlan20: link state changed to DOWN
    <5>bge1_vlan10: link state changed to DOWN
    <5>bge1: link state changed to UP
    <5>bge1_vlan1: link state changed to UP
    <5>bge1_vlan100: link state changed to UP
    <5>bge1_vlan20: link state changed to UP
    <5>bge1_vlan10: link state changed to UP
    bge1: watchdog timeout -- resetting
    <5>bge1: link state changed to DOWN
    <5>bge1_vlan1: link state changed to DOWN
    <5>bge1_vlan100: link state changed to DOWN
    <5>bge1_vlan20: link state changed to DOWN
    <5>bge1_vlan10: link state changed to DOWN
    <5>bge1: link state changed to UP
    <5>bge1_vlan1: link state changed to UP
    <5>bge1_vlan100: link state changed to UP
    <5>bge1_vlan20: link state changed to UP
    <5>bge1_vlan10: link state changed to UP
    <6>arp: 192.168.1.64 moved from 1c:e6:2b:5f:e2:dd to 20:16:d8:5d:c2:34 on bge1_vlan1
    <6>arp: 192.168.1.64 moved from 20:16:d8:5d:c2:34 to 1c:e6:2b:5f:e2:dd on bge1_vlan1
    <6>arp: 192.168.1.38 moved from 74:2f:68:e9:0b:ba to 54:04:a6:4a:38:b5 on bge1_vlan1
    <6>arp: 192.168.1.78 moved from 8c:70:5a:7b:39:1c to c4:85:08:1b:3b:a4 on bge1_vlan1
    <6>arp: 192.168.1.78 moved from c4:85:08:1b:3b:a4 to 8c:70:5a:7b:39:1c on bge1_vlan1
    bge1: watchdog timeout -- resetting
    <5>bge1: link state changed to DOWN
    <5>bge1_vlan1: link state changed to DOWN
    <5>bge1_vlan100: link state changed to DOWN
    <5>bge1_vlan20: link state changed to DOWN
    <5>bge1_vlan10: link state changed to DOWN
    <5>bge1: link state changed to UP
    <5>bge1_vlan1: link state changed to UP
    <5>bge1_vlan100: link state changed to UP
    <5>bge1_vlan20: link state changed to UP
    <5>bge1_vlan10: link state changed to UP
    bge1: watchdog timeout -- resetting
    <5>bge1: link state changed to DOWN
    <5>bge1_vlan1: link state changed to DOWN
    <5>bge1_vlan100: link state changed to DOWN
    <5>bge1_vlan20: link state changed to DOWN
    <5>bge1_vlan10: link state changed to DOWN
    <5>bge1: link state changed to UP
    <5>bge1_vlan1: link state changed to UP
    <5>bge1_vlan100: link state changed to UP
    <5>bge1_vlan20: link state changed to UP
    <5>bge1_vlan10: link state changed to UP
    bge0: watchdog timeout -- resetting
    bge0: 2 link states coalesced
    <5>bge0: link state changed to UP
    <6>arp: 192.168.1.78 moved from 8c:70:5a:7b:39:1c to c4:85:08:1b:3b:a4 on bge1_vlan1
    <6>arp: 192.168.1.78 moved from c4:85:08:1b:3b:a4 to 8c:70:5a:7b:39:1c on bge1_vlan1
    bge0: watchdog timeout -- resetting
    bge0: 2 link states coalesced
    <5>bge0: link state changed to UP
    hw.bge.allow_asf: 0
    dev.bge.0.%desc: Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x002002
    dev.bge.0.%driver: bge
    dev.bge.0.%location: slot=0 function=0
    dev.bge.0.%pnpinfo: vendor=0x14e4 device=0x1648 subvendor=0x1028 subdevice=0x014a class=0x020000
    dev.bge.0.%parent: pci2
    dev.bge.0.forced_collapse: 0
    dev.bge.0.stats.FramesDroppedDueToFilters: 0
    dev.bge.0.stats.DmaWriteQueueFull: 0
    dev.bge.0.stats.DmaWriteHighPriQueueFull: 0
    dev.bge.0.stats.NoMoreRxBDs: 0
    dev.bge.0.stats.InputDiscards: 0
    dev.bge.0.stats.InputErrors: 1
    dev.bge.0.stats.RecvThresholdHit: 325594
    dev.bge.0.stats.DmaReadQueueFull: 505
    dev.bge.0.stats.DmaReadHighPriQueueFull: 0
    dev.bge.0.stats.SendDataCompQueueFull: 428
    dev.bge.0.stats.RingSetSendProdIndex: 250750
    dev.bge.0.stats.RingStatusUpdate: 467063
    dev.bge.0.stats.Interrupts: 467063
    dev.bge.0.stats.AvoidedInterrupts: 0
    dev.bge.0.stats.SendThresholdHit: 0
    dev.bge.0.stats.rx.Octets: 77286738
    dev.bge.0.stats.rx.Fragments: 0
    dev.bge.0.stats.rx.UcastPkts: 250315
    dev.bge.0.stats.rx.MulticastPkts: 324548
    dev.bge.0.stats.rx.FCSErrors: 1
    dev.bge.0.stats.rx.AlignmentErrors: 0
    dev.bge.0.stats.rx.xonPauseFramesReceived: 0
    dev.bge.0.stats.rx.xoffPauseFramesReceived: 0
    dev.bge.0.stats.rx.ControlFramesReceived: 0
    dev.bge.0.stats.rx.xoffStateEntered: 0
    dev.bge.0.stats.rx.FramesTooLong: 0
    dev.bge.0.stats.rx.Jabbers: 0
    dev.bge.0.stats.rx.UndersizePkts: 0
    dev.bge.0.stats.rx.inRangeLengthError: 0
    dev.bge.0.stats.rx.outRangeLengthError: 0
    dev.bge.0.stats.tx.Octets: 21715826
    dev.bge.0.stats.tx.Collisions: 0
    dev.bge.0.stats.tx.XonSent: 0
    dev.bge.0.stats.tx.XoffSent: 0
    dev.bge.0.stats.tx.flowControlDone: 0
    dev.bge.0.stats.tx.InternalMacTransmitErrors: 0
    dev.bge.0.stats.tx.SingleCollisionFrames: 0
    dev.bge.0.stats.tx.MultipleCollisionFrames: 0
    dev.bge.0.stats.tx.DeferredTransmissions: 0
    dev.bge.0.stats.tx.ExcessiveCollisions: 0
    dev.bge.0.stats.tx.LateCollisions: 0
    dev.bge.0.stats.tx.UcastPkts: 251179
    dev.bge.0.stats.tx.MulticastPkts: 88
    dev.bge.0.stats.tx.BroadcastPkts: 14
    dev.bge.0.stats.tx.CarrierSenseErrors: 0
    dev.bge.0.stats.tx.Discards: 0
    dev.bge.0.stats.tx.Errors: 0
    dev.bge.1.%desc: Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x002002
    dev.bge.1.%driver: bge
    dev.bge.1.%location: slot=0 function=1
    dev.bge.1.%pnpinfo: vendor=0x14e4 device=0x1648 subvendor=0x1028 subdevice=0x014a class=0x020000
    dev.bge.1.%parent: pci2
    dev.bge.1.forced_collapse: 0
    dev.bge.1.stats.FramesDroppedDueToFilters: 0
    dev.bge.1.stats.DmaWriteQueueFull: 288263
    dev.bge.1.stats.DmaWriteHighPriQueueFull: 0
    dev.bge.1.stats.NoMoreRxBDs: 0
    dev.bge.1.stats.InputDiscards: 5229
    dev.bge.1.stats.InputErrors: 647
    dev.bge.1.stats.RecvThresholdHit: 494125434
    dev.bge.1.stats.DmaReadQueueFull: 37978
    dev.bge.1.stats.DmaReadHighPriQueueFull: 2
    dev.bge.1.stats.SendDataCompQueueFull: 21
    dev.bge.1.stats.RingSetSendProdIndex: 860426035
    dev.bge.1.stats.RingStatusUpdate: 783311166
    dev.bge.1.stats.Interrupts: 783311166
    dev.bge.1.stats.AvoidedInterrupts: 0
    dev.bge.1.stats.SendThresholdHit: 0
    dev.bge.1.stats.rx.Octets: 1583585081
    dev.bge.1.stats.rx.Fragments: 73
    dev.bge.1.stats.rx.UcastPkts: 786142776
    dev.bge.1.stats.rx.MulticastPkts: 395883
    dev.bge.1.stats.rx.FCSErrors: 577
    dev.bge.1.stats.rx.AlignmentErrors: 0
    dev.bge.1.stats.rx.xonPauseFramesReceived: 3438
    dev.bge.1.stats.rx.xoffPauseFramesReceived: 147597
    dev.bge.1.stats.rx.ControlFramesReceived: 0
    dev.bge.1.stats.rx.xoffStateEntered: 0
    dev.bge.1.stats.rx.FramesTooLong: 0
    dev.bge.1.stats.rx.Jabbers: 0
    dev.bge.1.stats.rx.UndersizePkts: 0
    dev.bge.1.stats.rx.inRangeLengthError: 0
    dev.bge.1.stats.rx.outRangeLengthError: 0
    dev.bge.1.stats.tx.Octets: 181809985
    dev.bge.1.stats.tx.Collisions: 0
    dev.bge.1.stats.tx.XonSent: 0
    dev.bge.1.stats.tx.XoffSent: 0
    dev.bge.1.stats.tx.flowControlDone: 0
    dev.bge.1.stats.tx.InternalMacTransmitErrors: 0
    dev.bge.1.stats.tx.SingleCollisionFrames: 0
    dev.bge.1.stats.tx.MultipleCollisionFrames: 0
    dev.bge.1.stats.tx.DeferredTransmissions: 0
    dev.bge.1.stats.tx.ExcessiveCollisions: 0
    dev.bge.1.stats.tx.LateCollisions: 0
    dev.bge.1.stats.tx.UcastPkts: 857839350
    dev.bge.1.stats.tx.MulticastPkts: 2032515
    dev.bge.1.stats.tx.BroadcastPkts: 21415
    dev.bge.1.stats.tx.CarrierSenseErrors: 0
    dev.bge.1.stats.tx.Discards: 0
    dev.bge.1.stats.tx.Errors: 0
    dev.miibus.0.%parent: bge0
    dev.miibus.1.%parent: bge1