Shared Virtual IPs unexpectedly toggling between two CARP members on 2.1.4
-
I have 4 GB of RAM and a 4 GB flash drive (but am using the 2 GB image).
-
cmb, you were correct. I ran tcpdump on the WAN NIC of the backup, and the master's advertisements appear to drop out for a few seconds every now and then:
09:25:01.549258 IP <master IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
09:25:02.555175 IP <master IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
09:25:05.960868 IP <backup IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
09:25:06.146762 IP <master IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
09:41:14.115512 IP <master IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
09:41:15.895334 IP <master IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
09:41:19.300492 IP <backup IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
09:41:19.315849 IP <master IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
I'm now going to start a corresponding tcpdump on the master's WAN link as a cross-reference, to see whether the packets are actually being sent. I find it hard to believe that packets are being lost between the master and backup, because they are connected to an enterprise chassis switch on an isolated VLAN. My suspicion at this point is that the master is getting too busy and is either delaying or occasionally not sending the advertisements, but the tcpdump should show that.
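In case it's useful to anyone following along, this is roughly the capture I'm leaving running in an SSH session on both boxes (em2 is my WAN NIC; adjust the interface to match your own setup):

  # watch only the CARP/VRRP advertisements to the multicast group, no name resolution
  tcpdump -ni em2 host 224.0.0.18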
-
This is simple enough that yeah, if you can leave an SSH session up running tcpdump like that, that'll suffice. Then you're not chewing up any RAM or storage space.
It's definitely one of two things: either the primary stops sending the advertisements, or they disappear somewhere after leaving the primary. Don't be so quick to write off the switches as a possible cause; when we hit these scenarios with support customers, the problem is outside the firewall more often than not. Even "enterprise" switches have issues with multicast from time to time.
Though if you can trigger it simply by running a speed test, I'd guess it's not likely the network. That would also be by far the most sensitive scenario I've ever heard of where load caused missed CARP advertisements. The only cases of that I've seen are where a huge flood of new connections (relative to the CPU of the hardware) comes in over a very short period of time and briefly overwhelms the system. The first one I can recall seeing in the real world was a number of years ago: someone with an ALIX in their colocation rack, and a mail server trying to blast out tens of thousands of connections in less than one second. It actually handled that quite nicely, considering it's a slow 500 MHz Geode CPU typically suited to SOHO networks, not colo. It passed most of the traffic, but it missed a couple of advertisements, just enough to trigger a failover. DDoS attacks are among the more common scenarios. Anything that creates a significant flood of new connections in a short period can do it.
-
It took a few days to observe a failover, but then three occurred in the space of less than a minute. I was running tcpdump on the master, and the capture shows that the master stopped sending advertisements; it even caught the backup sending out its own advertisement during the missed interval:
08:35:31.338307 IP <master IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
08:35:32.839573 IP <master IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
08:35:36.249499 IP <backup IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
08:35:36.425591 IP <master IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
08:35:37.430875 IP <master IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
08:35:56.525097 IP <master IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
08:35:57.529976 IP <master IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
08:36:00.936304 IP <backup IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
08:36:01.073325 IP <master IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
08:36:02.442006 IP <master IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
What do you suggest I try next to isolate why the advertisements are occasionally not being sent?
-
What does the output of the following commands show?
netstat -m
sysctl dev.em
What are the details of your hardware?
Can you still reliably trigger it just by running a speed test? That would indicate a much bigger issue. I'm guessing it's nowhere near that simple; it's more likely a large flood of something in a very short period that's overwhelming the system.
-
netstat -m
3518/3397/6915 mbufs in use (current/cache/total)
3103/3355/6458/131072 mbuf clusters in use (current/cache/total/max)
3101/2147 mbuf+clusters out of packet secondary zone in use (current/cache)
0/44/44/65536 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/32768 9k jumbo clusters in use (current/cache/total/max)
0/0/0/16384 16k jumbo clusters in use (current/cache/total/max)
7097K/7735K/14832K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines
sysctl dev.em (em0=CARP; em1=LAN; em2=WAN)
dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 7.3.8 dev.em.0.%driver: em dev.em.0.%location: slot=0 function=0 dev.em.0.%pnpinfo: vendor=0x8086 device=0x10d3 subvendor=0x8086 subdevice=0xa01f class=0x020000 dev.em.0.%parent: pci1 dev.em.0.nvm: -1 dev.em.0.debug: -1 dev.em.0.fc: 3 dev.em.0.rx_int_delay: 0 dev.em.0.tx_int_delay: 66 dev.em.0.rx_abs_int_delay: 66 dev.em.0.tx_abs_int_delay: 66 dev.em.0.itr: 488 dev.em.0.rx_processing_limit: 100 dev.em.0.eee_control: 1 dev.em.0.link_irq: 2 dev.em.0.mbuf_alloc_fail: 0 dev.em.0.cluster_alloc_fail: 0 dev.em.0.dropped: 0 dev.em.0.tx_dma_fail: 0 dev.em.0.rx_overruns: 0 dev.em.0.watchdog_timeouts: 0 dev.em.0.device_control: 1477444168 dev.em.0.rx_control: 67141634 dev.em.0.fc_high_water: 18432 dev.em.0.fc_low_water: 16932 dev.em.0.queue0.txd_head: 866 dev.em.0.queue0.txd_tail: 866 dev.em.0.queue0.tx_irq: 593181766 dev.em.0.queue0.no_desc_avail: 0 dev.em.0.queue0.rxd_head: 519 dev.em.0.queue0.rxd_tail: 518 dev.em.0.queue0.rx_irq: 2718514 dev.em.0.mac_stats.excess_coll: 0 dev.em.0.mac_stats.single_coll: 0 dev.em.0.mac_stats.multiple_coll: 0 dev.em.0.mac_stats.late_coll: 0 dev.em.0.mac_stats.collision_count: 0 dev.em.0.mac_stats.symbol_errors: 0 dev.em.0.mac_stats.sequence_errors: 0 dev.em.0.mac_stats.defer_count: 0 dev.em.0.mac_stats.missed_packets: 0 dev.em.0.mac_stats.recv_no_buff: 0 dev.em.0.mac_stats.recv_undersize: 0 dev.em.0.mac_stats.recv_fragmented: 0 dev.em.0.mac_stats.recv_oversize: 0 dev.em.0.mac_stats.recv_jabber: 0 dev.em.0.mac_stats.recv_errs: 0 dev.em.0.mac_stats.crc_errs: 0 dev.em.0.mac_stats.alignment_errs: 0 dev.em.0.mac_stats.coll_ext_errs: 0 dev.em.0.mac_stats.xon_recvd: 0 dev.em.0.mac_stats.xon_txd: 0 dev.em.0.mac_stats.xoff_recvd: 0 dev.em.0.mac_stats.xoff_txd: 0 dev.em.0.mac_stats.total_pkts_recvd: 2948615 dev.em.0.mac_stats.good_pkts_recvd: 2948615 dev.em.0.mac_stats.bcast_pkts_recvd: 1027 dev.em.0.mac_stats.mcast_pkts_recvd: 2419003 dev.em.0.mac_stats.rx_frames_64: 1835 dev.em.0.mac_stats.rx_frames_65_127: 660675 dev.em.0.mac_stats.rx_frames_128_255: 253676 dev.em.0.mac_stats.rx_frames_256_511: 1256024 dev.em.0.mac_stats.rx_frames_512_1023: 774274 dev.em.0.mac_stats.rx_frames_1024_1522: 2131 dev.em.0.mac_stats.good_octets_recvd: 1050155434 dev.em.0.mac_stats.good_octets_txd: 399584285100 dev.em.0.mac_stats.total_pkts_txd: 716385169 dev.em.0.mac_stats.good_pkts_txd: 716385169 dev.em.0.mac_stats.bcast_pkts_txd: 809 dev.em.0.mac_stats.mcast_pkts_txd: 716038095 dev.em.0.mac_stats.tx_frames_64: 1842 dev.em.0.mac_stats.tx_frames_65_127: 1445429 dev.em.0.mac_stats.tx_frames_128_255: 42164509 dev.em.0.mac_stats.tx_frames_256_511: 23937085 dev.em.0.mac_stats.tx_frames_512_1023: 646458006 dev.em.0.mac_stats.tx_frames_1024_1522: 2378298 dev.em.0.mac_stats.tso_txd: 0 dev.em.0.mac_stats.tso_ctx_fail: 0 dev.em.0.interrupts.asserts: 6 dev.em.0.interrupts.rx_pkt_timer: 0 dev.em.0.interrupts.rx_abs_timer: 0 dev.em.0.interrupts.tx_pkt_timer: 0 dev.em.0.interrupts.tx_abs_timer: 0 dev.em.0.interrupts.tx_queue_empty: 0 dev.em.0.interrupts.tx_queue_min_thresh: 0 dev.em.0.interrupts.rx_desc_min_thresh: 0 dev.em.0.interrupts.rx_overrun: 0 dev.em.1.%desc: Intel(R) PRO/1000 Network Connection 7.3.8 dev.em.1.%driver: em dev.em.1.%location: slot=0 function=0 dev.em.1.%pnpinfo: vendor=0x8086 device=0x10d3 subvendor=0x15d9 subdevice=0x10d3 class=0x020000 dev.em.1.%parent: pci2 dev.em.1.nvm: -1 dev.em.1.debug: -1 dev.em.1.fc: 3 dev.em.1.rx_int_delay: 0 dev.em.1.tx_int_delay: 66 dev.em.1.rx_abs_int_delay: 66 dev.em.1.tx_abs_int_delay: 66 dev.em.1.itr: 
488 dev.em.1.rx_processing_limit: 100 dev.em.1.eee_control: 1 dev.em.1.link_irq: 2 dev.em.1.mbuf_alloc_fail: 0 dev.em.1.cluster_alloc_fail: 0 dev.em.1.dropped: 0 dev.em.1.tx_dma_fail: 0 dev.em.1.rx_overruns: 0 dev.em.1.watchdog_timeouts: 0 dev.em.1.device_control: 1477444168 dev.em.1.rx_control: 67141658 dev.em.1.fc_high_water: 18432 dev.em.1.fc_low_water: 16932 dev.em.1.queue0.txd_head: 457 dev.em.1.queue0.txd_tail: 458 dev.em.1.queue0.tx_irq: 1022512833 dev.em.1.queue0.no_desc_avail: 0 dev.em.1.queue0.rxd_head: 569 dev.em.1.queue0.rxd_tail: 568 dev.em.1.queue0.rx_irq: 933448236 dev.em.1.mac_stats.excess_coll: 0 dev.em.1.mac_stats.single_coll: 0 dev.em.1.mac_stats.multiple_coll: 0 dev.em.1.mac_stats.late_coll: 0 dev.em.1.mac_stats.collision_count: 0 dev.em.1.mac_stats.symbol_errors: 0 dev.em.1.mac_stats.sequence_errors: 0 dev.em.1.mac_stats.defer_count: 0 dev.em.1.mac_stats.missed_packets: 0 dev.em.1.mac_stats.recv_no_buff: 0 dev.em.1.mac_stats.recv_undersize: 0 dev.em.1.mac_stats.recv_fragmented: 0 dev.em.1.mac_stats.recv_oversize: 0 dev.em.1.mac_stats.recv_jabber: 0 dev.em.1.mac_stats.recv_errs: 0 dev.em.1.mac_stats.crc_errs: 0 dev.em.1.mac_stats.alignment_errs: 0 dev.em.1.mac_stats.coll_ext_errs: 0 dev.em.1.mac_stats.xon_recvd: 0 dev.em.1.mac_stats.xon_txd: 0 dev.em.1.mac_stats.xoff_recvd: 0 dev.em.1.mac_stats.xoff_txd: 0 dev.em.1.mac_stats.total_pkts_recvd: 1450193674 dev.em.1.mac_stats.good_pkts_recvd: 1450193674 dev.em.1.mac_stats.bcast_pkts_recvd: 445039 dev.em.1.mac_stats.mcast_pkts_recvd: 245975 dev.em.1.mac_stats.rx_frames_64: 609064255 dev.em.1.mac_stats.rx_frames_65_127: 558214789 dev.em.1.mac_stats.rx_frames_128_255: 67510817 dev.em.1.mac_stats.rx_frames_256_511: 45074357 dev.em.1.mac_stats.rx_frames_512_1023: 44033937 dev.em.1.mac_stats.rx_frames_1024_1522: 126295519 dev.em.1.mac_stats.good_octets_recvd: 328237439148 dev.em.1.mac_stats.good_octets_txd: 3663318740841 dev.em.1.mac_stats.total_pkts_txd: 2978929548 dev.em.1.mac_stats.good_pkts_txd: 2978929548 dev.em.1.mac_stats.bcast_pkts_txd: 1323749 dev.em.1.mac_stats.mcast_pkts_txd: 1457541 dev.em.1.mac_stats.tx_frames_64: 182908152 dev.em.1.mac_stats.tx_frames_65_127: 230925677 dev.em.1.mac_stats.tx_frames_128_255: 72079980 dev.em.1.mac_stats.tx_frames_256_511: 66991884 dev.em.1.mac_stats.tx_frames_512_1023: 53886025 dev.em.1.mac_stats.tx_frames_1024_1522: 2372137830 dev.em.1.mac_stats.tso_txd: 0 dev.em.1.mac_stats.tso_ctx_fail: 0 dev.em.1.interrupts.asserts: 12 dev.em.1.interrupts.rx_pkt_timer: 0 dev.em.1.interrupts.rx_abs_timer: 0 dev.em.1.interrupts.tx_pkt_timer: 0 dev.em.1.interrupts.tx_abs_timer: 0 dev.em.1.interrupts.tx_queue_empty: 0 dev.em.1.interrupts.tx_queue_min_thresh: 0 dev.em.1.interrupts.rx_desc_min_thresh: 0 dev.em.1.interrupts.rx_overrun: 0 dev.em.2.%desc: Intel(R) PRO/1000 Network Connection 7.3.8 dev.em.2.%driver: em dev.em.2.%location: slot=0 function=0 dev.em.2.%pnpinfo: vendor=0x8086 device=0x10d3 subvendor=0x15d9 subdevice=0x10d3 class=0x020000 dev.em.2.%parent: pci3 dev.em.2.nvm: -1 dev.em.2.debug: -1 dev.em.2.fc: 3 dev.em.2.rx_int_delay: 0 dev.em.2.tx_int_delay: 66 dev.em.2.rx_abs_int_delay: 66 dev.em.2.tx_abs_int_delay: 66 dev.em.2.itr: 488 dev.em.2.rx_processing_limit: 100 dev.em.2.eee_control: 1 dev.em.2.link_irq: 163 dev.em.2.mbuf_alloc_fail: 0 dev.em.2.cluster_alloc_fail: 0 dev.em.2.dropped: 0 dev.em.2.tx_dma_fail: 0 dev.em.2.rx_overruns: 0 dev.em.2.watchdog_timeouts: 0 dev.em.2.device_control: 1477444168 dev.em.2.rx_control: 67141658 dev.em.2.fc_high_water: 18432 dev.em.2.fc_low_water: 16932 
dev.em.2.queue0.txd_head: 521 dev.em.2.queue0.txd_tail: 521 dev.em.2.queue0.tx_irq: 939338710 dev.em.2.queue0.no_desc_avail: 0 dev.em.2.queue0.rxd_head: 855 dev.em.2.queue0.rxd_tail: 854 dev.em.2.queue0.rx_irq: 1045050498 dev.em.2.mac_stats.excess_coll: 0 dev.em.2.mac_stats.single_coll: 0 dev.em.2.mac_stats.multiple_coll: 0 dev.em.2.mac_stats.late_coll: 0 dev.em.2.mac_stats.collision_count: 0 dev.em.2.mac_stats.symbol_errors: 0 dev.em.2.mac_stats.sequence_errors: 0 dev.em.2.mac_stats.defer_count: 0 dev.em.2.mac_stats.missed_packets: 35387 dev.em.2.mac_stats.recv_no_buff: 90348 dev.em.2.mac_stats.recv_undersize: 0 dev.em.2.mac_stats.recv_fragmented: 0 dev.em.2.mac_stats.recv_oversize: 0 dev.em.2.mac_stats.recv_jabber: 0 dev.em.2.mac_stats.recv_errs: 0 dev.em.2.mac_stats.crc_errs: 0 dev.em.2.mac_stats.alignment_errs: 0 dev.em.2.mac_stats.coll_ext_errs: 0 dev.em.2.mac_stats.xon_recvd: 0 dev.em.2.mac_stats.xon_txd: 977 dev.em.2.mac_stats.xoff_recvd: 0 dev.em.2.mac_stats.xoff_txd: 36364 dev.em.2.mac_stats.total_pkts_recvd: 2980884602 dev.em.2.mac_stats.good_pkts_recvd: 2980849215 dev.em.2.mac_stats.bcast_pkts_recvd: 9368 dev.em.2.mac_stats.mcast_pkts_recvd: 19793 dev.em.2.mac_stats.rx_frames_64: 184179981 dev.em.2.mac_stats.rx_frames_65_127: 231989906 dev.em.2.mac_stats.rx_frames_128_255: 71730839 dev.em.2.mac_stats.rx_frames_256_511: 67062833 dev.em.2.mac_stats.rx_frames_512_1023: 53745746 dev.em.2.mac_stats.rx_frames_1024_1522: 2372139910 dev.em.2.mac_stats.good_octets_recvd: 3663366600056 dev.em.2.mac_stats.good_octets_txd: 328053006710 dev.em.2.mac_stats.total_pkts_txd: 1448672306 dev.em.2.mac_stats.good_pkts_txd: 1448634965 dev.em.2.mac_stats.bcast_pkts_txd: 33 dev.em.2.mac_stats.mcast_pkts_txd: 1264249 dev.em.2.mac_stats.tx_frames_64: 607631228 dev.em.2.mac_stats.tx_frames_65_127: 558298218 dev.em.2.mac_stats.tx_frames_128_255: 67492396 dev.em.2.mac_stats.tx_frames_256_511: 45024710 dev.em.2.mac_stats.tx_frames_512_1023: 43894730 dev.em.2.mac_stats.tx_frames_1024_1522: 126293683 dev.em.2.mac_stats.tso_txd: 0 dev.em.2.mac_stats.tso_ctx_fail: 0 dev.em.2.interrupts.asserts: 167 dev.em.2.interrupts.rx_pkt_timer: 0 dev.em.2.interrupts.rx_abs_timer: 0 dev.em.2.interrupts.tx_pkt_timer: 0 dev.em.2.interrupts.tx_abs_timer: 0 dev.em.2.interrupts.tx_queue_empty: 0 dev.em.2.interrupts.tx_queue_min_thresh: 0 dev.em.2.interrupts.rx_desc_min_thresh: 0 dev.em.2.interrupts.rx_overrun: 0
The firewalls' hardware specs are as follows:
- Intel Atom D525 dual-core 1.80 GHz (HT disabled)
- 4 GB DDR3 RAM
- SuperMicro MBD-X7SPE-HF-D525-O motherboard
- dual Intel 82574L 1 Gbps on-board NICs
- Intel 1 Gbps PCIe NIC for CARP
I can no longer trigger the issue by merely running a speed test. However, I have made the following tweaks in an attempt to improve performance (set roughly as shown after the list), so maybe they are helping a bit but not completely solving the problem:
- disabled hyperthreading in the BIOS
- net.inet.ip.fastforwarding=1
- kern.ipc.nmbclusters="131072"
- hw.em.num_queues=1
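This is just a sketch of how they're applied on my boxes: kern.ipc.nmbclusters and hw.em.num_queues as boot-time loader tunables, and net.inet.ip.fastforwarding as a runtime sysctl added through System > Advanced > System Tunables:

  # /boot/loader.conf.local -- boot-time tunables, read by the loader at boot
  kern.ipc.nmbclusters="131072"
  hw.em.num_queues=1

  # runtime sysctl (added as a System Tunable in the GUI)
  net.inet.ip.fastforwarding=1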
-
Upping nmbclusters is likely what made things better in general; prior to that it didn't have enough resources. Now it's likely an occasional short, large burst of traffic. Upping the advskew to make it less sensitive to scenarios like that is probably your best bet, short of a faster system.
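On pfSense that's the advertising skew setting on the CARP VIP under Firewall > Virtual IPs; at the carp(4) level it amounts to roughly the following (illustrative vhid, password, and VIP only):

  # master: low skew, advertises roughly every advbase seconds
  ifconfig carp0 vhid 1 advbase 1 advskew 0 pass examplepass 203.0.113.10/24
  # backup: a larger advskew makes it wait longer before it tries to take over
  ifconfig carp0 vhid 1 advbase 1 advskew 100 pass examplepass 203.0.113.10/24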
-
To prove your theory, what would be the best command(s) I could execute every second to a log file to monitor the resources that you believe are being exhausted?
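I was thinking of something simple along these lines, assuming netstat -m covers the resources you mean (the log path is arbitrary):

  # append a timestamped netstat -m snapshot to a log once per second
  while true; do
      date >> /var/log/mbuf-watch.log
      netstat -m >> /var/log/mbuf-watch.log
      sleep 1
  done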
Are there guidelines as to what kind of specs are required to route a 500 Mbps-1 Gbps link? I thought I was purchasing something adequate for routing, minimal firewalling, and no NAT. Next time I want to know for sure that I'll have enough power. :)
-
What command in FreeBSD can I use to print the number of concurrent connections, so that I can start a log file and see whether the connection count coincides with dropped advertisements? In Linux this data appears to be in /proc/net/tcp, but that file doesn't exist in pfSense. Is there an equivalent?
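The closest thing I've found so far is the pf state table summary; would logging its "current entries" counter, something like this, be a reasonable substitute?

  # record the firewall state count once per second
  while true; do
      echo "$(date '+%H:%M:%S') $(pfctl -si | grep 'current entries')" >> /var/log/state-count.log
      sleep 1
  done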
-
I believe the issue may be due to interrupts. From watching top (top -P), I see that the interrupts are only hitting one CPU. Is there a way to balance that load across both CPUs?
last pid: 79130;  load averages: 0.33, 0.22, 0.15   up 85+19:36:44  13:05:01
38 processes: 1 running, 37 sleeping
CPU 0:  0.0% user,  0.0% nice,  0.0% system, 54.3% interrupt, 45.7% idle
CPU 1:  0.0% user,  0.0% nice,  1.1% system,  0.0% interrupt, 98.9% idle
Mem: 84M Active, 32M Inact, 281M Wired, 1336K Cache, 91M Buf, 3524M Free
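For what it's worth, these are the commands I'm planning to use to see which devices are generating that interrupt load (going by their man pages; I haven't dug further yet):

  vmstat -i    # per-device interrupt counts and rates
  top -SHP     # include system/interrupt threads, with the per-CPU display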