Shared Virtual IPs unexpectedly toggling between two CARP members on 2.1.4



  • I have two identical 3-NIC firewalls set up with CARP, running the 2.1.4 release. I've been getting sporadic email notices from my backup firewall to the effect of:

    Carp cluster member "x.x.x.x - WAN Shared Virtual IP (wan_vip1)" has resumed the state "MASTER"
    Carp cluster member "x.x.x.x - LAN Shared Virtual IP (lan_vip2)" has resumed the state "MASTER"

    Carp cluster member "x.x.x.x - WAN Shared Virtual IP (wan_vip1)" has resumed the state "BACKUP"
    Carp cluster member "x.x.x.x - LAN Shared Virtual IP (lan_vip2)" has resumed the state "BACKUP"

    The two Virtual IPs toggle back and forth in tandem according to the email alerts; however, the change occurs so quickly that I never see it in the web view. I wouldn't even know it was occurring except for the email alerts. I would like to find out what is causing these role-reversal toggles and whether something needs to be corrected. The toggles occur on average about once per week. The only thing I have noticed that coincides with them is higher-than-normal bandwidth usage. For example, I can typically reproduce the issue by running a speed test at speedtest.net (which yields 328Mbps up and 152Mbps down on a 500Mbps up/down link).



  • Is there a log of some sort so I can determine what is triggering the CARP switch-over?


  • Rebel Alliance Developer Netgate

    CARP events are logged in the main system log (Status > System Logs, General tab)

    Are you using any IP Alias VIPs on top of the CARP VIPs? If so, there is a known issue with 2.1.4 and that combination, but there is a patch.

    If you aren't, then it's likely not a config issue but something else on your switch/layer 2.
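
    If it helps to pull just the CARP entries from a shell, a minimal sketch on 2.1.x (assuming the default circular log at /var/log/system.log; adjust the path if you log elsewhere):

    # read the circular system log and keep only the CARP state changes
    clog /var/log/system.log | grep -i carp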



  • I don't believe I have IP Alias VIPs on top of CARP VIPs. I have dual pfSense routers set up as described in https://doc.pfsense.org/index.php/Configuring_pfSense_Hardware_Redundancy_(CARP). Each router has 3 interfaces, and I have both a shared WAN VIP and a shared LAN VIP; those are my only 2 VIPs. The 4 LAN/WAN interfaces are plugged into an enterprise-grade chassis switch, and the CARP interfaces are connected directly to each other.

    I checked the System Logs and the primary doesn't show anything at the time of the supposed fail-over. The backup unit shows this:

    Aug 4 08:11:42 sshd[46683]: Did not receive identification string from x.x.x.x
    Aug 4 13:35:12 kernel: lan_vip2: link state changed to UP
    Aug 4 13:35:12 kernel: wan_vip1: link state changed to UP
    Aug 4 13:35:12 kernel: lan_vip2: MASTER -> BACKUP (more frequent advertisement received)
    Aug 4 13:35:12 kernel: wan_vip1: MASTER -> BACKUP (more frequent advertisement received)
    Aug 4 13:35:12 kernel: lan_vip2: link state changed to DOWN
    Aug 4 13:35:12 kernel: wan_vip1: link state changed to DOWN
    Aug 4 13:35:14 php: rc.carpmaster: Message sent to my_email@my_domain.org OK
    Aug 4 13:35:15 php: rc.carpmaster: Message sent to my_email@my_domain.org OK
    Aug 4 13:35:17 php: rc.carpbackup: Message sent to my_email@my_domain.org OK
    Aug 4 13:35:18 php: rc.carpbackup: Message sent to my_email@my_domain.org OK

    Thanks for the information on logging! Do you have any suggestions on how I can investigate this further?



  • I upgraded to 2.1.5 but the problem persists. I still believe the problem is related to high network traffic through the master firewall. Stats on the firewalls' hardware are as follows:

    • dual Intel 82574L 1Gbps on-board NICs

    • Intel 1Gbps PCIe NIC for CARP

    • Intel Atom D525 dual core 1.80GHz

    • RAM 4GB DDR3

    • Motherboard SuperMicro MBD-X7SPE-HF-D525-O

    I'm only doing routing (no NAT) with a couple of basic firewall rules to block undesirable access to the firewalls themselves. The CPU appears to be running about 90% idle; the most usage I see is from interrupts, which average around 10%. I'm wondering if I need to do some tuning of the system to be able to handle a 500 Mbps (up/down) Internet connection. Any recommendations?



  • What does the secondary log when it goes backup -> master? That's likely more interesting than when it switches back (though it may not be any more telling).



  • It appears that the message is merely "preempting a slower master."

    Oct 20 15:37:50 borderrtr-02 check_reload_status: Reloading filter
    Oct 20 15:38:04 borderrtr-02 kernel: wan_vip1: BACKUP -> MASTER (preempting a slower master)
    Oct 20 15:38:04 borderrtr-02 kernel: wan_vip1: link state changed to UP
    Oct 20 15:38:06 borderrtr-02 kernel: wan_vip1: MASTER -> BACKUP (more frequent advertisement received)
    Oct 20 15:38:06 borderrtr-02 kernel: wan_vip1: link state changed to DOWN
    Oct 20 15:38:06 borderrtr-02 php: rc.carpmaster: Message sent to email@mydomain.com OK
    Oct 20 15:38:08 borderrtr-02 php: rc.carpbackup: Message sent to email@mydomain.com OK
    Oct 20 15:41:17 borderrtr-02 kernel: wan_vip1: link state changed to UP
    Oct 20 15:41:17 borderrtr-02 kernel: lan_vip2: link state changed to UP
    Oct 20 15:41:17 borderrtr-02 kernel: lan_vip2: MASTER -> BACKUP (more frequent advertisement received)
    Oct 20 15:41:17 borderrtr-02 kernel: wan_vip1: MASTER -> BACKUP (more frequent advertisement received)
    Oct 20 15:41:17 borderrtr-02 kernel: lan_vip2: link state changed to DOWN
    Oct 20 15:41:17 borderrtr-02 kernel: wan_vip1: link state changed to DOWN
    Oct 20 15:41:19 borderrtr-02 php: rc.carpmaster: Message sent to email@mydomain.com OK
    

    To attempt to improve the performance of the master, I have disabled hyper-threading and set "sysctl net.inet.ip.fastforwarding=1". I found these tweaks suggested in many places; however, neither of them has resolved the issue. I don't know whether it is directly related or not, but when I watch top, I see that the interrupt value can go as high as 50%, which seems to be when the switch-over occurs.
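
    In case it helps anyone following along: setting that value from the shell only lasts until reboot, so it also needs to be made persistent (on pfSense that is normally done under System > Advanced > System Tunables); the commands below just set and verify the running value:

    # enable fast forwarding for the current boot and confirm it took effect
    sysctl net.inet.ip.fastforwarding=1
    sysctl net.inet.ip.fastforwarding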



  • Don't know if this is helpful or not, but here is my interrupt usage info:

    vmstat -i
    interrupt                          total       rate
    irq18: ehci0 uhci5                 12813          0
    irq19: uhci2 uhci4*                   21          0
    irq23: uhci3 ehci1                    32          0
    cpu0: timer                    119266502       1992
    irq256: em0:rx 0                  132655          2
    irq257: em0:tx 0                10814616        180
    irq258: em0:link                       1          0
    irq259: em1:rx 0                23300800        389
    irq260: em1:tx 0                21737781        363
    irq261: em1:link                       2          0
    irq262: em2:rx 0                21674367        362
    irq263: em2:tx 0                23499883        392
    irq264: em2:link                     199          0
    cpu1: timer                    119266493       1992
    Total                          339706165       5673
    
    


  • That tells you the secondary briefly stopped receiving CARP's multicast advertisements from the primary. Whether that's because the primary stopped sending them, or something in between the primary and secondary stopped passing them along, is the question now. Run a packet capture on WAN, filtering on CARP (filtering on host 224.0.0.18 will suffice). Next time it happens, stop and review that capture and see if there is a gap that coincides with the change. You should see one packet per second.
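
    For example, a minimal capture from a shell might look like this (em2 is assumed to be the WAN here, per the interface mapping posted later; substitute your own WAN interface):

    # show only the CARP advertisements on the WAN, with a small snap length
    tcpdump -ni em2 -s 64 host 224.0.0.18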



  • What is the best way to run an extended packet capture on an embedded system running from a flash drive?



  • How much RAM do you have in the system, and how big is the storage medium? You can use a pretty small snap length, but we're still talking 64 bytes per second, and the default RAM disks are pretty small for something like that. Still, if it happens within at most a few days, it should be practical. It might be best to increase the RAM disk size, or to just leave the filesystem mounted read-write for a few days (not going to hurt anything), capturing to the flash.
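
    A sketch of a disk-friendly capture, assuming it is written to a hypothetical /root/carp.pcap on the rw-mounted flash (the -C/-W options cap the total space used by rotating through a fixed number of files):

    # rotate through 5 files of roughly 2 MB each so the capture can run for days
    tcpdump -ni em2 -s 64 -C 2 -W 5 -w /root/carp.pcap host 224.0.0.18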



  • I have 4 GB of RAM and a 4 GB flash drive (but am using the 2 GB image).



  • cmb, you were correct. I ran tcpdump on the WAN NIC of the backup, and the advertisements do appear to drop out for a couple of seconds every now and then:

    09:25:01.549258 IP <master ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
    09:25:02.555175 IP <master ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
    09:25:05.960868 IP <backup ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
    09:25:06.146762 IP <master ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36

    09:41:14.115512 IP <master ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
    09:41:15.895334 IP <master ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
    09:41:19.300492 IP <backup ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
    09:41:19.315849 IP <master ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
    

    I'm now going to start a corresponding tcpdump on the master's WAN link as a cross-reference, to see whether the packets are actually being sent. I find it hard to believe that packets are being lost between the master and backup, because they are connected to an enterprise chassis switch on an isolated VLAN. My suspicion at this point is that the master is getting too busy and is either delaying or occasionally not sending the packets, but the tcpdump should show that.



  • This is simple enough that, yeah, if you can leave an SSH session up running tcpdump like that, it'll suffice. Then you're not chewing up any RAM or storage space.

    It's definitely either that the primary stops sending it, or that it disappears after leaving the primary. Don't be so quick to write off the switches as a possible cause; when we hit these scenarios with support customers, it's something outside the firewall more often than not. Even "enterprise" switches have issues with multicast from time to time.

    Though if you can trigger it simply by running a speed test, I'd guess it's not likely the network. That would also be, by far, the most touchy scenario I'd ever heard of where load caused missed CARP advertisements. The only scenarios like that I've seen are where a huge flood of new connections (relative to the CPU of the hardware) comes in over a very short period of time and briefly overwhelms the system. The first one I can recall seeing in the real world was a number of years ago: someone with an ALIX in their colocation rack, with a mail server trying to blast out tens of thousands of connections in less than 1 second. It actually handled that quite nicely considering it's a slow 500 MHz Geode CPU typically suited for SOHO networks, not colo. It passed most of the traffic, but the flood caused it to miss a couple of advertisements, just enough to trigger failover. DDoS attacks are among the more common scenarios. Anything that creates a significant flood of new connections in a short period can do it.



  • It took a few days to observe a fail-over, but then 3 occurred over the course of less than one minute. I was running tcpdump on the master, and it shows that the master stopped sending the advertisements; I even observed the secondary sending out its own advertisement during the missed interval:

    08:35:31.338307 IP <master ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
    08:35:32.839573 IP <master ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
    08:35:36.249499 IP <backup ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
    08:35:36.425591 IP <master ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
    08:35:37.430875 IP <master ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36

    08:35:56.525097 IP <master ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
    08:35:57.529976 IP <master ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
    08:36:00.936304 IP <backup ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
    08:36:01.073325 IP <master ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
    08:36:02.442006 IP <master ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
    

    What is your suggestion on what I should try next to isolate why the advertisements are occasionally not being sent?



  • What does the output of the following commands show?

    netstat -m 
    sysctl dev.em
    

    What are the details of your hardware?

    Can you still reliably trigger it just by running a speed test? That would indicate a much bigger issue. I'm guessing it's nowhere near that simple; it's likely a large flood of something in a very short period that's overwhelming the system.



  • netstat -m

    3518/3397/6915 mbufs in use (current/cache/total)
    3103/3355/6458/131072 mbuf clusters in use (current/cache/total/max)
    3101/2147 mbuf+clusters out of packet secondary zone in use (current/cache)
    0/44/44/65536 4k (page size) jumbo clusters in use (current/cache/total/max)
    0/0/0/32768 9k jumbo clusters in use (current/cache/total/max)
    0/0/0/16384 16k jumbo clusters in use (current/cache/total/max)
    7097K/7735K/14832K bytes allocated to network (current/cache/total)
    0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for jumbo clusters denied (4k/9k/16k)
    0/0/0 sfbufs in use (current/peak/max)
    0 requests for sfbufs denied
    0 requests for sfbufs delayed
    0 requests for I/O initiated by sendfile
    0 calls to protocol drain routines
    

    sysctl dev.em (em0=CARP; em1=LAN; em2=WAN)

    dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 7.3.8
    dev.em.0.%driver: em
    dev.em.0.%location: slot=0 function=0
    dev.em.0.%pnpinfo: vendor=0x8086 device=0x10d3 subvendor=0x8086 subdevice=0xa01f class=0x020000
    dev.em.0.%parent: pci1
    dev.em.0.nvm: -1
    dev.em.0.debug: -1
    dev.em.0.fc: 3
    dev.em.0.rx_int_delay: 0
    dev.em.0.tx_int_delay: 66
    dev.em.0.rx_abs_int_delay: 66
    dev.em.0.tx_abs_int_delay: 66
    dev.em.0.itr: 488
    dev.em.0.rx_processing_limit: 100
    dev.em.0.eee_control: 1
    dev.em.0.link_irq: 2
    dev.em.0.mbuf_alloc_fail: 0
    dev.em.0.cluster_alloc_fail: 0
    dev.em.0.dropped: 0
    dev.em.0.tx_dma_fail: 0
    dev.em.0.rx_overruns: 0
    dev.em.0.watchdog_timeouts: 0
    dev.em.0.device_control: 1477444168
    dev.em.0.rx_control: 67141634
    dev.em.0.fc_high_water: 18432
    dev.em.0.fc_low_water: 16932
    dev.em.0.queue0.txd_head: 866
    dev.em.0.queue0.txd_tail: 866
    dev.em.0.queue0.tx_irq: 593181766
    dev.em.0.queue0.no_desc_avail: 0
    dev.em.0.queue0.rxd_head: 519
    dev.em.0.queue0.rxd_tail: 518
    dev.em.0.queue0.rx_irq: 2718514
    dev.em.0.mac_stats.excess_coll: 0
    dev.em.0.mac_stats.single_coll: 0
    dev.em.0.mac_stats.multiple_coll: 0
    dev.em.0.mac_stats.late_coll: 0
    dev.em.0.mac_stats.collision_count: 0
    dev.em.0.mac_stats.symbol_errors: 0
    dev.em.0.mac_stats.sequence_errors: 0
    dev.em.0.mac_stats.defer_count: 0
    dev.em.0.mac_stats.missed_packets: 0
    dev.em.0.mac_stats.recv_no_buff: 0
    dev.em.0.mac_stats.recv_undersize: 0
    dev.em.0.mac_stats.recv_fragmented: 0
    dev.em.0.mac_stats.recv_oversize: 0
    dev.em.0.mac_stats.recv_jabber: 0
    dev.em.0.mac_stats.recv_errs: 0
    dev.em.0.mac_stats.crc_errs: 0
    dev.em.0.mac_stats.alignment_errs: 0
    dev.em.0.mac_stats.coll_ext_errs: 0
    dev.em.0.mac_stats.xon_recvd: 0
    dev.em.0.mac_stats.xon_txd: 0
    dev.em.0.mac_stats.xoff_recvd: 0
    dev.em.0.mac_stats.xoff_txd: 0
    dev.em.0.mac_stats.total_pkts_recvd: 2948615
    dev.em.0.mac_stats.good_pkts_recvd: 2948615
    dev.em.0.mac_stats.bcast_pkts_recvd: 1027
    dev.em.0.mac_stats.mcast_pkts_recvd: 2419003
    dev.em.0.mac_stats.rx_frames_64: 1835
    dev.em.0.mac_stats.rx_frames_65_127: 660675
    dev.em.0.mac_stats.rx_frames_128_255: 253676
    dev.em.0.mac_stats.rx_frames_256_511: 1256024
    dev.em.0.mac_stats.rx_frames_512_1023: 774274
    dev.em.0.mac_stats.rx_frames_1024_1522: 2131
    dev.em.0.mac_stats.good_octets_recvd: 1050155434
    dev.em.0.mac_stats.good_octets_txd: 399584285100
    dev.em.0.mac_stats.total_pkts_txd: 716385169
    dev.em.0.mac_stats.good_pkts_txd: 716385169
    dev.em.0.mac_stats.bcast_pkts_txd: 809
    dev.em.0.mac_stats.mcast_pkts_txd: 716038095
    dev.em.0.mac_stats.tx_frames_64: 1842
    dev.em.0.mac_stats.tx_frames_65_127: 1445429
    dev.em.0.mac_stats.tx_frames_128_255: 42164509
    dev.em.0.mac_stats.tx_frames_256_511: 23937085
    dev.em.0.mac_stats.tx_frames_512_1023: 646458006
    dev.em.0.mac_stats.tx_frames_1024_1522: 2378298
    dev.em.0.mac_stats.tso_txd: 0
    dev.em.0.mac_stats.tso_ctx_fail: 0
    dev.em.0.interrupts.asserts: 6
    dev.em.0.interrupts.rx_pkt_timer: 0
    dev.em.0.interrupts.rx_abs_timer: 0
    dev.em.0.interrupts.tx_pkt_timer: 0
    dev.em.0.interrupts.tx_abs_timer: 0
    dev.em.0.interrupts.tx_queue_empty: 0
    dev.em.0.interrupts.tx_queue_min_thresh: 0
    dev.em.0.interrupts.rx_desc_min_thresh: 0
    dev.em.0.interrupts.rx_overrun: 0
    dev.em.1.%desc: Intel(R) PRO/1000 Network Connection 7.3.8
    dev.em.1.%driver: em
    dev.em.1.%location: slot=0 function=0
    dev.em.1.%pnpinfo: vendor=0x8086 device=0x10d3 subvendor=0x15d9 subdevice=0x10d3 class=0x020000
    dev.em.1.%parent: pci2
    dev.em.1.nvm: -1
    dev.em.1.debug: -1
    dev.em.1.fc: 3
    dev.em.1.rx_int_delay: 0
    dev.em.1.tx_int_delay: 66
    dev.em.1.rx_abs_int_delay: 66
    dev.em.1.tx_abs_int_delay: 66
    dev.em.1.itr: 488
    dev.em.1.rx_processing_limit: 100
    dev.em.1.eee_control: 1
    dev.em.1.link_irq: 2
    dev.em.1.mbuf_alloc_fail: 0
    dev.em.1.cluster_alloc_fail: 0
    dev.em.1.dropped: 0
    dev.em.1.tx_dma_fail: 0
    dev.em.1.rx_overruns: 0
    dev.em.1.watchdog_timeouts: 0
    dev.em.1.device_control: 1477444168
    dev.em.1.rx_control: 67141658
    dev.em.1.fc_high_water: 18432
    dev.em.1.fc_low_water: 16932
    dev.em.1.queue0.txd_head: 457
    dev.em.1.queue0.txd_tail: 458
    dev.em.1.queue0.tx_irq: 1022512833
    dev.em.1.queue0.no_desc_avail: 0
    dev.em.1.queue0.rxd_head: 569
    dev.em.1.queue0.rxd_tail: 568
    dev.em.1.queue0.rx_irq: 933448236
    dev.em.1.mac_stats.excess_coll: 0
    dev.em.1.mac_stats.single_coll: 0
    dev.em.1.mac_stats.multiple_coll: 0
    dev.em.1.mac_stats.late_coll: 0
    dev.em.1.mac_stats.collision_count: 0
    dev.em.1.mac_stats.symbol_errors: 0
    dev.em.1.mac_stats.sequence_errors: 0
    dev.em.1.mac_stats.defer_count: 0
    dev.em.1.mac_stats.missed_packets: 0
    dev.em.1.mac_stats.recv_no_buff: 0
    dev.em.1.mac_stats.recv_undersize: 0
    dev.em.1.mac_stats.recv_fragmented: 0
    dev.em.1.mac_stats.recv_oversize: 0
    dev.em.1.mac_stats.recv_jabber: 0
    dev.em.1.mac_stats.recv_errs: 0
    dev.em.1.mac_stats.crc_errs: 0
    dev.em.1.mac_stats.alignment_errs: 0
    dev.em.1.mac_stats.coll_ext_errs: 0
    dev.em.1.mac_stats.xon_recvd: 0
    dev.em.1.mac_stats.xon_txd: 0
    dev.em.1.mac_stats.xoff_recvd: 0
    dev.em.1.mac_stats.xoff_txd: 0
    dev.em.1.mac_stats.total_pkts_recvd: 1450193674
    dev.em.1.mac_stats.good_pkts_recvd: 1450193674
    dev.em.1.mac_stats.bcast_pkts_recvd: 445039
    dev.em.1.mac_stats.mcast_pkts_recvd: 245975
    dev.em.1.mac_stats.rx_frames_64: 609064255
    dev.em.1.mac_stats.rx_frames_65_127: 558214789
    dev.em.1.mac_stats.rx_frames_128_255: 67510817
    dev.em.1.mac_stats.rx_frames_256_511: 45074357
    dev.em.1.mac_stats.rx_frames_512_1023: 44033937
    dev.em.1.mac_stats.rx_frames_1024_1522: 126295519
    dev.em.1.mac_stats.good_octets_recvd: 328237439148
    dev.em.1.mac_stats.good_octets_txd: 3663318740841
    dev.em.1.mac_stats.total_pkts_txd: 2978929548
    dev.em.1.mac_stats.good_pkts_txd: 2978929548
    dev.em.1.mac_stats.bcast_pkts_txd: 1323749
    dev.em.1.mac_stats.mcast_pkts_txd: 1457541
    dev.em.1.mac_stats.tx_frames_64: 182908152
    dev.em.1.mac_stats.tx_frames_65_127: 230925677
    dev.em.1.mac_stats.tx_frames_128_255: 72079980
    dev.em.1.mac_stats.tx_frames_256_511: 66991884
    dev.em.1.mac_stats.tx_frames_512_1023: 53886025
    dev.em.1.mac_stats.tx_frames_1024_1522: 2372137830
    dev.em.1.mac_stats.tso_txd: 0
    dev.em.1.mac_stats.tso_ctx_fail: 0
    dev.em.1.interrupts.asserts: 12
    dev.em.1.interrupts.rx_pkt_timer: 0
    dev.em.1.interrupts.rx_abs_timer: 0
    dev.em.1.interrupts.tx_pkt_timer: 0
    dev.em.1.interrupts.tx_abs_timer: 0
    dev.em.1.interrupts.tx_queue_empty: 0
    dev.em.1.interrupts.tx_queue_min_thresh: 0
    dev.em.1.interrupts.rx_desc_min_thresh: 0
    dev.em.1.interrupts.rx_overrun: 0
    dev.em.2.%desc: Intel(R) PRO/1000 Network Connection 7.3.8
    dev.em.2.%driver: em
    dev.em.2.%location: slot=0 function=0
    dev.em.2.%pnpinfo: vendor=0x8086 device=0x10d3 subvendor=0x15d9 subdevice=0x10d3 class=0x020000
    dev.em.2.%parent: pci3
    dev.em.2.nvm: -1
    dev.em.2.debug: -1
    dev.em.2.fc: 3
    dev.em.2.rx_int_delay: 0
    dev.em.2.tx_int_delay: 66
    dev.em.2.rx_abs_int_delay: 66
    dev.em.2.tx_abs_int_delay: 66
    dev.em.2.itr: 488
    dev.em.2.rx_processing_limit: 100
    dev.em.2.eee_control: 1
    dev.em.2.link_irq: 163
    dev.em.2.mbuf_alloc_fail: 0
    dev.em.2.cluster_alloc_fail: 0
    dev.em.2.dropped: 0
    dev.em.2.tx_dma_fail: 0
    dev.em.2.rx_overruns: 0
    dev.em.2.watchdog_timeouts: 0
    dev.em.2.device_control: 1477444168
    dev.em.2.rx_control: 67141658
    dev.em.2.fc_high_water: 18432
    dev.em.2.fc_low_water: 16932
    dev.em.2.queue0.txd_head: 521
    dev.em.2.queue0.txd_tail: 521
    dev.em.2.queue0.tx_irq: 939338710
    dev.em.2.queue0.no_desc_avail: 0
    dev.em.2.queue0.rxd_head: 855
    dev.em.2.queue0.rxd_tail: 854
    dev.em.2.queue0.rx_irq: 1045050498
    dev.em.2.mac_stats.excess_coll: 0
    dev.em.2.mac_stats.single_coll: 0
    dev.em.2.mac_stats.multiple_coll: 0
    dev.em.2.mac_stats.late_coll: 0
    dev.em.2.mac_stats.collision_count: 0
    dev.em.2.mac_stats.symbol_errors: 0
    dev.em.2.mac_stats.sequence_errors: 0
    dev.em.2.mac_stats.defer_count: 0
    dev.em.2.mac_stats.missed_packets: 35387
    dev.em.2.mac_stats.recv_no_buff: 90348
    dev.em.2.mac_stats.recv_undersize: 0
    dev.em.2.mac_stats.recv_fragmented: 0
    dev.em.2.mac_stats.recv_oversize: 0
    dev.em.2.mac_stats.recv_jabber: 0
    dev.em.2.mac_stats.recv_errs: 0
    dev.em.2.mac_stats.crc_errs: 0
    dev.em.2.mac_stats.alignment_errs: 0
    dev.em.2.mac_stats.coll_ext_errs: 0
    dev.em.2.mac_stats.xon_recvd: 0
    dev.em.2.mac_stats.xon_txd: 977
    dev.em.2.mac_stats.xoff_recvd: 0
    dev.em.2.mac_stats.xoff_txd: 36364
    dev.em.2.mac_stats.total_pkts_recvd: 2980884602
    dev.em.2.mac_stats.good_pkts_recvd: 2980849215
    dev.em.2.mac_stats.bcast_pkts_recvd: 9368
    dev.em.2.mac_stats.mcast_pkts_recvd: 19793
    dev.em.2.mac_stats.rx_frames_64: 184179981
    dev.em.2.mac_stats.rx_frames_65_127: 231989906
    dev.em.2.mac_stats.rx_frames_128_255: 71730839
    dev.em.2.mac_stats.rx_frames_256_511: 67062833
    dev.em.2.mac_stats.rx_frames_512_1023: 53745746
    dev.em.2.mac_stats.rx_frames_1024_1522: 2372139910
    dev.em.2.mac_stats.good_octets_recvd: 3663366600056
    dev.em.2.mac_stats.good_octets_txd: 328053006710
    dev.em.2.mac_stats.total_pkts_txd: 1448672306
    dev.em.2.mac_stats.good_pkts_txd: 1448634965
    dev.em.2.mac_stats.bcast_pkts_txd: 33
    dev.em.2.mac_stats.mcast_pkts_txd: 1264249
    dev.em.2.mac_stats.tx_frames_64: 607631228
    dev.em.2.mac_stats.tx_frames_65_127: 558298218
    dev.em.2.mac_stats.tx_frames_128_255: 67492396
    dev.em.2.mac_stats.tx_frames_256_511: 45024710
    dev.em.2.mac_stats.tx_frames_512_1023: 43894730
    dev.em.2.mac_stats.tx_frames_1024_1522: 126293683
    dev.em.2.mac_stats.tso_txd: 0
    dev.em.2.mac_stats.tso_ctx_fail: 0
    dev.em.2.interrupts.asserts: 167
    dev.em.2.interrupts.rx_pkt_timer: 0
    dev.em.2.interrupts.rx_abs_timer: 0
    dev.em.2.interrupts.tx_pkt_timer: 0
    dev.em.2.interrupts.tx_abs_timer: 0
    dev.em.2.interrupts.tx_queue_empty: 0
    dev.em.2.interrupts.tx_queue_min_thresh: 0
    dev.em.2.interrupts.rx_desc_min_thresh: 0
    dev.em.2.interrupts.rx_overrun: 0
    
    

    Stats on the firewalls' hardware are as follows:

    • Intel Atom D525 dual core 1.80GHz (HT disabled)

    • RAM 4GB DDR3

    • Motherboard SuperMicro MBD-X7SPE-HF-D525-O

    • dual Intel 82574L 1Gbps on-board NICs

    • Intel 1Gbps PCIe NIC for CARP

    I can no longer trigger the issue by running a mere speed test. However, I have made these tweaks in an attempt to improve performance, so maybe they are helping a bit without completely solving the problem:

    • disabled hyperthreading in BIOS

    • net.inet.ip.fastforwarding=1

    • kern.ipc.nmbclusters="131072"

    • hw.em.num_queues=1



  • Upping nmbclusters is likely what made things better in general; prior to that it didn't have enough resources. Now it's likely an occasional short, large burst of traffic. Upping the advskew to make it less sensitive to scenarios like that is maybe the best bet, short of a faster system.
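
    A quick way to check the values currently in effect before and after a change (these commands use the anonymized VIP interface names from your logs; on a stock 2.1.x box the CARP interfaces are typically named carp0, carp1, and the skew itself is edited per VIP under Firewall > Virtual IPs):

    # show the advbase/advskew currently in effect on each CARP VIP
    ifconfig wan_vip1 | grep carp
    ifconfig lan_vip2 | grep carp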



  • To prove your theory, what would be the best command(s) to execute every second, logging to a file, in order to monitor the resources that you believe are being exhausted? (One possible form of such a loop is sketched below.)

    Are there guidelines as to what type of specs are required to route a 500Mbps-1Gbps link? I thought I was purchasing something adequate to do routing, minimal firewalling and no NATing. Next time I want to know for sure that I'll have enough power. :)
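
    On the monitoring question, a minimal once-per-second logger might look like the following; the specific counters sampled here (mbuf usage and NIC interrupt totals) and the /root/resource-monitor.log path are only assumptions, so substitute whatever turns out to be relevant:

    # append a timestamped snapshot of mbuf usage and NIC interrupt counters once per second
    while true; do
        date
        netstat -m | head -3
        vmstat -i | grep em
        sleep 1
    done >> /root/resource-monitor.log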



  • What command in FreeBSD can I use to print the number of concurrent connections, so that I can start a log file and see whether the number of connections coincides with the dropped advertisements? In Linux it appears this data is in /proc/net/tcp, but that file doesn't exist in pfSense. Is there an equivalent?
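
    For reference, the closest FreeBSD/pf analogue is the pf state table rather than a /proc file; a sketch of counting it from the shell (which counter is most meaningful here is an assumption):

    # summary including the "current entries" count of the pf state table
    pfctl -si | grep -i 'current entries'
    # or count the individual states directly
    pfctl -ss | wc -l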



  • I believe the issue may be due to interrupts. From watching top (top -P CC), I see that the interrupts are only hitting one CPU. Is there a way to balance that load across both CPUs?

    
    last pid: 79130;  load averages:  0.33,  0.22,  0.15                                                            up 85+19:36:44  13:05:01
    38 processes:  1 running, 37 sleeping
    CPU 0:  0.0% user,  0.0% nice,  0.0% system, 54.3% interrupt, 45.7% idle
    CPU 1:  0.0% user,  0.0% nice,  1.1% system,  0.0% interrupt, 98.9% idle
    Mem: 84M Active, 32M Inact, 281M Wired, 1336K Cache, 91M Buf, 3524M Free
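
    If I understand the em(4) driver on this release correctly, the 82574L is serviced by a single RX/TX queue pair, so its interrupt handler can only ever run on one CPU at a time; that would explain why all of the interrupt load lands on CPU 0. A couple of standard FreeBSD commands to see exactly which interrupt threads are eating that core (nothing pfSense-specific is assumed):

    # per-device interrupt counters and rates
    vmstat -i
    # per-CPU view including kernel interrupt threads (look for the irqNNN: em2 entries)
    top -SHP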
    
    
