VLAN Interrupt storm solutions? pf 2.03 / X7SPA-HF / Intel 82574L
-
I have 2x Intel 82574L NICs on a Supermicro D510.
em0 WAN1 (2 VLANS)
em1 WAN2 + LAN (3 VLANS)
em1 gets interrupt storms and freezes the entire box.
This happens at low throughput but with a high number of connections.
No IRQs are shared, BIOS manages IRQs and unused devices are off.
Checking or unchecking the following makes no difference
Device polling
Hardware Checksum Offloading
Hardware TCP Segmentation Offloading
Hardware Large Receive Offloading
Is there a software change that would fix this?
-
@themisa:
em1 gets interrupt storms and freezes the entire box.
. . .
No IRQs are shared, BIOS manages IRQs and unused devices are off.
If em1 truly is the sole user of its irq, then the interrupt storm report suggests the em driver isn't correctly clearing an interrupt condition on em1. PERHAPS the newer device driver in pfSense 2.1 snapshot builds will work better.
-
i use a big chunk of pf features, chances are 2.1 will just introduce new issues
this is an older server grade motherboard with an intel driver, if it's not compatible i don't know what is
are there any useful commands i can run to narrow down the issue to something more specific?
-
@themisa:
this is an older server grade motherboard with an intel driver, if it's not compatible i don't know what is
New devices are not always "compatible" with old device drivers. pfSense 2.0.x is based on FreeBSD 8.1 which is about two years older than FreeBSD 8.3 used in the pfSense 2.1 snapshot builds.
@themisa:
are there any useful commands i can run to narrow down the issue to something more specific?
Some of the em devices have a feature called "interrupt moderation" - interrupts are delayed a programmable period to reduce the overhead of the interrupt by giving a single interrupt more work to do. For example, on a busy interface by delaying an interrupt request by (say) 100 microseconds there might be 10 receive packets available for processing rather than one. Please post the output of pfSense shell command
```
sysctl -a | grep em1
```
PERHAPS tweaking the interrupt moderation will reduce the interrupt storm reports. It could also be useful to post the output of pfSense shell commands:
```
ifconfig
vmstat -i
```
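To make the interrupt moderation arithmetic concrete, here is a minimal Python sketch. It assumes packets arrive at a perfectly steady rate and that every interrupt is held back by one fixed delay. The function names and the 100,000 packets/s figure are illustrative assumptions and are not taken from the em driver.

```python
def packets_per_interrupt(packets_per_second: float, delay_us: float) -> float:
    """Estimate how many packets accumulate while one interrupt is delayed.

    Assumes a steady arrival rate; real traffic is bursty, so treat the
    result as a rough average rather than a guarantee.
    """
    return packets_per_second * delay_us / 1_000_000


def interrupts_per_second(packets_per_second: float, delay_us: float) -> float:
    """Estimate the interrupt rate once moderation batches the packets."""
    # With no delay (or a very low packet rate) every packet raises its own interrupt.
    batch = max(packets_per_interrupt(packets_per_second, delay_us), 1.0)
    return packets_per_second / batch


if __name__ == "__main__":
    # Hypothetical load: 100,000 packets/s with a 100 microsecond delay.
    print(packets_per_interrupt(100_000, 100))
    print(interrupts_per_second(100_000, 100))
    # The same load with the delay set to 0 (no moderation).
    print(interrupts_per_second(100_000, 0))
```

Under these assumptions a 100 microsecond delay turns ten packets into one interrupt, which matches the example above; with a delay of 0 each packet costs its own interrupt.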
-
interrupt total rate
irq18: ehci0 uhci5 2 0
irq19: uhci2 uhci4+ 19 0
irq23: uhci3 ehci1 97485 2
cpu0: timer 17324399 399
irq256: em0:rx 0 5371292 124
irq257: em0:tx 0 5694835 131
irq258: em0:link 15281 0
irq259: em1:rx 0 2296709 53
irq260: em1:tx 0 2130004 49
irq261: em1:link 19916 0
cpu1: timer 17320395 399
cpu2: timer 17320395 399
cpu3: timer 17320395 399
Total 84911127 1960
"sysctl -a | grep em1" doesn't find anything
thanks for trying wallabybob
time to start thinking about new hardware
-
Try em.1:
[2.0.3-RELEASE][root@pfsense.fire.box]/root(1): sysctl -a|grep em1
[2.0.3-RELEASE][root@pfsense.fire.box]/root(2): sysctl -a | grep em.1
dev.em.1.%desc: Intel(R) PRO/1000 Legacy Network Connection 1.0.4
dev.em.1.%driver: em
dev.em.1.%location: slot=14 function=0
dev.em.1.%pnpinfo: vendor=0x8086 device=0x1079 subvendor=0x8086 subdevice=0x1179 class=0x020000
dev.em.1.%parent: pci3
dev.em.1.nvm: -1
dev.em.1.rx_int_delay: 0
dev.em.1.tx_int_delay: 66
dev.em.1.rx_abs_int_delay: 66
dev.em.1.tx_abs_int_delay: 66
dev.em.1.rx_processing_limit: 100
dev.em.1.flow_control: 3
dev.em.1.mbuf_alloc_fail: 0
dev.em.1.cluster_alloc_fail: 0
dev.em.1.dropped: 0
dev.em.1.tx_dma_fail: 0
dev.em.1.tx_desc_fail1: 0
dev.em.1.tx_desc_fail2: 0
dev.em.1.rx_overruns: 0
dev.em.1.watchdog_timeouts: 0
dev.em.1.device_control: 1492124233
dev.em.1.rx_control: 32770
dev.em.1.fc_high_water: 47104
dev.em.1.fc_low_water: 45604
dev.em.1.fifo_workaround: 0
dev.em.1.fifo_reset: 0
dev.em.1.txd_head: 83
dev.em.1.txd_tail: 84
dev.em.1.rxd_head: 191
dev.em.1.rxd_tail: 190
dev.em.1.mac_stats.excess_coll: 0
dev.em.1.mac_stats.single_coll: 0
dev.em.1.mac_stats.multiple_coll: 0
dev.em.1.mac_stats.late_coll: 0
dev.em.1.mac_stats.collision_count: 0
dev.em.1.mac_stats.symbol_errors: 0
dev.em.1.mac_stats.sequence_errors: 0
dev.em.1.mac_stats.defer_count: 0
dev.em.1.mac_stats.missed_packets: 0
dev.em.1.mac_stats.recv_no_buff: 0
dev.em.1.mac_stats.recv_undersize: 0
dev.em.1.mac_stats.recv_fragmented: 0
dev.em.1.mac_stats.recv_oversize: 0
dev.em.1.mac_stats.recv_jabber: 0
dev.em.1.mac_stats.recv_errs: 0
dev.em.1.mac_stats.crc_errs: 0
dev.em.1.mac_stats.alignment_errs: 0
dev.em.1.mac_stats.coll_ext_errs: 0
dev.em.1.mac_stats.xon_recvd: 0
dev.em.1.mac_stats.xon_txd: 0
dev.em.1.mac_stats.xoff_recvd: 0
dev.em.1.mac_stats.xoff_txd: 0
dev.em.1.mac_stats.total_pkts_recvd: 34413999
dev.em.1.mac_stats.good_pkts_recvd: 34413999
dev.em.1.mac_stats.bcast_pkts_recvd: 32180
dev.em.1.mac_stats.mcast_pkts_recvd: 0
dev.em.1.mac_stats.rx_frames_64: 6363096
dev.em.1.mac_stats.rx_frames_65_127: 17326141
dev.em.1.mac_stats.rx_frames_128_255: 6554914
dev.em.1.mac_stats.rx_frames_256_511: 1041309
dev.em.1.mac_stats.rx_frames_512_1023: 1386613
dev.em.1.mac_stats.rx_frames_1024_1522: 1741926
dev.em.1.mac_stats.good_octets_recvd: 6621995344
dev.em.1.mac_stats.good_octets_txd: 28106065051
dev.em.1.mac_stats.total_pkts_txd: 40379356
dev.em.1.mac_stats.good_pkts_txd: 40379356
dev.em.1.mac_stats.bcast_pkts_txd: 26373
dev.em.1.mac_stats.mcast_pkts_txd: 5
dev.em.1.mac_stats.tx_frames_64: 1438817
dev.em.1.mac_stats.tx_frames_65_127: 12731412
dev.em.1.mac_stats.tx_frames_128_255: 6641964
dev.em.1.mac_stats.tx_frames_256_511: 1181055
dev.em.1.mac_stats.tx_frames_512_1023: 1178482
dev.em.1.mac_stats.tx_frames_1024_1522: 17207626
dev.em.1.mac_stats.tso_txd: 0
dev.em.1.mac_stats.tso_ctx_fail: 0
Steve
-
good catch
em0 gets interrupt storms not em1, sorry, i had them reversed, it's like this
em0 WAN2 + LAN (3 VLANS)
em1 WAN1 (2 VLANS)
so em0 has more traffic due to lan.
dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 7.3.2
dev.em.0.%driver: em
dev.em.0.%location: slot=0 function=0
dev.em.0.%pnpinfo: vendor=0x8086 device=0x10d3 subvendor=0x15d9 subdevice=0x060a class=0x020000
dev.em.0.%parent: pci2
dev.em.0.nvm: -1
dev.em.0.debug: -1
dev.em.0.fc: 3
dev.em.0.rx_int_delay: 0
dev.em.0.tx_int_delay: 66
dev.em.0.rx_abs_int_delay: 66
dev.em.0.tx_abs_int_delay: 66
dev.em.0.rx_processing_limit: 100
dev.em.0.eee_control: 0
dev.em.0.link_irq: 76
dev.em.0.mbuf_alloc_fail: 0
dev.em.0.cluster_alloc_fail: 0
dev.em.0.dropped: 0
dev.em.0.tx_dma_fail: 0
dev.em.0.rx_overruns: 0
dev.em.0.watchdog_timeouts: 0
dev.em.0.device_control: 1074790984
dev.em.0.rx_control: 67403778
dev.em.0.fc_high_water: 18432
dev.em.0.fc_low_water: 16932
dev.em.0.queue0.txd_head: 14
dev.em.0.queue0.txd_tail: 14
dev.em.0.queue0.tx_irq: 106328859
dev.em.0.queue0.no_desc_avail: 0
dev.em.0.queue0.rxd_head: 541
dev.em.0.queue0.rxd_tail: 540
dev.em.0.queue0.rx_irq: 120560838
dev.em.0.mac_stats.excess_coll: 0
dev.em.0.mac_stats.single_coll: 0
dev.em.0.mac_stats.multiple_coll: 0
dev.em.0.mac_stats.late_coll: 0
dev.em.0.mac_stats.collision_count: 0
dev.em.0.mac_stats.symbol_errors: 0
dev.em.0.mac_stats.sequence_errors: 0
dev.em.0.mac_stats.defer_count: 0
dev.em.0.mac_stats.missed_packets: 309717
dev.em.0.mac_stats.recv_no_buff: 119595
dev.em.0.mac_stats.recv_undersize: 0
dev.em.0.mac_stats.recv_fragmented: 0
dev.em.0.mac_stats.recv_oversize: 0
dev.em.0.mac_stats.recv_jabber: 0
dev.em.0.mac_stats.recv_errs: 0
dev.em.0.mac_stats.crc_errs: 0
dev.em.0.mac_stats.alignment_errs: 0
dev.em.0.mac_stats.coll_ext_errs: 0
dev.em.0.mac_stats.xon_recvd: 0
dev.em.0.mac_stats.xon_txd: 0
dev.em.0.mac_stats.xoff_recvd: 0
dev.em.0.mac_stats.xoff_txd: 0
dev.em.0.mac_stats.total_pkts_recvd: 193440710
dev.em.0.mac_stats.good_pkts_recvd: 193130993
dev.em.0.mac_stats.bcast_pkts_recvd: 32757
dev.em.0.mac_stats.mcast_pkts_recvd: 467191
dev.em.0.mac_stats.rx_frames_64: 16348866
dev.em.0.mac_stats.rx_frames_65_127: 96904487
dev.em.0.mac_stats.rx_frames_128_255: 47080072
dev.em.0.mac_stats.rx_frames_256_511: 12219416
dev.em.0.mac_stats.rx_frames_512_1023: 3441920
dev.em.0.mac_stats.rx_frames_1024_1522: 17136232
dev.em.0.mac_stats.good_octets_recvd: 48932482221
dev.em.0.mac_stats.good_octets_txd: 86853271375
dev.em.0.mac_stats.total_pkts_txd: 159394546
dev.em.0.mac_stats.good_pkts_txd: 159394546
dev.em.0.mac_stats.bcast_pkts_txd: 11344
dev.em.0.mac_stats.mcast_pkts_txd: 114483
dev.em.0.mac_stats.tx_frames_64: 2459808
dev.em.0.mac_stats.tx_frames_65_127: 59291145
dev.em.0.mac_stats.tx_frames_128_255: 25944268
dev.em.0.mac_stats.tx_frames_256_511: 22065346
dev.em.0.mac_stats.tx_frames_512_1023: 5965944
dev.em.0.mac_stats.tx_frames_1024_1522: 43668035
dev.em.0.mac_stats.tso_txd: 0
dev.em.0.mac_stats.tso_ctx_fail: 0
dev.em.0.interrupts.asserts: 77
dev.em.0.interrupts.rx_pkt_timer: 0
dev.em.0.interrupts.rx_abs_timer: 0
dev.em.0.interrupts.tx_pkt_timer: 0
dev.em.0.interrupts.tx_abs_timer: 0
dev.em.0.interrupts.tx_queue_empty: 0
dev.em.0.interrupts.tx_queue_min_thresh: 0
dev.em.0.interrupts.rx_desc_min_thresh: 0
dev.em.0.interrupts.rx_overrun: 0
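For anyone wanting to pull the interesting counters out of a dump like the em0 output above, a small Python sketch follows. It is only an illustration: the regular expression, the helper names `parse_counters` and `missed_ratio`, and the choice of counters in `SAMPLE` are assumptions made here, and the sample carries just a handful of the values posted above.

```python
import re

# A few counters copied from the em0 output above; the rest are omitted for brevity.
SAMPLE = (
    "dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 7.3.2 "
    "dev.em.0.rx_int_delay: 0 "
    "dev.em.0.mac_stats.missed_packets: 309717 "
    "dev.em.0.mac_stats.recv_no_buff: 119595 "
    "dev.em.0.mac_stats.total_pkts_recvd: 193440710 "
    "dev.em.0.mac_stats.good_pkts_recvd: 193130993"
)

# Match "dev.em.N.some.name: <integer>"; string-valued entries such as %desc are skipped.
COUNTER_RE = re.compile(r"(dev\.em\.\d+\.[\w.%]+):\s+(-?\d+)(?!\S)")


def parse_counters(text):
    """Return {sysctl_name: int_value} for every integer-valued dev.em entry."""
    return {name: int(value) for name, value in COUNTER_RE.findall(text)}


def missed_ratio(counters, unit=0):
    """Fraction of received packets that the NIC counted as missed."""
    prefix = f"dev.em.{unit}.mac_stats."
    return counters[prefix + "missed_packets"] / counters[prefix + "total_pkts_recvd"]


if __name__ == "__main__":
    counters = parse_counters(SAMPLE)
    print(f"missed: {missed_ratio(counters):.4%}")
```

On the posted figures the missed packets come to about 0.16% of everything received, and the gap between total_pkts_recvd and good_pkts_recvd is exactly the missed_packets count. Because the pattern tolerates any whitespace between entries, it works on both the flattened paste and one-entry-per-line output.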
-
The vmstat output suggests the interrupt storm was short lived. It also shows em1 has three distinct interrupt vectors. Please post the exact text of the interrupt storm message.
Thanks Steve for correcting the grep parameter. It appears that em1 supports interrupt moderation, with the capability of delaying receive and transmit interrupts by up to 66 microseconds.
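The reading of the vmstat output can be checked with a short Python sketch. It parses a few rows copied from the `vmstat -i` listing earlier in the thread and estimates uptime from the timer counter, relying on the rate column being an average since boot. The helper names and the choice of rows are assumptions made for illustration.

```python
# Rows copied from the vmstat -i output posted earlier in the thread.
VMSTAT_SAMPLE = """\
interrupt total rate
cpu0: timer 17324399 399
irq256: em0:rx 0 5371292 124
irq257: em0:tx 0 5694835 131
irq259: em1:rx 0 2296709 53
irq260: em1:tx 0 2130004 49
"""


def parse_vmstat_i(text):
    """Return {source: (total, rate)} from `vmstat -i` style output."""
    rows = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 2)  # the source name may itself contain spaces
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue  # skip the header and any malformed lines
        rows[parts[0]] = (int(parts[1]), int(parts[2]))
    return rows


def approx_uptime_seconds(rows, source="cpu0: timer"):
    """Estimate uptime as total / average rate for a steadily ticking source."""
    total, rate = rows[source]
    return total / rate


if __name__ == "__main__":
    rows = parse_vmstat_i(VMSTAT_SAMPLE)
    print(f"uptime is roughly {approx_uptime_seconds(rows) / 3600:.1f} hours")
    for source, (total, rate) in sorted(rows.items(), key=lambda kv: -kv[1][1]):
        print(f"{source:20s} {total:>12d} {rate:>6d}/s")
```

With the figures posted, the timer counters imply roughly 12 hours of uptime, so an average of 124 interrupts/s on em0:rx says little about short bursts. Sampling `vmstat -i` twice a few seconds apart and differencing the totals would show the rate during a storm.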
-
edited the previous post, em0 is the culprit
'interrupt storm detected on irq256' is what it says.
more often there is no message, but the cpu gets maxed out handling irqs: not long enough for me to even observe this in top, but long enough to break both gateways, since pf can't ping them when this happens
it looks like "dev.em.0.rx_int_delay: 0" is the setting that would affect the problematic "irq256: em0:rx"
but according to intel messing with "RxIntDelay" has the potential of hanging the adapter
http://www.intel.com/support/network/adapter/pro100/sb/cs-032516.htm
anyone have experience with this setting?
-
Perhaps a suitable workaround would be to go to System -> Routing, click on the Gateways tab, and edit your gateways to increase the Frequency Probe (more correctly called the Probe Interval) and the Down time, so the gateway monitoring is a bit more robust over busy periods.