VLAN Interrupt storm solutions? pf 2.03 / X7SPA-HF / Intel 82574L



  • I have 2x Intel 82574L NICs on a Supermicro D510.

    em0 WAN1 (2 VLANS)
    em1 WAN2 + LAN (3 VLANS)

    em1 gets interrupt storms and freezes the entire box.
    This happens at low throughput but with a high number of connections.

    No IRQs are shared, BIOS manages IRQs and unused devices are off.

    Checking or unchecking the following makes no difference:
    Device polling
    Hardware Checksum Offloading
    Hardware TCP Segmentation Offloading
    Hardware Large Receive Offloading

    Is there a software change that would fix this?



  • @themisa:

    em1 gets interrupt storms and freezes the entire box.
    . . .
    No IRQs are shared, BIOS manages IRQs and unused devices are off.

    If em1 truly is the sole user of its irq then the interrupt storm report suggests the em driver isn't correctly clearing an interrupt condition on em1. PERHAPS the newer device driver in pfSense 2.1 snapshot builds will work better.



  • i use a big chunk of pf features, chances are 2.1 will just introduce new issues
    this is an older server grade motherboard with an intel driver, if it's not compatible i don't know what is

    are there any useful commands i can run to narrow down the issue to something more specific?



  • @themisa:

    this is an older server grade motherboard with an intel driver, if it's not compatible i don't know what is

    New devices are not always "compatible" with old device drivers. pfSense 2.0.x is based on FreeBSD 8.1 which is about two years older than FreeBSD 8.3 used in the pfSense 2.1 snapshot builds.

    @themisa:

    are there any useful commands i can run to narrow down the issue to something more specific?

    Some of the em devices have a feature called "interrupt moderation": interrupts are delayed for a programmable period to reduce per-interrupt overhead by giving a single interrupt more work to do. For example, on a busy interface, delaying an interrupt request by (say) 100 microseconds might leave 10 received packets available for processing rather than one. Please post the output of the pfSense shell command:

    ```
    sysctl -a | grep em1
    ```

    PERHAPS tweaking the interrupt moderation will reduce the interrupt storm reports.
    
    It could also be useful to post the output of the pfSense shell commands:

    ```
    ifconfig
    vmstat -i
    ```

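    The arithmetic behind the interrupt moderation example can be sketched as below. This is only a rough illustration: the function name `packets_per_interrupt` and the packet rate of 100,000 packets per second are assumptions chosen to reproduce the "10 packets per 100 microseconds" figure, not values measured on this box.

    ```shell
    # Estimate how many packets one moderated interrupt would service.
    # Usage: packets_per_interrupt <packets_per_second> <delay_in_microseconds>
    # Both arguments are illustrative inputs, not values measured in this thread.
    packets_per_interrupt() {
        awk -v pps="$1" -v delay_us="$2" \
            'BEGIN { printf "%.1f\n", pps * delay_us / 1000000 }'
    }

    # 100,000 packets/s with a 100 microsecond delay
    packets_per_interrupt 100000 100
    ```

    With no delay, every received packet can raise its own interrupt, so the interrupt rate tracks the packet rate.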

  • interrupt                  total  rate
    irq18: ehci0 uhci5             2     0
    irq19: uhci2 uhci4+           19     0
    irq23: uhci3 ehci1         97485     2
    cpu0: timer             17324399   399
    irq256: em0:rx 0         5371292   124
    irq257: em0:tx 0         5694835   131
    irq258: em0:link           15281     0
    irq259: em1:rx 0         2296709    53
    irq260: em1:tx 0         2130004    49
    irq261: em1:link           19916     0
    cpu1: timer             17320395   399
    cpu2: timer             17320395   399
    cpu3: timer             17320395   399
    Total                   84911127  1960

    "sysctl -a | grep em1" doesn't find anything
    thanks for trying wallabybob
    time to start thinking new hardware


  • Netgate Administrator

    Try em.1:

    [2.0.3-RELEASE][root@pfsense.fire.box]/root(1): sysctl -a|grep em1
    [2.0.3-RELEASE][root@pfsense.fire.box]/root(2): sysctl -a | grep em.1
    dev.em.1.%desc: Intel(R) PRO/1000 Legacy Network Connection 1.0.4
    dev.em.1.%driver: em
    dev.em.1.%location: slot=14 function=0
    dev.em.1.%pnpinfo: vendor=0x8086 device=0x1079 subvendor=0x8086 subdevice=0x1179 class=0x020000
    dev.em.1.%parent: pci3
    dev.em.1.nvm: -1
    dev.em.1.rx_int_delay: 0
    dev.em.1.tx_int_delay: 66
    dev.em.1.rx_abs_int_delay: 66
    dev.em.1.tx_abs_int_delay: 66
    dev.em.1.rx_processing_limit: 100
    dev.em.1.flow_control: 3
    dev.em.1.mbuf_alloc_fail: 0
    dev.em.1.cluster_alloc_fail: 0
    dev.em.1.dropped: 0
    dev.em.1.tx_dma_fail: 0
    dev.em.1.tx_desc_fail1: 0
    dev.em.1.tx_desc_fail2: 0
    dev.em.1.rx_overruns: 0
    dev.em.1.watchdog_timeouts: 0
    dev.em.1.device_control: 1492124233
    dev.em.1.rx_control: 32770
    dev.em.1.fc_high_water: 47104
    dev.em.1.fc_low_water: 45604
    dev.em.1.fifo_workaround: 0
    dev.em.1.fifo_reset: 0
    dev.em.1.txd_head: 83
    dev.em.1.txd_tail: 84
    dev.em.1.rxd_head: 191
    dev.em.1.rxd_tail: 190
    dev.em.1.mac_stats.excess_coll: 0
    dev.em.1.mac_stats.single_coll: 0
    dev.em.1.mac_stats.multiple_coll: 0
    dev.em.1.mac_stats.late_coll: 0
    dev.em.1.mac_stats.collision_count: 0
    dev.em.1.mac_stats.symbol_errors: 0
    dev.em.1.mac_stats.sequence_errors: 0
    dev.em.1.mac_stats.defer_count: 0
    dev.em.1.mac_stats.missed_packets: 0
    dev.em.1.mac_stats.recv_no_buff: 0
    dev.em.1.mac_stats.recv_undersize: 0
    dev.em.1.mac_stats.recv_fragmented: 0
    dev.em.1.mac_stats.recv_oversize: 0
    dev.em.1.mac_stats.recv_jabber: 0
    dev.em.1.mac_stats.recv_errs: 0
    dev.em.1.mac_stats.crc_errs: 0
    dev.em.1.mac_stats.alignment_errs: 0
    dev.em.1.mac_stats.coll_ext_errs: 0
    dev.em.1.mac_stats.xon_recvd: 0
    dev.em.1.mac_stats.xon_txd: 0
    dev.em.1.mac_stats.xoff_recvd: 0
    dev.em.1.mac_stats.xoff_txd: 0
    dev.em.1.mac_stats.total_pkts_recvd: 34413999
    dev.em.1.mac_stats.good_pkts_recvd: 34413999
    dev.em.1.mac_stats.bcast_pkts_recvd: 32180
    dev.em.1.mac_stats.mcast_pkts_recvd: 0
    dev.em.1.mac_stats.rx_frames_64: 6363096
    dev.em.1.mac_stats.rx_frames_65_127: 17326141
    dev.em.1.mac_stats.rx_frames_128_255: 6554914
    dev.em.1.mac_stats.rx_frames_256_511: 1041309
    dev.em.1.mac_stats.rx_frames_512_1023: 1386613
    dev.em.1.mac_stats.rx_frames_1024_1522: 1741926
    dev.em.1.mac_stats.good_octets_recvd: 6621995344
    dev.em.1.mac_stats.good_octets_txd: 28106065051
    dev.em.1.mac_stats.total_pkts_txd: 40379356
    dev.em.1.mac_stats.good_pkts_txd: 40379356
    dev.em.1.mac_stats.bcast_pkts_txd: 26373
    dev.em.1.mac_stats.mcast_pkts_txd: 5
    dev.em.1.mac_stats.tx_frames_64: 1438817
    dev.em.1.mac_stats.tx_frames_65_127: 12731412
    dev.em.1.mac_stats.tx_frames_128_255: 6641964
    dev.em.1.mac_stats.tx_frames_256_511: 1181055
    dev.em.1.mac_stats.tx_frames_512_1023: 1178482
    dev.em.1.mac_stats.tx_frames_1024_1522: 17207626
    dev.em.1.mac_stats.tso_txd: 0
    dev.em.1.mac_stats.tso_ctx_fail: 0
    
    

    Steve



  • good catch

    em0 gets interrupt storms not em1, sorry, i had them reversed, it's like this

    em0 WAN2 + LAN (3 VLANS)
    em1 WAN1 (2 VLANS) 
    so em0 has more traffic due to lan.

    
    dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 7.3.2
    dev.em.0.%driver: em
    dev.em.0.%location: slot=0 function=0
    dev.em.0.%pnpinfo: vendor=0x8086 device=0x10d3 subvendor=0x15d9 subdevice=0x060a class=0x020000
    dev.em.0.%parent: pci2
    dev.em.0.nvm: -1
    dev.em.0.debug: -1
    dev.em.0.fc: 3
    dev.em.0.rx_int_delay: 0
    dev.em.0.tx_int_delay: 66
    dev.em.0.rx_abs_int_delay: 66
    dev.em.0.tx_abs_int_delay: 66
    dev.em.0.rx_processing_limit: 100
    dev.em.0.eee_control: 0
    dev.em.0.link_irq: 76
    dev.em.0.mbuf_alloc_fail: 0
    dev.em.0.cluster_alloc_fail: 0
    dev.em.0.dropped: 0
    dev.em.0.tx_dma_fail: 0
    dev.em.0.rx_overruns: 0
    dev.em.0.watchdog_timeouts: 0
    dev.em.0.device_control: 1074790984
    dev.em.0.rx_control: 67403778
    dev.em.0.fc_high_water: 18432
    dev.em.0.fc_low_water: 16932
    dev.em.0.queue0.txd_head: 14
    dev.em.0.queue0.txd_tail: 14
    dev.em.0.queue0.tx_irq: 106328859
    dev.em.0.queue0.no_desc_avail: 0
    dev.em.0.queue0.rxd_head: 541
    dev.em.0.queue0.rxd_tail: 540
    dev.em.0.queue0.rx_irq: 120560838
    dev.em.0.mac_stats.excess_coll: 0
    dev.em.0.mac_stats.single_coll: 0
    dev.em.0.mac_stats.multiple_coll: 0
    dev.em.0.mac_stats.late_coll: 0
    dev.em.0.mac_stats.collision_count: 0
    dev.em.0.mac_stats.symbol_errors: 0
    dev.em.0.mac_stats.sequence_errors: 0
    dev.em.0.mac_stats.defer_count: 0
    dev.em.0.mac_stats.missed_packets: 309717
    dev.em.0.mac_stats.recv_no_buff: 119595
    dev.em.0.mac_stats.recv_undersize: 0
    dev.em.0.mac_stats.recv_fragmented: 0
    dev.em.0.mac_stats.recv_oversize: 0
    dev.em.0.mac_stats.recv_jabber: 0
    dev.em.0.mac_stats.recv_errs: 0
    dev.em.0.mac_stats.crc_errs: 0
    dev.em.0.mac_stats.alignment_errs: 0
    dev.em.0.mac_stats.coll_ext_errs: 0
    dev.em.0.mac_stats.xon_recvd: 0
    dev.em.0.mac_stats.xon_txd: 0
    dev.em.0.mac_stats.xoff_recvd: 0
    dev.em.0.mac_stats.xoff_txd: 0
    dev.em.0.mac_stats.total_pkts_recvd: 193440710
    dev.em.0.mac_stats.good_pkts_recvd: 193130993
    dev.em.0.mac_stats.bcast_pkts_recvd: 32757
    dev.em.0.mac_stats.mcast_pkts_recvd: 467191
    dev.em.0.mac_stats.rx_frames_64: 16348866
    dev.em.0.mac_stats.rx_frames_65_127: 96904487
    dev.em.0.mac_stats.rx_frames_128_255: 47080072
    dev.em.0.mac_stats.rx_frames_256_511: 12219416
    dev.em.0.mac_stats.rx_frames_512_1023: 3441920
    dev.em.0.mac_stats.rx_frames_1024_1522: 17136232
    dev.em.0.mac_stats.good_octets_recvd: 48932482221
    dev.em.0.mac_stats.good_octets_txd: 86853271375
    dev.em.0.mac_stats.total_pkts_txd: 159394546
    dev.em.0.mac_stats.good_pkts_txd: 159394546
    dev.em.0.mac_stats.bcast_pkts_txd: 11344
    dev.em.0.mac_stats.mcast_pkts_txd: 114483
    dev.em.0.mac_stats.tx_frames_64: 2459808
    dev.em.0.mac_stats.tx_frames_65_127: 59291145
    dev.em.0.mac_stats.tx_frames_128_255: 25944268
    dev.em.0.mac_stats.tx_frames_256_511: 22065346
    dev.em.0.mac_stats.tx_frames_512_1023: 5965944
    dev.em.0.mac_stats.tx_frames_1024_1522: 43668035
    dev.em.0.mac_stats.tso_txd: 0
    dev.em.0.mac_stats.tso_ctx_fail: 0
    dev.em.0.interrupts.asserts: 77
    dev.em.0.interrupts.rx_pkt_timer: 0
    dev.em.0.interrupts.rx_abs_timer: 0
    dev.em.0.interrupts.tx_pkt_timer: 0
    dev.em.0.interrupts.tx_abs_timer: 0
    dev.em.0.interrupts.tx_queue_empty: 0
    dev.em.0.interrupts.tx_queue_min_thresh: 0
    dev.em.0.interrupts.rx_desc_min_thresh: 0
    dev.em.0.interrupts.rx_overrun: 0
    
    


  • The vmstat output suggests the interrupt storm was short lived. It also shows em1 has three distinct interrupt vectors. Please post the exact text of the interrupt storm message.

    Thanks Steve for correcting the grep parameter. It appears that em1 supports interrupt moderation with capability of delaying receive interrupts and transmit interrupts by up to 66 microseconds.
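
    The rate column of `vmstat -i` is an average since boot, so a short burst barely moves it. A rough way to see the current rate is to take two samples a known number of seconds apart and divide the difference in the total column by the interval. A minimal sketch, where the function name `irq_rate`, the second total and the 10-second gap are all made up for illustration:

    ```shell
    # Average interrupts per second between two `vmstat -i` samples.
    # Usage: irq_rate <first_total> <second_total> <seconds_between_samples>
    irq_rate() {
        awk -v first="$1" -v second="$2" -v secs="$3" \
            'BEGIN { printf "%d\n", (second - first) / secs }'
    }

    # The first total is the irq256 figure posted earlier in the thread; the
    # second total and the 10-second gap are hypothetical.
    irq_rate 5371292 5381292 10
    ```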



  • edited the previous post, em0 is the culprit
    'interrupt storm detected on irq256' is what it says.

    more often there is no message but cpu gets maxed out handling irqs not long enough for me to even observe this in top but long enough to break both gateways since pf can't ping them when this happens

    it looks like "dev.em.0.rx_int_delay: 0" is the setting that would affect the problematic "irq256: em0:rx"
    but according to intel messing with "RxIntDelay" has the potential of hanging the adapter
    http://www.intel.com/support/network/adapter/pro100/sb/cs-032516.htm

    anyone have experience with this setting?
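
    For reference, a minimal sketch of how the setting could be inspected and changed at runtime, assuming the `dev.em.0.rx_int_delay` OID from the listing above is writable on this build. The helper name `em_rx_delay_cmds` and the trial value 32 are hypothetical, not a recommendation, and Intel's warning about hangs still applies.

    ```shell
    # Dry run: print the sysctl commands instead of executing them, so nothing
    # on the firewall changes until the lines are reviewed and pasted by hand.
    # Usage: em_rx_delay_cmds <em_unit_number> <trial_delay_value>
    em_rx_delay_cmds() {
        oid="dev.em.$1.rx_int_delay"
        echo "sysctl ${oid}"        # read the current value
        echo "sysctl ${oid}=$2"     # set a trial value (assumes the OID is writable)
    }

    # em0 with a hypothetical trial value of 32
    em_rx_delay_cmds 0 32
    ```

    A value set this way would not persist across a reboot, which makes it easy to back out if the adapter misbehaves.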



  • Perhaps a suitable workaround would be to go to System -> Routing, click the Gateways tab, and edit your gateways to increase the Frequency Probe (more correctly called the Probe Interval) and the Down time, so that gateway monitoring is more robust during busy periods.

