High traffic irq problem (no storm)



  • Hello Guys

    I have 2 servers with pfSense 2.0.1 in routing mode (MASTER/SLAVE with CARP)
    for about 50 servers, most of them DNS servers.

    The traffic is 'very high': about 50Mbit of constant traffic (a lot of UDP traffic).

    Each pfSense box is a 3GHz Xeon with 4 GB of RAM (HP DL 360 server), running CARP (without state sync).

    Each pfSense box has 3 network cards (2 integrated Broadcom gigabit (bge) and 1 PCI Intel (em)):

    bge0 is the WAN (public network)
    bge1 carries 20 VLANs (servers)
    em0 is a mng interface

    Yesterday we had a peak of 70Mbit and pfSense started to hang/freeze!
    In particular, the percentage of irq went up to around 50%.

    vmstat -i
    interrupt          total          rate
    irq1: atkbd0       18             0
    irq14: ata0        68             0
    irq16: uhci0       17             0
    irq23: ehci0       2              0
    irq24: ciss0       58807455       4
    irq25: bge0        1667747890     123
    irq26: bge1        1694685519     125
    irq48: em0         777403204      57
    cpu0: timer        27034549131    2000
    Total              31233193304    2310

    polling/tco and some other advanced tuning options do not provide improvements.

    How can I handle 100Mbit of real internet traffic? :)

    Thanks to anyone who can give me a hint.



  • I suppose you have already looked at the NIC optimization page about MBUFs and queues, and that you have changed your BIOS so that it's not set to plug-and-play aware?



  • @bsd3000:

    em0 is a mng interface

    mng = management?

    @bsd3000:

    Yesterday we had a peak of 70Mbit and pfSense started to hang/freeze!

    70Mbit (presumably 70Mbps) between WAN and servers? Reported by pfSense RRD graph, Traffic -> WAN? How long did the peak appear to last?

    what did you observe that you now describe as hang/freeze? GUI doesn't respond? ssh session stalls? console keyboard doesn't respond to Enter key? Console keyboard Caps Lock indicator doesn't change with presses of Caps Lock key? etc

    @bsd3000:

    In particular, the percentage of irq went up to around 50%.

    as reported by top? If so, what was identified as major CPU user? (and if the system was truly "frozen" what would be reporting?)

    @bsd3000:

    vmstat -i
    interrupt          total          rate
    irq1: atkbd0       18             0
    irq14: ata0        68             0
    irq16: uhci0       17             0
    irq23: ehci0       2              0
    irq24: ciss0       58807455       4
    irq25: bge0        1667747890     123
    irq26: bge1        1694685519     125
    irq48: em0         777403204      57
    cpu0: timer        27034549131    2000
    Total              31233193304    2310

    These interrupt rates are not significant, but because they are averaged since boot time, spikes won't show up here.
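
    One way to see a spike is to take two `vmstat -i` samples a known interval apart and divide the difference in totals by the interval. A minimal Python sketch, assuming the usual "source, total, rate" column layout; the function names and the two 10-second samples are invented for illustration and are not from the thread:

```python
def parse_vmstat_i(text):
    """Return {interrupt source: total count} from `vmstat -i` output."""
    totals = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only rows that end in two integers (total, rate); this skips the header.
        if len(parts) < 3 or not parts[-1].isdigit() or not parts[-2].isdigit():
            continue
        name = " ".join(parts[:-2])
        totals[name] = int(parts[-2])
    return totals


def interval_rates(before, after, seconds):
    """Interrupts per second for each source between two samples."""
    first = parse_vmstat_i(before)
    second = parse_vmstat_i(after)
    return {
        name: (second[name] - first[name]) / seconds
        for name in second
        if name in first
    }


# Hypothetical samples taken 10 seconds apart (totals invented for illustration).
sample_1 = """interrupt total rate
irq25: bge0 1667747890 123
irq26: bge1 1694685519 125
"""
sample_2 = """interrupt total rate
irq25: bge0 1667947890 123
irq26: bge1 1694835519 125
"""

rates = interval_rates(sample_1, sample_2, 10)
for name, rate in sorted(rates.items()):
    print(f"{name:<14}{rate:>10.0f} irq/s")
```

    In this made-up example the since-boot rate column still reads 123 and 125, while the interval rate is in the tens of thousands per second.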

    @bsd3000:

    polling/tco and some other advanced tuning options do not provide improvements.

    Improvements as in doesn't hang/freeze?

    @bsd3000:

    How can I do to handling 100Mbit of real internet traffic? :)

    The CPU should be able to forward 100Mbps without much effort at all. I suspect the problem might be a resource exhaustion problem. Perhaps you don't have enough firewall states for the UDP traffic. You can view state use history at Status -> RRD Graphs, System tab, States graphs.

    Have you read http://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards



  • @wallabybob:

    mng = management?

    Yes

    @wallabybob:

    70Mbit (presumably 70Mbps) between WAN and servers? Reported by pfSense RRD graph, Traffic -> WAN? How long did the peak appear to last?

    30 mins of 70Mbits (normally I have 50Mbits)

    @wallabybob:

    what did you observe that you now describe as hang/freeze? GUI doesn't respond? ssh session stalls? console keyboard doesn't respond to Enter key? Console keyboard Caps Lock indicator doesn't change with presses of Caps Lock key? etc

    more than 40% packet loss from WAN to VLANs and from VLANs to WAN

    console keyboard doesn't respond to the Enter key, or responds after some seconds

    GUI doesn't respond, or responds after some seconds

    @wallabybob:

    as reported by top? If so, what was identified as major CPU user? (and if the system was truly "frozen" what would be reporting?)

    something like:
    50.0% system (I use device polling)
    50.0% interrupt

    50.58% idlepoll
      20.00% {irq26: bge1}
      40.00% {irq25: bge0}

    @wallabybob:

    The CPU should be able to forward 100Mbps without much effort at all. I suspect the problem might be a resource exhaustion problem. Perhaps you don't have enough firewall states for the UDP traffic. You can view state use history at Status -> RRD Graphs, System tab, States graphs.

    I see a peak of 100K states; normally I have:
    Show states  15123/385000
    MBUF Usage 7714/25600
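
    To put those figures next to their limits, a small Python sketch using only the numbers quoted above (the helper name is illustrative; the mbuf count during the peak is not given in the thread, so only the normal value is shown):

```python
def utilisation(used, limit):
    """Fraction of a fixed-size resource currently in use."""
    return used / limit


# (used, limit) pairs as reported in the thread.
figures = {
    "states (normal)": (15123, 385000),
    "states (peak)": (100000, 385000),
    "mbufs (normal)": (7714, 25600),
}

for label, (used, limit) in figures.items():
    print(f"{label:<16}{utilisation(used, limit):6.1%}")
```

    On these numbers the state table peaks at roughly a quarter of its limit, so state exhaustion alone does not look like the cause, though mbuf usage at the peak remains unknown.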

    @wallabybob:

    Have you read http://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards

    Yes

    Thanks in advance, I will try to tune better :/



  • The most obvious answer would be that the server is too slow to handle that amount of traffic. Take a look at CPU and memory usage when the problem is happening.

    I assume this is a single-core system? It's always recommended to have at least a dual core.



  • How can it be an HP DL 360 on one core? Unless it's running in a VM?
    In which case I'm going to ask: why do people keep forgetting to mention virtualization layers in their freaking specs?


  • Netgate Administrator

    No way that machine should be struggling with <100Mbps.
    Do not use device polling. Make sure polling is not active, I've found it can be a bit 'sticky' when I've tried it.

    Steve



  • 2 integrate broadcom gigabit (bge) and 1 pci intel (em)

    At the risk of sounding like a simpleton…  Can you fit a couple of dual-port Intel PCIe NICs in there?

    Go all Intel?


  • Netgate Administrator

    Re-reading this, if you've tried all the tuning options for your NICs, I'd next check the CARP interface.
    If you fail over to the other box, does the situation change?

    Steve



  • @stephenw10:

    No way that machine should be struggling with <100Mbps.
    Do not use device polling. Make sure polling is not active, I've found it can be a bit 'sticky' when I've tried it.

    Steve

    Well, enable some plugins and the 100Mbit speed is gone. Also, a single core is not really good for tasks like this, since there are always other things to do and making them wait does not help either. Disabling polling won't help much unless the card does not support it.


  • Netgate Administrator

    Device polling is not enabled by default and there is very little advantage to enabling it in almost every case. In most cases it makes things worse, sometimes a lot worse!
    A bit old now but see: http://blog.pfsense.org/?p=115
    And more recently: http://blog.pfsense.org/?p=115#comment-21378

    I just noticed your traffic is almost all DNS. In that case the total bandwidth is probably less significant than the packets per second. A very high number of small packets will cause a high interrupt load.

    Steve
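
    The packets-per-second point can be made concrete with a little arithmetic. A minimal Python sketch, assuming an average of about 100 bytes for a small DNS packet and 1500 bytes for a full-size packet; both sizes are assumptions for illustration, and only the 70Mbit peak comes from the thread:

```python
def packets_per_second(bits_per_second, packet_bytes):
    """Packets needed each second to carry the given bit rate."""
    return bits_per_second / (packet_bytes * 8)


RATE = 70_000_000  # the 70 Mbit/s peak reported in the thread

# Packet sizes are assumptions, not measurements from the thread.
small = packets_per_second(RATE, 100)
large = packets_per_second(RATE, 1500)

print(f"100-byte packets:  {small:,.0f} pps")
print(f"1500-byte packets: {large:,.0f} pps")
print(f"ratio: {small / large:.0f}x")
```

    Under these assumptions the same 70Mbit needs about fifteen times as many packets when they are small, which is why the bandwidth figure alone understates the load.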



  • Hi & thanks!

    Unfortunately I can't reproduce the high-load situation (high DNS queries/sec)

    but

    I probably need to upgrade my hardware (I have read all the documents about tuning)

    So, instead of my HP DL360 server with 2 embedded Broadcom NICs, what hardware do you recommend?

    Integrated Intel or a PCI-E add-on card?

    What is the best NIC? (model/chipset)

    AMD 16-core processor or Intel quad-core Xeon?

    Kind regards !!!



  • @bsd3000:

    I probably need to upgrade my hardware (I have read all the documents about tuning)

    So, instead of my HP DL360 server with 2 embedded Broadcom NICs, what hardware do you recommend?

    Integrated Intel or a PCI-E add-on card?

    What is the best NIC? (model/chipset)

    AMD 16-core processor or Intel quad-core Xeon?

    You can throw some more hardware at the problem in the hope it might make a difference but you really need to get more information on what was going on in order to correctly determine the solution. For example, if you have a rogue system (or systems) issuing floods of DNS requests it is unlikely that adding more cores or "server quality" NICs or more RAM will allow you to give "good" DNS response to other systems.
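
    One way to gather that information is to capture DNS traffic during a peak and count packets per source address. A rough Python sketch, assuming text lines in the form printed by `tcpdump -n` ("IP src.port > dst.port: ..."); the regular expression, the function name and the sample lines (documentation-range addresses) are all illustrative assumptions:

```python
import re
from collections import Counter

# Matches "IP <src>.<sport> > <dst>.<dport>:" as printed by `tcpdump -n`.
LINE = re.compile(r"\bIP6? (\S+)\.(\d+) > (\S+)\.(\d+):")


def top_dns_sources(lines, n=3):
    """Count packets sent to port 53 per source address, busiest first."""
    counts = Counter()
    for line in lines:
        m = LINE.search(line)
        if m and m.group(4) == "53":
            counts[m.group(1)] += 1
    return counts.most_common(n)


# Invented capture lines for illustration only.
capture = [
    "10:15:01.000001 IP 198.51.100.7.40001 > 192.0.2.53.53: 1+ A? example.com. (29)",
    "10:15:01.000002 IP 198.51.100.7.40002 > 192.0.2.53.53: 2+ A? example.com. (29)",
    "10:15:01.000003 IP 203.0.113.9.51000 > 192.0.2.53.53: 3+ A? example.org. (29)",
    "10:15:01.000004 IP 192.0.2.53.53 > 198.51.100.7.40001: 1 1/0/0 A 192.0.2.1 (45)",
    "10:15:01.000005 IP 198.51.100.7.40003 > 192.0.2.53.53: 4+ A? example.com. (29)",
]

for source, count in top_dns_sources(capture):
    print(f"{source:<16}{count}")
```

    A single source dominating the count would point towards a flood rather than a hardware limit.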

