High traffic irq problem (no storm)
-
Hello Guys
I have 2 servers running pfSense 2.0.1 in routing mode (MASTER/SLAVE with CARP)
for about 50 servers, most of them DNS servers.
The traffic is 'very high': about 50Mbit of constant traffic (a lot of UDP traffic).
Each pfSense box is a 3GHz Xeon with 4 GB of RAM (HP DL 360 server), using CARP (without state sync).
Each pfSense box has 3 network cards (2 integrated Broadcom gigabit (bge) and 1 PCI Intel (em)):
bge0 is the WAN (public network)
bge1 has 20 VLANs (servers)
em0 is a mng interface
Yesterday we had a peak of 70Mbit and pfsense hang/freeze!
In particular, the percentage of irq went up to around 50%.
vmstat -i
interrupt                 total       rate
irq1: atkbd0                 18          0
irq14: ata0                  68          0
irq16: uhci0                 17          0
irq23: ehci0                  2          0
irq24: ciss0           58807455          4
irq25: bge0          1667747890        123
irq26: bge1          1694685519        125
irq48: em0            777403204         57
cpu0: timer         27034549131       2000
Total               31233193304       2310
polling/tco and some other advanced tuning options do not provide improvements.
What can I do to handle 100Mbit of real internet traffic? :)
Thanks to anyone who can give me a hint.
-
I suppose you have already looked at the NIC optimization page about MBUFs and queues, and that you have changed your BIOS so that it is not set to be plug and play aware?
-
em0 is a mng interface
mng = management?
Yesterday we had a peak of 70Mbit and pfsense hang/freeze!
70Mbit (presumably 70Mbps) between WAN and servers? Reported by pfSense RRD graph, Traffic -> WAN? How long did the peak appear to last?
What did you observe that you now describe as hang/freeze? GUI doesn't respond? ssh session stalls? Console keyboard doesn't respond to Enter key? Console keyboard Caps Lock indicator doesn't change with presses of Caps Lock key? etc.
In particular, the percentage of irq went up to around 50%.
As reported by top? If so, what was identified as the major CPU user? (And if the system was truly "frozen", what would be reporting?)
vmstat -i
interrupt                 total       rate
irq1: atkbd0                 18          0
irq14: ata0                  68          0
irq16: uhci0                 17          0
irq23: ehci0                  2          0
irq24: ciss0           58807455          4
irq25: bge0          1667747890        123
irq26: bge1          1694685519        125
irq48: em0            777403204         57
cpu0: timer         27034549131       2000
Total               31233193304       2310
These interrupt rates are not significant, but because they are averaged since boot time, spikes won't show up here.
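One way to see spikes that the since-boot average hides is to take two `vmstat -i` snapshots a few seconds apart and divide the difference in totals by the interval. The following is only a minimal sketch of that idea: the helper names `parse_vmstat_i` and `interval_rates` are made up for illustration, the second snapshot is invented data, and the parser assumes each line ends with the total and rate columns as in the output above.

```python
def parse_vmstat_i(text):
    """Map each interrupt source to its cumulative count.

    Assumes lines shaped like 'irq25: bge0  1667747890  123', where the
    last two whitespace-separated fields are the total and the since-boot rate.
    """
    totals = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything without two trailing integers.
        if len(fields) < 3 or not (fields[-1].isdigit() and fields[-2].isdigit()):
            continue
        name = " ".join(fields[:-2])
        totals[name] = int(fields[-2])
    return totals


def interval_rates(before, after, seconds):
    """Interrupts per second for each source between two snapshots."""
    return {
        name: (after[name] - before[name]) / seconds
        for name in after
        if name in before
    }


# Hypothetical snapshots taken 10 seconds apart (second one is invented).
SAMPLE_T0 = """
interrupt total rate
irq25: bge0 1667747890 123
irq26: bge1 1694685519 125
"""
SAMPLE_T1 = """
interrupt total rate
irq25: bge0 1667827890 123
irq26: bge1 1694745519 125
"""

rates = interval_rates(parse_vmstat_i(SAMPLE_T0), parse_vmstat_i(SAMPLE_T1), 10)
for name, rate in sorted(rates.items()):
    print(f"{name}: {rate:.0f} interrupts/s")
```

In this made-up example the rate column still reads 123 and 125, while the interval rates come out in the thousands per second, which is the kind of difference the since-boot average would mask.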
polling/tco and some other advanced tuning options do not provide improvements.
Improvements as in doesn't hang/freeze?
What can I do to handle 100Mbit of real internet traffic? :)
The CPU should be able to forward 100Mbps without much effort at all. I suspect the problem might be a resource exhaustion problem. Perhaps you don't have enough firewall states for the UDP traffic. You can view state use history at Status -> RRD Graphs, System tab, States graphs.
Have you read http://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards ?
-
mng = management?
Yes
70Mbit (presumably 70Mbps) between WAN and servers? Reported by pfSense RRD graph, Traffic -> WAN? How long did the peak appear to last?
30 mins of 70Mbits (normally I have 50Mbits)
what did you observe that you now describe as hang/freeze? GUI doesn't respond? ssh session stalls? console keyboard doesn't respond to Enter key? Console keyboard Caps Lock indicator doesn't change with presses of Caps Lock key? etc
More than 40% packet loss from WAN to VLANs and from VLANs to WAN.
Console keyboard doesn't respond to the Enter key, or responds after some seconds.
GUI doesn't respond, or responds after some seconds.
as reported by top? If so, what was identified as major CPU user? (and if the system was truly "frozen" what would be reporting?)
Something like:
50.0% system (I use device polling)
50.0% interrupt
50.58% idlepoll
20.00% {irq26: bge1}
40.00% {irq25: bge0}
The CPU should be able to forward 100Mbps without much effort at all. I suspect the problem might be a resource exhaustion problem. Perhaps you don't have enough firewall states for the UDP traffic. You can view state use history at Status -> RRD Graphs, System tab, States graphs.
I see a peak of 100K states; normally I have:
Show states 15123/385000
MBUF Usage 7714/25600
Have you read http://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards
Yes
Thanks in advance I will try to tune better :/
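To put the reported figures against their limits, here is a small arithmetic sketch. It assumes "100K" means 100,000 states and that the MBUF reading was taken under normal load, not at the peak; the `utilization` helper is just an illustrative name.

```python
def utilization(used, limit):
    """Percentage of a resource limit that is currently in use."""
    return 100.0 * used / limit


# Figures quoted in the post; "100K" is read as 100,000 (an assumption).
readings = {
    "states (normal)": (15123, 385000),
    "states (peak)": (100000, 385000),
    "mbufs (normal)": (7714, 25600),
}

for label, (used, limit) in readings.items():
    print(f"{label}: {used}/{limit} = {utilization(used, limit):.1f}%")
```

On these numbers the state table stays well under its limit even at the peak, so the MBUF usage during the peak would be the more interesting figure to capture next time.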
-
The most obvious answer would be that the server is too slow to handle that amount of traffic. Take a look at CPU and memory usage when the problem is happening.
I assume this is a single core system? It's always recommended to have a dual core at least.
-
How can it be an HP DL 360 on one core? Unless it's running in a VM?
In which case I'm going to ask - why do people keep forgetting to mention virtualization layers in their freaking specs?
No way that machine should be struggling with <100Mbps.
Do not use device polling. Make sure polling is not active; I've found it can be a bit 'sticky' when I've tried it.
Steve
-
2 integrated Broadcom gigabit (bge) and 1 PCI Intel (em)
At the risk of sounding like a simpleton… Can you fit a couple of dual-port Intel PCIe NICs in there?
Go all Intel?
-
Re-reading this: if you've tried all the tuning options for your NICs, I'd next check the CARP interface.
If you fail over to the other box, does the situation change?
Steve
-
No way that machine should be struggling with <100Mbps.
Do not use device polling. Make sure polling is not active; I've found it can be a bit 'sticky' when I've tried it.
Steve
Well, enable some plugins and the 100Mbit speed is gone. Also, a single core is not really good for tasks like this, since there are always other things to do and making them wait does not help either. Disabling polling won't help much unless the card does not support it.
-
Device polling is not enabled by default and there is very little advantage to enabling it in almost every case. In most cases it makes things worse, sometimes a lot worse!
A bit old now but see: http://blog.pfsense.org/?p=115
And more recently: http://blog.pfsense.org/?p=115#comment-21378
I just noticed your traffic is almost all DNS. In that case the total bandwidth is probably less significant than the packets per second. A very high number of small packets will cause a high interrupt load.
Steve
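As a rough illustration of why packets per second matters more than bandwidth here, the same 70Mbit peak can be converted to a packet rate for two packet sizes. The sizes are assumptions chosen for illustration (about 100 bytes for a small DNS packet, 1500 bytes for a full-size frame), and link-layer overhead is ignored.

```python
def packets_per_second(bits_per_second, packet_bytes):
    """Packet rate needed to carry a given bit rate at a fixed packet size."""
    return bits_per_second / (packet_bytes * 8)


PEAK_BPS = 70_000_000  # the 70Mbit peak mentioned in the thread

# Packet sizes below are illustrative assumptions, not measurements.
for label, size in [("small DNS-sized packets", 100), ("full-size packets", 1500)]:
    pps = packets_per_second(PEAK_BPS, size)
    print(f"{label} ({size} bytes): about {pps:,.0f} packets/s")
```

Under these assumptions the small-packet case needs roughly fifteen times as many packets per second as the full-size case for the same bandwidth, and without interrupt moderation each packet can mean an interrupt.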
-
Hi & thanks!
Unfortunately I can't reproduce the high-load situation (high DNS queries/sec),
but I probably need to upgrade my hardware (I read all the documents about tuning).
So, instead of my HP DL360 server with 2 embedded Broadcom NICs, what hardware do you recommend?
Integrated Intel or PCI-E add-on card?
What is the best NIC (model/chipset)?
AMD 16-core processor or Intel quad-core Xeon?
Kind regards !!!
-
I probably need to upgrade my hardware (I read all the documents about tuning).
So, instead of my HP DL360 server with 2 embedded Broadcom NICs, what hardware do you recommend?
Integrated Intel or PCI-E add-on card?
What is the best NIC (model/chipset)?
AMD 16-core processor or Intel quad-core Xeon?
You can throw some more hardware at the problem in the hope it might make a difference but you really need to get more information on what was going on in order to correctly determine the solution. For example, if you have a rogue system (or systems) issuing floods of DNS requests it is unlikely that adding more cores or "server quality" NICs or more RAM will allow you to give "good" DNS response to other systems.