High traffic irq problem (no storm)
-
mng = management?
Yes
70Mbit (presumably 70Mbps) between WAN and servers? Reported by pfSense RRD graph, Traffic -> WAN? How long did the peak appear to last?
30 minutes at 70Mbit/s (normally I see 50Mbit/s)
what did you observe that you now describe as hang/freeze? GUI doesn't respond? ssh session stalls? console keyboard doesn't respond to Enter key? Console keyboard Caps Lock indicator doesn't change with presses of Caps Lock key? etc
More than 40% packet loss from WAN to VLANs and from VLANs to WAN
Console keyboard doesn't respond to the Enter key, or responds after some seconds
GUI doesn't respond, or responds after some seconds
as reported by top? If so, what was identified as major CPU user? (and if the system was truly "frozen" what would be reporting?)
Something like:
50.0% system (I use device polling)
50.0% interrupt
50.58% idlepoll
20.00% {irq26: bge1}
40.00% {irq25: bge0}
The CPU should be able to forward 100Mbps without much effort at all. I suspect the problem might be a resource exhaustion problem. Perhaps you don't have enough firewall states for the UDP traffic. You can view state use history at Status -> RRD Graphs, System tab, States graphs.
I see a peak of 100K states. Normally I have:
Show states 15123/385000
MBUF Usage 7714/25600
Have you read http://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards?
Yes.
Thanks in advance, I will try to tune it better :/
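For reference, the state and mbuf figures quoted above can be turned into utilization percentages to check the resource exhaustion theory. This is only a rough sketch: it assumes "100K" means 100000 states, and the mbuf figure was not taken during the peak.

```python
def utilization(usage: str) -> float:
    """Return used/limit as a percentage for a 'used/limit' string."""
    used, limit = (int(part) for part in usage.split("/"))
    return 100.0 * used / limit

# Figures quoted in this thread; 100000 is an assumed value for "100K".
print(f"states (normal): {utilization('15123/385000'):.1f}%")
print(f"states (peak):   {utilization('100000/385000'):.1f}%")
print(f"mbufs (normal):  {utilization('7714/25600'):.1f}%")
```

Taken at face value, the state table peaks at roughly a quarter of its limit, so state exhaustion alone looks unlikely. The mbuf usage at the time of the problem is not known.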
-
The most obvious answer would be that the server is too slow to handle that amount of traffic. Take a look at CPU and memory usage when the problem is happening.
I assume this is a single-core system? It's always recommended to have at least a dual core.
-
How can an HP DL360 have only one core? Unless it's running in a VM?
In which case I'm going to ask: why do people keep forgetting to mention virtualization layers in their freaking specs?
No way that machine should be struggling with <100Mbps.
Do not use device polling. Make sure polling is not active, I've found it can be a bit 'sticky' when I've tried it.
Steve
-
2 integrated Broadcom gigabit (bge) and 1 PCI Intel (em)
At the risk of sounding like a simpleton… Can you fit a couple of dual-port Intel PCIe NICs in there?
Go all Intel?
-
Re-reading this, if you've tried all the tuning options for your NICs, I'd next check the CARP interface.
If you fail over to the other box, does the situation change?
Steve
-
No way that machine should be struggling with <100Mbps.
Do not use device polling. Make sure polling is not active, I've found it can be a bit 'sticky' when I've tried it.
Steve
Well, enable a few packages and the 100Mbit throughput is gone. A single core is also not great for a task like this, since there are always other things to do, and making them wait doesn't help either. Disabling polling won't help much unless the card does not support it.
-
Device polling is not enabled by default and there is very little advantage to enabling it in almost every case. In most cases it makes things worse, sometimes a lot worse!
A bit old now but see: http://blog.pfsense.org/?p=115
And more recently: http://blog.pfsense.org/?p=115#comment-21378
I just noticed your traffic is almost all DNS. In that case the total bandwidth is probably less significant than the packets per second. A very high number of small packets will cause a high interrupt load.
Steve
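To put rough numbers on the packets-per-second point, here is a back-of-the-envelope sketch. The 70Mbit/s rate is the peak reported in this thread. The packet sizes are assumptions (about 100 bytes for a small DNS packet, 1500 bytes for a full-size one), and framing overhead is ignored.

```python
def packets_per_second(bits_per_second: float, packet_bytes: int) -> float:
    """Packets per second needed to fill a given bit rate at a given packet size."""
    return bits_per_second / (packet_bytes * 8)

RATE = 70e6  # 70 Mbit/s, the peak reported in this thread

# Assumed sizes: a small DNS packet vs. a full-size 1500-byte packet.
for size in (100, 1500):
    print(f"{size:>5} bytes: {packets_per_second(RATE, size):,.0f} packets/s")
```

Under those assumptions the same 70Mbit/s is about 87,500 packets/s with small packets versus about 5,800 packets/s with large ones, roughly 15 times the per-packet work for the same bandwidth.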
-
Hi & thanks!
Unfortunately I can't reproduce the high-load situation (high DNS queries/sec),
but I probably need to upgrade my hardware (I have read all the documentation about tuning).
So, instead of my HP DL360 server with 2 embedded Broadcom NICs, what hardware do you recommend?
Integrated Intel or a PCI-E add-on card?
What is the best NIC (model/chipset)?
AMD 16-core processor or Intel quad-core Xeon?
Kind regards !!!
-
I probably need to upgrade my hardware (I have read all the documentation about tuning).
So, instead of my HP DL360 server with 2 embedded Broadcom NICs, what hardware do you recommend?
Integrated Intel or a PCI-E add-on card?
What is the best NIC (model/chipset)?
AMD 16-core processor or Intel quad-core Xeon?
You can throw some more hardware at the problem in the hope it might make a difference but you really need to get more information on what was going on in order to correctly determine the solution. For example, if you have a rogue system (or systems) issuing floods of DNS requests it is unlikely that adding more cores or "server quality" NICs or more RAM will allow you to give "good" DNS response to other systems.
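One way to look for such a rogue system would be to capture DNS traffic during the next peak and count queries per source address. The sketch below assumes text lines in the usual `tcpdump -n` layout for IPv4 (timestamp, `IP`, `src.port > dst.port:`). The helper name `top_dns_clients` and the sample lines are made up for illustration and are not from this thread.

```python
import re
from collections import Counter

# Captures the source address of IPv4 packets sent to port 53, assuming
# `tcpdump -n` style lines such as:
# "12:00:01.000000 IP 192.0.2.10.53124 > 198.51.100.1.53: 1234+ A? example.com. (29)"
LINE_RE = re.compile(r"\bIP (\d+\.\d+\.\d+\.\d+)\.\d+ > \d+\.\d+\.\d+\.\d+\.53:")

def top_dns_clients(lines, n=3):
    """Return the n source addresses sending the most DNS queries."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(n)

# Made-up sample capture: three queries from one host, one from another,
# and one response (which is not counted because its destination port is not 53).
sample = [
    "12:00:01.000000 IP 192.0.2.10.53124 > 198.51.100.1.53: 1+ A? a.example. (27)",
    "12:00:01.000100 IP 192.0.2.10.53125 > 198.51.100.1.53: 2+ A? b.example. (27)",
    "12:00:01.000200 IP 192.0.2.20.40000 > 198.51.100.1.53: 3+ A? c.example. (27)",
    "12:00:01.000300 IP 198.51.100.1.53 > 192.0.2.10.53124: 1 1/0/0 A 203.0.113.5 (43)",
    "12:00:01.000400 IP 192.0.2.10.53126 > 198.51.100.1.53: 4+ A? d.example. (27)",
]

for address, queries in top_dns_clients(sample):
    print(f"{address}: {queries} queries")
```

If one or two addresses account for most of the queries during a peak, that points at the clients rather than at the firewall hardware.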