Is managing the state table taking up all of my CPU?
-
One of my firewalls, running 2.1-RC1, is under high CPU load today. I can see that the number of states in the state table is maxed out. I'm seeing the following:
top -SH
last pid: 15560; load averages: 8.02, 7.74, 6.92 up 0+07:59:57 15:34:36
204 processes: 11 running, 129 sleeping, 54 waiting, 10 lock
CPU: 0.0% user, 1.9% nice, 3.3% system, 78.8% interrupt, 16.0% idle
Mem: 718M Active, 52M Inact, 1297M Wired, 68K Cache, 134M Buf, 3807M Free
Swap: 16G Total, 16G Free

PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
12 root -44 - 0K 1008K *pf ta 1 294:17 80.86% intr{swi1: netisr 1}
12 root -44 - 0K 1008K *pf ta 7 293:34 80.66% intr{swi1: netisr 7}
12 root -44 - 0K 1008K *pf ta 5 293:02 79.79% intr{swi1: netisr 5}
12 root -44 - 0K 1008K *pf ta 2 294:46 79.05% intr{swi1: netisr 2}
12 root -44 - 0K 1008K *pf ta 3 289:17 79.05% intr{swi1: netisr 3}
12 root -44 - 0K 1008K *pf ta 6 294:15 78.76% intr{swi1: netisr 6}
12 root -44 - 0K 1008K *pf ta 0 265:07 78.08% intr{swi1: netisr 0}
12 root -44 - 0K 1008K *pf ta 4 263:36 76.95% intr{swi1: netisr 4}

pfctl -si
State Table Total Rate
current entries 1400004
searches 3972651339 137657.3/s
inserts 742211929 25718.6/s
removals 740811925 25670.0/s
Counters
match 773239769 26793.7/s
bad-offset 0 0.0/s
fragment 0 0.0/s
short 0 0.0/s
normalize 0 0.0/s
memory 28656396 993.0/s
bad-timestamp 0 0.0/s
congestion 0 0.0/s
ip-option 0 0.0/s
proto-cksum 1 0.0/s
state-mismatch 5102495 176.8/s
state-insert 0 0.0/s
state-limit 0 0.0/s
src-limit 0 0.0/s
synproxy 0 0.0/s
divert 0 0.0/s

I've configured a maximum of 1400000 firewall states. It looks like the system is just busy dealing with the state table. Is that right? Any suggestions as to how pfSense can better handle this situation?
Thanks!
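
The counters above allow a rough back-of-the-envelope check. This is a minimal sketch, assuming the table is close to steady state so that Little's law (entries = insert rate x average lifetime) applies. Note that the rates `pfctl -si` prints are averages over the time pf has been running (about 8 hours here), not instantaneous values, so the result is only indicative. The function and variable names are made up for illustration.

```python
# Figures copied from the `pfctl -si` output in this post.
current_entries = 1_400_004
inserts_per_s = 25_718.6
removals_per_s = 25_670.0
memory_drops_per_s = 993.0  # "memory" counter rate


def mean_state_lifetime(entries: float, insert_rate: float) -> float:
    """Little's law rearranged: lifetime = entries / insert rate.

    Assumes the table is roughly in steady state, which is only an
    approximation for a table pinned at its configured limit.
    """
    return entries / insert_rate


lifetime_s = mean_state_lifetime(current_entries, inserts_per_s)
net_growth_per_s = inserts_per_s - removals_per_s

print(f"average state lifetime: {lifetime_s:.1f} s")
print(f"net table growth:       {net_growth_per_s:.1f} states/s")
print(f"'memory' counter rate:  {memory_drops_per_s:.0f} per second")
```

Under these assumptions a state lives for a little under a minute on average. The non-zero "memory" counter most likely reflects packets for which pf could not allocate a new state, which is what one would expect with the table pinned at its limit, though that reading is an interpretation rather than something the output states directly.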
-
How much traffic runs through this box?
-
Is there a reason you are still running RC1? If not, please upgrade to 2.1 stable. Chances are the problem will be gone. If not, it will be a thousand times easier to debug.
-
In terms of data, this firewall peaks at about 425 Mbps In+Out. In terms of packets, it typically peaks at about 70 kpps In+Out; yesterday it was hitting close to 100 kpps In+Out.
-
I've got several instances of pfSense ranging from 2.0.1-release up through 2.1-release. My experience has been that I don't see much difference between 2.1-RC1 and 2.1-release. I just need to find a time to upgrade, our business runs high volume 24x7.
I tend to see the problem most when there is high state churn, no matter the version of pfSense.
-
1400000 is a lot of states. ;)
Do you still have the firewall optimisation set to 'normal'? You could try setting it to 'aggressive' so that it times out firewall states quicker. However as the warning says you may end up dropping some legitimate states.
You could try the adaptive timeout settings. Though I have no experience of using those at all, they seem relevant here.
Steve
-
Thanks, Steve. I've always had to use the 'aggressive' setting. Pre-2.1 versions had the adaptive settings on by default, and they used to kick in at 60%. In 2.1, the adaptive settings are off by default. I always found that the adaptive settings kicked in too soon, so I kind of like not having them. It puts off the pain a little longer in my case.
Yes, 1.4M is a lot of states. Having that many is unusual and undesirable in my case. In addition to the firewall, I use the load balancer in pfSense. A single connection from the WAN, through the load balancer, to a web server sets up a bunch of states.
We have an application with an interface that gets tons of html queries, over and over. We encourage connection reuse but sometimes we get hit with thousands of queries per second, each a new and short-lived connection. Setting up and tearing down thousands of connections a second seems to drive CPU use - I'm guessing related to managing the state table.
I guess I've never tried setting a ridiculous number of maximum states and just let the table grow. Seems like if I set it too high then I get 100% CPU and I've had to power cycle the box in the past (no remote KVM access).
-
So is this actually causing a problem?
You seem to have plenty of RAM in that box so you could increase the state table size. I wouldn't have thought the size of the table would increase CPU usage as much as the rate of states added and removed, which would stay the same. There would come a point where the table could not be maxed out due to state decay matching the new state rate. That point might be ridiculously large though! I would have thought you could achieve a balance using the adaptive timeout settings. If you set the initial number quite low (perhaps half the table size, total guess) and the maximum number to some value larger than the table you could get a very gradual roll off as the table filled.
This is way outside my experience though. ;)
Steve
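
To make the adaptive timeout suggestion above concrete: pf.conf(5) describes adaptive timeouts as a linear scaling of all timeout values by (adaptive.end - number of states) / (adaptive.end - adaptive.start) once the state count exceeds adaptive.start. The sketch below only reproduces that arithmetic. The start and end values (half the table, and twice the table) are hypothetical picks in the spirit of the "total guess" above, and the function name is invented for illustration.

```python
def adaptive_scale(states: int, start: int, end: int) -> float:
    """Timeout scaling factor per the linear rule in pf.conf(5).

    Below `start` timeouts are unscaled (factor 1.0); at or above `end`
    the factor reaches 0.0, meaning states expire immediately.
    """
    if states <= start:
        return 1.0
    if states >= end:
        return 0.0
    return (end - states) / (end - start)


# Hypothetical settings: table limit 1.4M, start at half of it,
# end at twice the limit so the roll-off stays gradual.
limit = 1_400_000
start = limit // 2   # 700,000 (assumption)
end = limit * 2      # 2,800,000 (assumption)

for states in (500_000, 700_000, 1_050_000, 1_400_000):
    factor = adaptive_scale(states, start, end)
    print(f"{states:>9} states -> timeouts scaled to {factor:.0%}")
```

With these made-up numbers, timeouts would still be at about two thirds of their configured values when the table is full. By comparison, the 60% start mentioned earlier in the thread, paired with an end at 120% of the limit, would cut them to one third at the same point.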
-
I've never been clear if I'm dealing with a pure packets-per-second problem (incoming packets driving a lot of interrupts) or a state table problem (too much state churn) or a combination of both. The key part for me in this post is what I see under STATE in the output from top, it shows "*pf ta" - as I understand it this means that the CPU is waiting on the pf process for something. I'm guessing the "ta" part relates to the state table.