Is managing the state table taking up all of my CPU?



  • One of my firewalls, running 2.1-RC1, is under high CPU load today.  I can see that the number of states in the state table is maxed out.  I'm seeing the following:

    top -SH
    last pid: 15560;  load averages:  8.02,  7.74,  6.92                                                      up 0+07:59:57  15:34:36
    204 processes: 11 running, 129 sleeping, 54 waiting, 10 lock
    CPU:  0.0% user,  1.9% nice,  3.3% system, 78.8% interrupt, 16.0% idle
    Mem: 718M Active, 52M Inact, 1297M Wired, 68K Cache, 134M Buf, 3807M Free
    Swap: 16G Total, 16G Free

    PID USERNAME PRI NICE  SIZE    RES STATE  C  TIME  WCPU COMMAND
      12 root    -44    -    0K  1008K *pf ta  1 294:17 80.86% intr{swi1: netisr 1}
      12 root    -44    -    0K  1008K *pf ta  7 293:34 80.66% intr{swi1: netisr 7}
      12 root    -44    -    0K  1008K *pf ta  5 293:02 79.79% intr{swi1: netisr 5}
      12 root    -44    -    0K  1008K *pf ta  2 294:46 79.05% intr{swi1: netisr 2}
      12 root    -44    -    0K  1008K *pf ta  3 289:17 79.05% intr{swi1: netisr 3}
      12 root    -44    -    0K  1008K *pf ta  6 294:15 78.76% intr{swi1: netisr 6}
      12 root    -44    -    0K  1008K *pf ta  0 265:07 78.08% intr{swi1: netisr 0}
      12 root    -44    -    0K  1008K *pf ta  4 263:36 76.95% intr{swi1: netisr 4}

    pfctl -si
    State Table                          Total            Rate
      current entries                  1400004
      searches                      3972651339      137657.3/s
      inserts                        742211929        25718.6/s
      removals                      740811925        25670.0/s
    Counters
      match                          773239769        26793.7/s
      bad-offset                            0            0.0/s
      fragment                              0            0.0/s
      short                                  0            0.0/s
      normalize                              0            0.0/s
      memory                          28656396          993.0/s
      bad-timestamp                          0            0.0/s
      congestion                            0            0.0/s
      ip-option                              0            0.0/s
      proto-cksum                            1            0.0/s
      state-mismatch                  5102495          176.8/s
      state-insert                          0            0.0/s
      state-limit                            0            0.0/s
      src-limit                              0            0.0/s
      synproxy                              0            0.0/s
      divert                                0            0.0/s

    I've configured a maximum of 1400000 firewall states.  It looks like the system is just busy dealing with the state table; is that right?  Any suggestions as to how pfSense can better handle this situation?

    Thanks!
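
As a rough way to read the figures above, here is a minimal Python sketch that parses a few lines copied from the quoted `pfctl -si` output and derives the state churn, the net table growth, and an implied average state lifetime. The function names and the regular expression are illustrative assumptions, not part of pfSense or pfctl. The rates printed by `pfctl -si` appear to be averages since the counters were last reset (the totals divided by the rates come out to roughly the 8 hours of uptime shown by `top`), so the lifetime figure is only a ballpark estimate using Little's law (entries ≈ insert rate × average lifetime).

```python
import re

# Lines copied from the `pfctl -si` output quoted above (subset of counters).
SAMPLE = """\
State Table                          Total            Rate
  current entries                  1400004
  searches                      3972651339      137657.3/s
  inserts                        742211929        25718.6/s
  removals                      740811925        25670.0/s
Counters
  match                          773239769        26793.7/s
  memory                          28656396          993.0/s
  state-mismatch                  5102495          176.8/s
"""

# Indented line: counter name, integer total, optional "<rate>/s" column.
LINE = re.compile(r"^\s+([a-z][a-z -]*?)\s+(\d+)(?:\s+([\d.]+)/s)?\s*$")


def parse_pfctl_info(text):
    """Return {name: (total, rate_or_None)} for each counter line."""
    stats = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            name, total, rate = m.groups()
            stats[name] = (int(total), float(rate) if rate else None)
    return stats


def summarize(stats):
    """Derive churn, net growth and a Little's-law lifetime estimate."""
    entries = stats["current entries"][0]
    ins_rate = stats["inserts"][1]
    rem_rate = stats["removals"][1]
    return {
        "churn_per_s": ins_rate + rem_rate,        # state setups + teardowns
        "net_growth_per_s": ins_rate - rem_rate,   # how fast the table fills
        "avg_state_lifetime_s": entries / ins_rate,
    }


if __name__ == "__main__":
    for key, value in summarize(parse_pfctl_info(SAMPLE)).items():
        print(f"{key}: {value:.1f}")
```

On these numbers the table sees roughly 51,000 state insertions plus removals per second on average, and each state lives for something like 54 seconds, which suggests the load comes from churn rather than from long-lived connections.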



  • How much traffic runs through this box?



  • Is there a reason you are still running RC1? If not, please upgrade to 2.1 stable. Chances are the problem will be gone; if not, it's a thousand times easier to debug.



  • @Jason

    In terms of data, this firewall peaks at about 425 Mbps In+Out.  In terms of packets, it typically peaks at about 70 kpps In+Out; yesterday it was hitting close to 100 kpps In+Out.



  • @heper

    I've got several instances of pfSense ranging from 2.0.1-release up through 2.1-release.  My experience has been that I don't see much difference between 2.1-RC1 and 2.1-release.  I just need to find a time to upgrade; our business runs high volume 24x7.

    I tend to see the problem most when there is high state churn, no matter the version of pfSense.


  • Netgate Administrator

    1400000 is a lot of states.  ;)
    Do you still have the firewall optimisation set to 'normal'? You could try setting it to 'aggressive' so that it times out firewall states more quickly. However, as the warning says, you may end up dropping some legitimate states.
    You could try the adaptive timeout settings. Though I have no experience of using those at all, they seem relevant here.

    Steve



  • Thanks, Steve.  I've always had to use the 'aggressive' setting.  Pre-2.1 versions had the adaptive settings on by default; they used to kick in at 60%.  In 2.1, the adaptive settings are off by default.  I always found that the adaptive settings kicked in too soon, so I kind of like not having them - it puts off the pain a little longer in my case.

    Yes, 1.4M is a lot of states.  Having that many is unusual and undesirable in my case.  In addition to the firewall, I use the load balancer in pfSense.  A single connection from the WAN, through the load balancer, to a web server sets up a bunch of states.

    We have an application with an interface that gets tons of HTTP queries, over and over.  We encourage connection reuse, but sometimes we get hit with thousands of queries per second, each a new and short-lived connection.  Setting up and tearing down thousands of connections a second seems to drive CPU use - I'm guessing this is related to managing the state table.

    I guess I've never tried setting a ridiculous number of maximum states and just letting the table grow.  It seems that if I set it too high I get 100% CPU, and I've had to power cycle the box in the past (no remote KVM access).


  • Netgate Administrator

    So is this actually causing a problem?
    You seem to have plenty of RAM in that box, so you could increase the state table size. I wouldn't have thought the size of the table would increase CPU usage as much as the rate of states added and removed, which would stay the same. There would come a point where the table could not be maxed out because states expire as fast as new ones are created. That point might be ridiculously large though! I would have thought you could achieve a balance using the adaptive timeout settings. If you set the adaptive start value quite low (perhaps half the table size, total guess) and the adaptive end value to something larger than the table, you could get a very gradual roll-off as the table fills.

    This is way outside my experience though.  ;)

    Steve
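
To make the adaptive timeout idea concrete, here is a small Python sketch of the scaling rule described in pf.conf(5): once the number of states exceeds `adaptive.start`, all timeouts are scaled linearly by (adaptive.end - number of states) / (adaptive.end - adaptive.start), reaching zero at `adaptive.end`. The start value of 700000 follows the "half the table size" guess above; the end value of 2000000 and the 60 second base timeout are purely hypothetical numbers chosen for illustration, not recommended settings.

```python
def adaptive_scale(states, start, end):
    """Timeout scale factor for a given state count (pf.conf adaptive rule)."""
    if states <= start:
        return 1.0          # below adaptive.start: timeouts are unchanged
    if states >= end:
        return 0.0          # at adaptive.end: timeouts drop to zero
    return (end - states) / (end - start)


if __name__ == "__main__":
    START = 700_000         # assumed: half of the 1400000 state limit
    END = 2_000_000         # hypothetical: larger than the state limit
    BASE_TIMEOUT_S = 60     # hypothetical base timeout, for illustration only

    for states in (500_000, 700_000, 1_000_000, 1_400_000):
        factor = adaptive_scale(states, START, END)
        print(f"{states:>9} states: factor {factor:.2f}, "
              f"timeout {BASE_TIMEOUT_S * factor:.0f}s")
```

With these assumed values the timeouts are still at about 46% of their configured length when the table is completely full, so the roll-off stays gradual rather than purging everything at once.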



  • I've never been clear whether I'm dealing with a pure packets-per-second problem (incoming packets driving a lot of interrupts), a state table problem (too much state churn), or a combination of both.  The key part for me in this post is what I see under STATE in the output from top: it shows "*pf ta".  As I understand it, this means that the CPU is waiting on the pf process for something.  I'm guessing the "ta" part relates to the state table.