PF States limit reached.

xciter327

I am trying to troubleshoot the following issue. I have a firewall that occasionally stops working with the above message. When that happens, the gateway goes down and all traffic via the firewall stops. Gateway monitoring is disabled and state killing on gateway down is disabled.

It was my understanding that "Firewall adaptive timeouts" are supposed to deal with this, but that does not seem to be the case. I have the firewall configured for 5 million states (the system has 8GB of RAM), with a "lower adaptive timeout value" of 3 million and a "higher adaptive timeout value" of 4 million. If I am reading the documentation correctly, it should start scaling the timeouts down at 3 million states and set them to "0" at 4 million. This does not seem to happen. When I check the number of active states, it shows a value of 5 million+ even though the gateway is down. The only thing that works on the WAN is ARP.
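For reference, the scaling described above can be sketched in a few lines of Python. This is only an illustration of the linear formula given in the pf.conf documentation, factor = (adaptive.end - states) / (adaptive.end - adaptive.start), not the kernel code itself. The function name and the 86400 second base timeout are assumptions chosen for the example.

```python
def scaled_timeout(base_timeout, states, adaptive_start, adaptive_end):
    """Effective timeout in seconds under pf-style linear adaptive scaling.

    Hypothetical helper for illustration; names are not taken from pf.
    """
    if states <= adaptive_start:
        # Below the lower threshold, timeouts are not scaled.
        return base_timeout
    if states >= adaptive_end:
        # At or above the upper threshold, every timeout becomes zero.
        return 0
    # Between the thresholds, scale linearly (integer arithmetic).
    return base_timeout * (adaptive_end - states) // (adaptive_end - adaptive_start)


if __name__ == "__main__":
    start, end = 3_000_000, 4_000_000  # thresholds from the post above
    base = 86400  # assumed base timeout in seconds
    for states in (2_000_000, 3_500_000, 3_900_000, 4_000_000):
        print(states, scaled_timeout(base, states, start, end))
```

With these thresholds, a timeout would be halved at 3.5 million states and reach zero at 4 million, which is why a table of 5 million+ states looks inconsistent with the configuration.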

Besides ICMP (echo reply, echo request, info reply, unreachable, parameter problem, packet too big), the firewall does not reply to any other external traffic. Access to management ports is blocked via a floating rule and an alias.

Normally the firewall only goes to 20k states.

[screenshot]

Now I guess I have a couple of options:

Set "Max. src. states" or "Max. src. conn. Rate".
Enable state killing on gateway down (not very keen on this, as we've had issues with gateway monitoring before, so it's currently disabled).
Other?

Any suggestions would be welcome.

heper

Is this a huge network that needs 5 million states? Or is there a malfunction?

xciter327

Not huge. A couple of hundred clients NAT-ed on a /29 + IPv6 deployment. I think they call it dual stack lite.

beatvjiking

I set max src states to 8192 on my networks. With a few hundred devices, even with dual-stack, you're seeing way too many. I've NATed for thousands of devices and not seen that many states.

beatvjiking

I've seen state tables that size only in instances when malware is in play, or in one case, when an intern for a well-known antimalware company wrote a naive script querying their entire list of malicious domains with no limits on queries per second.

You may also want to try setting firewall optimization to "aggressive" but the preferable option is to limit max src states.

xciter327

My thinking exactly. This is a student network, so god knows what they are trying to do.

Interestingly enough, some time ago I was doing tests with hping and a packet generator (pktgen, I think), and I managed to fully load the device (full CPU, full state table, interfaces at capacity). Normally the device recovered after I stopped the test. It crashed only once in many tests, and I could not reproduce that. This full lockup is something I've never managed to reproduce.

This is exactly why the state limit was raised. I've tested it up to 5M states with no issues.

A Former User

I read the doco the same way. Why doesn't the firewall just start nuking sessions straight away? It shouldn't be possible to hit 5M with your config.

I agree that's not the right solution for your problem, but regardless, shouldn't this be working for you?

xciter327

That was my thinking exactly. I've just added "max src states" to all the firewall rules (which are all pass rules).

xciter327

Also, the firewall had a kernel panic on reboot (I decided to reboot it because the graphs were not working).

[screenshot]

I checked in /var/crashes and there was no dump.

heper

So this problem happens every 200 days or so? Uptime in screenshot.....

xciter327

No, it has happened once a day for the last 3 days. I don't reboot the firewall, I just flush the states: "pfctl -F states all"

xciter327

Just wanted to report it has not happened since I put the limits on.

beatvjiking

You can probably find in your logs which device(s) are attempting to open so many sessions and address whatever is happening, e.g. malware or what have you.
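If the logs don't make it obvious, one rough way to spot the offender is to count states per source address from a saved state dump (for example the output of "pfctl -ss"). The sketch below is only a starting point: the line layout it assumes (interface, protocol, endpoint, arrow, endpoint, state) and the sample lines are assumptions for illustration, and the real output differs between versions and for NAT-ed states. The function names are hypothetical.

```python
from collections import Counter


def strip_port(endpoint):
    """Drop the port from 'a.b.c.d:port' or 'v6addr[port]' (assumed formats)."""
    endpoint = endpoint.strip("()")
    if endpoint.endswith("]") and "[" in endpoint:
        return endpoint[:endpoint.index("[")]
    if endpoint.count(":") > 1:
        # Looks like a bare IPv6 address without a port.
        return endpoint
    host, sep, _port = endpoint.rpartition(":")
    return host if sep else endpoint


def count_states_by_source(lines):
    """Count state lines per source address.

    Assumes '->' means the left endpoint opened the state and
    '<-' means the endpoint right after the arrow did.
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        if len(tokens) < 5:
            continue
        if "->" in tokens:
            source = tokens[2]
        elif "<-" in tokens:
            source = tokens[tokens.index("<-") + 1]
        else:
            continue
        counts[strip_port(source)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines, not real output from this firewall.
    sample = [
        "all tcp 192.168.1.10:51234 -> 93.184.216.34:443   ESTABLISHED:ESTABLISHED",
        "all udp 192.168.1.10:40000 -> 8.8.8.8:53   MULTIPLE:SINGLE",
        "all tcp 192.168.1.1:22 <- 192.168.1.20:50522   ESTABLISHED:ESTABLISHED",
        "all tcp 2001:db8::10[51234] -> 2001:db8::1[443]   ESTABLISHED:ESTABLISHED",
    ]
    for address, n in count_states_by_source(sample).most_common(10):
        print(n, address)
```

Sorting by count and looking at the top few addresses is usually enough to see whether one client, or a handful, accounts for most of the table.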

xciter327

Just to report that it happened again. As I see it, there are two possibilities: either adaptive timeouts are not working, or the device is somehow running out of memory. I can see in the monitoring graph that when the states reach roughly 900k the device becomes unresponsive. I've set much lower adaptive timeouts now and set max states to 5 million (8 GB RAM). Max src states is at 8096 on each firewall rule.
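As a rough sanity check on the memory theory, the state table size can be estimated from the commonly quoted rule of thumb of about 1 KB of RAM per state. That figure is an assumption here, not something measured on this box, and the helper name is made up for the example.

```python
BYTES_PER_STATE = 1024  # assumption: rule-of-thumb figure, not measured


def state_table_memory_gib(states, bytes_per_state=BYTES_PER_STATE):
    """Approximate RAM used by the state table, in GiB."""
    return states * bytes_per_state / 2**30


if __name__ == "__main__":
    for states in (20_000, 900_000, 5_000_000):
        print(f"{states:>9} states ~ {state_table_memory_gib(states):.2f} GiB")
```

Under that assumption, 900k states would be well under 1 GiB and a full 5 million would be a bit under 5 GiB, so the state table alone would not obviously exhaust 8 GB at the point where the graphs show the box becoming unresponsive.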

If anybody has a suggestion on how to simulate a lot of connection states from multiple IPs, I would love to hear it.

SteveITS

You're not alone: https://www.google.com/search?client=firefox-b-1-d&channel=cus&q=pfsense+pf+states+limit+reached

This one mentions a Spiceworks scan of a large subnet:
https://forum.netgate.com/topic/81059/zone-pf-states-pf-states-limit-reached-how-to-find-the-offender/10
Given that post (a simultaneous scan), how often does pfSense process/change/update the adaptive timeouts? (Instantly, every 5 minutes, etc.)

beatvjiking

When I recommended 8k states, that was a very high ceiling. It works well in my environment but in most environments that can be far far lower with no negative impacts on user experience. 512 is a reasonable limit to impose on your allow rules. You may want to try that as an alternative to more RAM :)
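To put the two limits side by side, here is a back-of-the-envelope calculation of the worst-case table size if every client hit its per-source cap at once. The client count of 300 and the single pass rule are assumptions for illustration only. The limit applies per rule, so more rules raise the ceiling, and an IPv6 client with several addresses counts once per address.

```python
def worst_case_states(clients, max_src_states, rules=1):
    """Upper bound on states if every source hits its cap on every rule."""
    return clients * max_src_states * rules


if __name__ == "__main__":
    clients = 300  # assumption: "a couple of hundred clients"
    for cap in (8192, 512):
        print(f"cap {cap}: up to {worst_case_states(clients, cap):,} states")
```

With these assumed numbers, a cap of 8192 still allows roughly 2.4 million states in the worst case, while 512 keeps the ceiling around 150k.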

Derelict

Is this running on Hyper-V?

xciter327

Appreciate your reply. It is on a physical box: a Supermicro Atom C2750.