UDP States Killed Before Expiration

giankso

Hi,
we are having trouble with UDP connections being closed prematurely.

pfSense is already running with the Conservative firewall optimization setting, but the problem remains: states are dropped before they expire, and applications constantly have to reconnect to SIP services.
After checking with pftop, we suspect a serious bug in timeout handling, at least for UDP: states with more than 13 minutes left before expiry are suddenly closed.

These connections sit idle for approximately 6 minutes between exchanges. No drop occurs when there is continuous traffic, which is why I am pointing at the expiration timer.
What could cause a UDP state to be dropped before its expiration when the state table is almost empty (less than 2% full)?

OPNsense does not provide a per-rule state timeout, but pfSense does, and after testing it yesterday I can say the issue occurs on both. In both products you can sometimes see the timer far from expiry, and the state still gets dropped for no apparent reason.

Should I file a bug? Could a bug within pf itself be the cause of this?
Thanks a lot

jimp

Do you have the option set to kill states on gateway failure?

The main ways states get killed in the base system before their timers run out are:

States cleared manually
States killed during gateway events if configured to do so (System > Advanced, Misc tab, under Gateway Monitoring)
States killed early due to adaptive state timeouts when the table is close to full
States killed on a pfsync peer which get deleted via pfsync
States killed when a host triggers some kind of protection mechanism (e.g. IDS/IPS block, repeated auth failures, etc)
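
For reference, the adaptive timeout item above follows the linear scaling described in pf.conf(5): once the number of states exceeds adaptive.start, every timeout is multiplied by (adaptive.end - states) / (adaptive.end - adaptive.start), and by default the two thresholds sit at 60% and 120% of the state limit. A rough Python sketch of that arithmetic follows; the function name and the 100,000-entry state limit are made up for illustration and are not taken from any real configuration.

```python
def adaptive_timeout(base_timeout, states, adaptive_start, adaptive_end):
    """Scale a timeout (in seconds) as pf.conf(5) describes for adaptive timeouts."""
    if states <= adaptive_start:
        # Below the start threshold nothing is scaled.
        return base_timeout
    if states >= adaptive_end:
        # At or above the end threshold all timeouts become zero.
        return 0
    # Linear scaling between the two thresholds (integer seconds).
    return base_timeout * (adaptive_end - states) // (adaptive_end - adaptive_start)


# Hypothetical state limit; the 60% / 120% defaults come from pf.conf(5).
STATE_LIMIT = 100_000
ADAPTIVE_START = STATE_LIMIT * 60 // 100
ADAPTIVE_END = STATE_LIMIT * 120 // 100

if __name__ == "__main__":
    # 2% full, 90% full, and 120% of the limit, with a 60:00 rule timeout.
    for states in (2_000, 90_000, 120_000):
        print(states, adaptive_timeout(3600, states, ADAPTIVE_START, ADAPTIVE_END))
```

With a table that is only a few percent full, the state count stays well below adaptive.start, so this mechanism would not shorten any timeout.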

giankso

@jimp
Hi, first of all, thanks a lot for the clear list of kill mechanisms. It helps a lot in narrowing down the root cause in this case and might help others in the future.

Now, given I am not killing the states manually...
I can confirm the following:

State Killing on Gateway Failure: NOT ENABLED
Reset All States: NOT ENABLED
Firewall Optimization Options: CONSERVATIVE
MBUF Usage: 1%
Yes, we use CARP and multiple gateways, but the VMs are on the same cluster and run together all the time in a virtual networking environment. I could not find evidence in the logs of pfsync constantly deleting/adding nodes.
Specific Rule Timeout (as an attempt): 60:00
I would rule out protection mechanisms, since nothing prevents an immediate reconnection afterwards. Could you provide more details about such protections, if any are installed/configured by default on pfSense? I am not aware of any.
This behaviour seems too consistent to be caused by a gateway or the backup node becoming unreachable. It always seems to happen at around 120 seconds, which reminds me a lot of the default UDP timeout...

Sadly, at this point I can't find any remaining candidate in the list you provided.

How can I debug the cause of a killed state?
Is there a way to raise the logging level to show that much detail?

Thanks

jimp

Hard to say what might be happening if none of those are relevant.

If the state deletes are coming over pfsync, then you could see them by doing something like tcpdump -vvvnei <interface where pfsync traffic goes> proto 249 (pfsync is IP protocol 249), but there can be a ton of data there, so it may be hard to narrow down.

That could at least tell you which node is deleting the state, though, since the delete will go either primary -> secondary (in which case it was deleted locally) or secondary -> primary (in which case the secondary deleted it).

But trying to match up the IDs and find the relevant state among the data may still be tricky.
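
A complementary check on the firewall itself could be to snapshot the state table at a fixed interval and flag any state that vanished while it still had more time left than the interval. The Python sketch below assumes the text comes from something like pfctl -vvss, where each state is one unindented line followed by indented detail lines containing "expires in HH:MM:SS". The function names and the exact line layout are assumptions, so the regex may need adjusting to the real output.

```python
import re

# Assumed detail-line fragment, e.g. "age 00:46:40, expires in 00:13:20, ..."
EXPIRES_RE = re.compile(r"expires in (\d+):(\d+):(\d+)")


def parse_states(text):
    """Map each state (without its trailing state-name column) to seconds left."""
    states = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: a new state. Drop the last column (for example
            # MULTIPLE:MULTIPLE) so the key stays stable as the state changes.
            current = " ".join(line.split()[:-1])
            states[current] = None
            continue
        match = EXPIRES_RE.search(line)
        if match and current is not None:
            hours, minutes, seconds = (int(g) for g in match.groups())
            states[current] = hours * 3600 + minutes * 60 + seconds
    return states


def vanished_early(before, after, interval):
    """States present in `before`, gone in `after`, with more than `interval` s left."""
    return {
        key: remaining
        for key, remaining in before.items()
        if key not in after and remaining is not None and remaining > interval
    }
```

Feeding two consecutive snapshots through parse_states and then vanished_early with the polling interval in seconds would list candidates for a premature kill, together with the time they had left. It would not say who removed the state, only narrow down when it happened.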