Connections dropping under heavy load

pedreter

We have a CARP cluster running pfSense 2.4.4-p3 with Intel i350-P2 gigabit cards (SuperMicro, 12-core CPU, 16 GB RAM).

When the network load goes over approx. 30 Mb/s, connections crossing the FWs are dropped randomly: ssh disconnects, etc. If we try to reconnect, they work without problem for some seconds or minutes and then drop again...

We have tried all combinations of checksum offloading, TSO and LRO (enabling, disabling, etc.), with no better results...

We have followed the pfSense guide tuning-and-troubleshooting-network-cards.html, with no better results...

By the way, this problem was not present with the old pfSense 2.2.x on this very hardware.

Any ideas, please?

Thanks!

P.

Derelict

That probably has nothing to do with the amount of traffic (30Mb/sec is pretty much nothing) but the number of states.

What do the state levels look like?

pedreter

Thx @Derelict..

Values are on average...

State table size: 2% (30513/1630000)
and
MBUF Usage: 1% (28616/2000000)

i agree 309 Mbps is nothing... :-(

Thanks!

Derelict

Do you have State Killing on Gateway Failure checked in System > Advanced, Miscellaneous on either node?

i agree 309 Mbps is nothing... :-(

Your OP said 30Mb/sec.

pedreter

@Derelict

Sorry, a typo... 30Mbps... not 309

Thanks..

Derelict

Do you have State Killing on Gateway Failure checked in System > Advanced, Miscellaneous on either node?

pedreter

@Derelict

State Killing on Gateway Failure is unchecked

Thanks for your kindness and help, Derelict!

Derelict

On both nodes?

Well, something is killing the states. The default expiration of an ESTABLISHED:ESTABLISHED TCP connection is 24 hours of zero traffic.

People sometimes see this when adaptive pruning kicks in but at those state table levels that certainly should not be the case.

Again, this would have nothing to do with traffic load but something killing the state.
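For reference, pf.conf(5) describes adaptive timeouts this way: once the number of states exceeds adaptive.start, all timeouts are scaled by (adaptive.end - states) / (adaptive.end - adaptive.start), and by default adaptive.start is 60% and adaptive.end is 120% of the state limit. A rough sketch of that arithmetic, assuming those defaults have not been changed on this firewall (the function name is only for illustration, and pf itself does this with integer math):

```python
def adaptive_scale(states, limit, start_frac=0.6, end_frac=1.2):
    """Fraction of the configured timeout that would apply at this state count.

    Assumes the documented pf defaults: adaptive.start = 60% of the state
    limit, adaptive.end = 120% of the state limit.
    """
    start = limit * start_frac
    end = limit * end_frac
    if states <= start:
        return 1.0  # below adaptive.start: timeouts are not shortened
    if states >= end:
        return 0.0  # at or above adaptive.end: timeouts shrink to zero
    return (end - states) / (end - start)

# State count and limit as posted in this thread.
print(adaptive_scale(30513, 1630000))
```

With 30513 states against a limit of 1630000, the count is far below the 60% threshold, so no scaling would be expected. The timeouts and limits actually in effect can be checked with pfctl -s timeouts and pfctl -s memory.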

pedreter

@Derelict

Your words make sense to me... I will dig in that direction...

Thanks again!

pedreter

@Derelict said in Connections dropping under heavy load:

People sometimes see this when adaptive pruning kicks in but at those state table levels that certainly should not be the case.

Derelict,

Currently I have these values:

Firewall Maximum States: 1630000

but

net.pf.source_nodes_hashsize: 8192
net.pf.states_hashsize: 32768

Are they correct? Shouldn't they be bigger?

Thanks!

stephenw10

That's the default size and they are never normally an issue.
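As a rough sanity check, the average number of states per hash bucket is simply states divided by buckets. This is a simplified sketch: it assumes states spread evenly across buckets, which real traffic only approximates, and the function name is hypothetical.

```python
def avg_states_per_bucket(states, hash_size):
    """Average chain length if states were spread evenly over the buckets."""
    return states / hash_size

# Current state count from this thread against net.pf.states_hashsize.
print(round(avg_states_per_bucket(30513, 32768), 2))    # 0.93
# Worst case: the state table completely full.
print(round(avg_states_per_bucket(1630000, 32768), 1))  # 49.7
```

At the posted state count that works out to under one state per bucket on average.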

One thing you might try here is to disable pfSync on the secondary. It's possible you have an interface mismatch and the secondary is syncing back states onto the wrong interface, breaking them.
If you no longer lose connections with that disabled, check that the config of both firewalls matches exactly.
Though that would not normally be load related.

Steve

pedreter

@stephenw10 said in Connections dropping under heavy load:

That's the default size and they are never normally an issue.

Thanks Stephen..

When I do what you suggest, the state table grows hugely and very quickly... Is that normal? It gets back to normal if I reactivate pfsync on the secondary.

I am trying to find out whether it makes any difference...

Thanks!

stephenw10

How huge? It might be the secondary was killing most states and now it is not...

How many clients are behind it?

Steve

pedreter

@stephenw10

Wow... stephenw10, your remark is very interesting... Huge means (on average) a sudden growth from 25.0000 to 150.0000 entries... yes, that huge! And back to 25.000 if pfsync on the secondary is enabled again.

There are 15 clients behind the pfsense-cluster.

Why would the secondary want to kill states?

Thanks!

Derelict

Besides looking at the numbers of states (what is 150.0000 anyway? Is that one hundred fifty thousand or one million five hundred thousand?) does the issue with your states being killed (ssh sessions dying, etc) go away with pfsync disabled?

As Steve mentioned the first thing to do is verify all of your interfaces match up.

I use Diagnostics > Interfaces for this. The internal interface name (wan, lan, opt1, opt2, etc) and the physical interface name (igb0, ix1, re2, vxnet4) all need to match exactly between primary and secondary. The description should not need to match but for consistency I would make them match.
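A minimal sketch of that comparison, assuming the internal-to-physical assignments have been copied by hand from each node into two dictionaries. The function name and the sample assignments below are hypothetical, not taken from this cluster:

```python
def interface_mismatches(primary, secondary):
    """Return {internal_name: (primary_nic, secondary_nic)} for each difference.

    An interface missing on one node shows up with None on that side.
    """
    names = sorted(set(primary) | set(secondary))
    return {
        name: (primary.get(name), secondary.get(name))
        for name in names
        if primary.get(name) != secondary.get(name)
    }

# Hypothetical assignments, typed in from each node.
primary = {"wan": "igb0", "lan": "igb1", "opt1": "igb2"}
secondary = {"wan": "igb0", "lan": "igb2", "opt1": "igb1"}

print(interface_mismatches(primary, secondary))
# {'lan': ('igb1', 'igb2'), 'opt1': ('igb2', 'igb1')}
```

An empty result means the two mappings agree; any entry is an assignment that differs between the nodes.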

What you are seeing is not normal. There is obviously something wrong with your configuration. What that is is still unknown. Don't think either of us has ever seen this exact behavior before.

pedreter

Thanks Derelict...

Sorry again for my typo: correct figure is 150.000

I agree this does not look normal.

I migrated from 2.1.5 (worked so well!) to 2.4.4-p3 by installing 2.4.4 from ISO and then importing the config from an XML file.

There was no error importing the old config (there were no packages installed) and the interface names, descriptions and physical devices match exactly. In fact CARP is working.

Could the XML import have done something in 2.4.4 to cause this problem? Maybe something has been corrupted?

Thanks again!

Derelict

Doubtful.

You still have not answered the question: does the issue with your states being killed (ssh sessions dying, etc) go away with pfsync disabled?

Perhaps you should post your settings instead of just saying they match. Cannot count the times a poster has said things are one way when they, in fact, are not.

You did update both nodes to 2.4.4-p3, correct?

stephenw10

I mean 10k states per client does seem..... high! But it depends what those clients are doing. If those are all legitimate states then you could be hitting something else more quickly than we would otherwise expect.
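The 10k figure is simply 150,000 states over 15 clients. To see whether a few clients account for most of them, the states could be tallied by source address. A minimal sketch, assuming the state list has already been exported into (source, destination) pairs by some means not shown here; the function name and sample addresses are made up:

```python
from collections import Counter

def states_per_source(states):
    """Count states per source address from (src_ip, dst_ip) pairs."""
    return Counter(src for src, _dst in states)

# Average from the figures in this thread.
print(150000 // 15)  # 10000

# Hypothetical sample of exported states.
sample = [
    ("10.0.0.5", "192.0.2.10"),
    ("10.0.0.5", "192.0.2.11"),
    ("10.0.0.9", "192.0.2.10"),
]
print(states_per_source(sample).most_common(1))  # [('10.0.0.5', 2)]
```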

But, yeah, did disabling pfSync on the secondary correct the connection drops you were seeing?

Steve