Connections dropping under heavy load



  • We have a CARP cluster running pfSense 2.4.4-p3 on Intel i350-P2 gigabit cards (SuperMicro, 12-core CPU, 16 GB RAM).

    When the network load goes over approximately 30 Mb/s, connections crossing the firewalls are dropped at random: SSH sessions disconnect, and so on. If we reconnect, they work without problems for a few seconds or minutes and then drop again.

    We have tried every combination of checksum offloading, TSO, and LRO, enabled and disabled, with no improvement.

    We have followed the pfSense guide tuning-and-troubleshooting-network-cards.html, also with no improvement.

    By the way, this problem was not present with the old pfSense 2.2.x on this very hardware.

    Any ideas, please?

    Thanks!

    P.


  • LAYER 8 Netgate

    That probably has nothing to do with the amount of traffic (30Mb/sec is pretty much nothing) but the number of states.

    What do the state levels look like?



  • Thanks, @Derelict.

    The values are, on average:

    State table size: 2% (30513/1630000)
    and
    MBUF Usage: 1% (28616/2000000)

    I agree, 309 Mbps is nothing... :-(

    Thanks!


  • LAYER 8 Netgate

    Do you have State Killing on Gateway Failure checked in System > Advanced, Miscellaneous on either node?

    I agree, 309 Mbps is nothing... :-(

    Your OP said 30Mb/sec.



  • @Derelict

    Sorry, a typo: 30 Mbps, not 309.

    Thanks..


  • LAYER 8 Netgate

    Do you have State Killing on Gateway Failure checked in System > Advanced, Miscellaneous on either node?



  • @Derelict

    State Killing on Gateway Failure is unchecked.

    Thanks for your kindness and help, Derelict!


  • LAYER 8 Netgate

    On both nodes?

    Well, something is killing the states. The default expiration of an ESTABLISHED:ESTABLISHED TCP connection is 24 hours of zero traffic.

    People sometimes see this when adaptive pruning kicks in but at those state table levels that certainly should not be the case.

    Again, this would have nothing to do with traffic load but something killing the state.
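
To make the adaptive pruning point concrete, here is a rough sketch that compares the reported state count with the thresholds at which pf begins scaling timeouts. It assumes pf's documented defaults (adaptive.start at 60% and adaptive.end at 120% of the state limit); pfSense allows these to be overridden, so the percentages are assumptions, and the function names are purely illustrative.

```python
def adaptive_thresholds(state_limit, start_pct=60, end_pct=120):
    """Return (adaptive.start, adaptive.end) as state counts.

    The 60% / 120% defaults are assumed from pf's documented behaviour.
    """
    return state_limit * start_pct // 100, state_limit * end_pct // 100


def timeout_scale(states, state_limit, start_pct=60, end_pct=120):
    """Linear factor pf applies to timeouts once adaptive scaling starts."""
    start, end = adaptive_thresholds(state_limit, start_pct, end_pct)
    if states <= start:
        return 1.0  # below adaptive.start: timeouts are not scaled
    if states >= end:
        return 0.0  # at adaptive.end: timeouts effectively become zero
    return (end - states) / (end - start)


# Figures reported in this thread: 30513 states out of a 1630000 limit.
current, limit = 30513, 1630000
start, end = adaptive_thresholds(limit)
print(f"usage: {100 * current / limit:.1f}% of limit")
print(f"adaptive.start={start}, adaptive.end={end}")
print(f"timeout scale factor: {timeout_scale(current, limit)}")
```

Under those assumptions, scaling would only begin at 978000 states, far above the roughly 30000 reported, which is consistent with adaptive pruning not being the cause.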



  • @Derelict

    That makes sense to me. I will dig in that direction.

    Thanks again!



  • @Derelict said in Connections dropping under heavy load:

    adaptive pruning kicks in but at those state table levels that certainly should not be the case.

    Derelict,

    Currently I have these values:

    Firewall Maximum States: 1630000

    but

    net.pf.source_nodes_hashsize: 8192
    net.pf.states_hashsize: 32768

    Are they correct? Shouldn't they be bigger?

    Thanks!


  • Netgate Administrator

    That's the default size and they are never normally an issue.

    One thing you might try here is to disable pfSync on the secondary. It's possible you have an interface mismatch and the secondary is syncing states back onto the wrong interface, breaking them.
    If you no longer lose connections with that disabled, check that the configs of both firewalls match exactly.
    Though that would not normally be load related.

    Steve



  • @stephenw10 said in Connections dropping under heavy load:

    That's the default size and they are never normally an issue.

    Thanks Stephen..

    When I do what you suggest, the state table grows hugely and very quickly. Is that normal? It gets back to normal if I re-enable pfSync on the secondary.

    I am still trying to find out whether it makes any difference.

    Thanks!


  • Netgate Administrator

    How huge? It might be the secondary was killing most states and now it is not...

    How many clients are behind it?

    Steve



  • @stephenw10

    Wow, stephenw10, your remark is very interesting. Huge means (on average) a sudden growth from 25.0000 to 150.0000 entries. Yes, that huge! And it goes back to 25.000 if pfSync is enabled on the secondary again.

    There are 15 clients behind the pfSense cluster.

    Why would the secondary want to kill states?

    Thanks!


  • LAYER 8 Netgate

    Besides looking at the numbers of states (what is 150.0000 anyway? Is that one hundred fifty thousand or one million five hundred thousand?) does the issue with your states being killed (ssh sessions dying, etc) go away with pfsync disabled?

    As Steve mentioned the first thing to do is verify all of your interfaces match up.

    I use Diagnostics > Interfaces for this. The internal interface name (wan, lan, opt1, opt2, etc) and the physical interface name (igb0, ix1, re2, vxnet4) both need to match exactly between primary and secondary. The descriptions should not need to match, but for consistency I would make them match.

    What you are seeing is not normal. There is obviously something wrong with your configuration. What that is is still unknown. I don't think either of us has ever seen this exact behavior before.
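
As a complement to checking by eye, the interface comparison described above could be sketched against the two nodes' exported config.xml files. This is a minimal sketch that assumes each interface appears as a child of `<interfaces>` with the physical device in an `<if>` element; the sample XML below is hypothetical and heavily trimmed, and the function names are illustrative only.

```python
import xml.etree.ElementTree as ET


def interface_map(config_xml):
    """Map internal name (wan, lan, opt1, ...) to physical device (igb0, ...)."""
    root = ET.fromstring(config_xml)
    mapping = {}
    for iface in root.findall("./interfaces/*"):
        mapping[iface.tag] = (iface.findtext("if") or "").strip()
    return mapping


def interface_mismatches(primary_xml, secondary_xml):
    """Return (name, primary_device, secondary_device) for every difference."""
    primary = interface_map(primary_xml)
    secondary = interface_map(secondary_xml)
    names = sorted(set(primary) | set(secondary))
    return [
        (name, primary.get(name), secondary.get(name))
        for name in names
        if primary.get(name) != secondary.get(name)
    ]


# Hypothetical, trimmed configs for illustration only (assumed layout).
PRIMARY = """<pfsense><interfaces>
  <wan><if>igb0</if></wan>
  <lan><if>igb1</if></lan>
  <opt1><if>igb2</if></opt1>
</interfaces></pfsense>"""

SECONDARY = """<pfsense><interfaces>
  <wan><if>igb0</if></wan>
  <lan><if>igb2</if></lan>
  <opt1><if>igb1</if></opt1>
</interfaces></pfsense>"""

for name, p_dev, s_dev in interface_mismatches(PRIMARY, SECONDARY):
    print(f"{name}: primary={p_dev} secondary={s_dev}")
```

In this made-up example lan and opt1 are swapped on the secondary, which is the kind of mismatch that could cause synced states to land on the wrong interface.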



  • Thanks Derelict...

    Sorry again for my typo: the correct figure is 150.000.

    I agree this does not look normal.

    I migrated from 2.1.5 (which worked so well!) to 2.4.4-p3 by installing 2.4.4 from the ISO and then importing the config from an XML file.

    There were no errors importing the old config (no packages were installed), and the interface names, descriptions, and physical devices match exactly. In fact, CARP is working.

    Could the XML import have done something in 2.4.4 to cause this problem? Maybe something has been corrupted?

    Thanks again!


  • LAYER 8 Netgate

    Doubtful.

    You still have not answered the question: does the issue with your states being killed (ssh sessions dying, etc) go away with pfsync disabled?

    Perhaps you should post your settings instead of just saying they match. Cannot count the times a poster has said things are one way when they, in fact, are not.

    You did update both nodes to 2.4.4-p3, correct?


  • Netgate Administrator

    I mean 10k states per client does seem..... high! But it depends what those clients are doing. If those are all legitimate states then you could be hitting something else more quickly than we would otherwise expect.

    But, yeah, did disabling pfSync on the secondary correct the connection drops you were seeing?

    Steve
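
To judge whether roughly 10k states per client is legitimate, one option is to count states per source address. The sketch below assumes simplified IPv4 state lines in the style of `pfctl -ss` output (interface, protocol, `src -> dst` or `dst <- src`, then the state); the sample lines are invented, NAT-translated entries are not handled, and the function names are illustrative.

```python
from collections import Counter


def average_states_per_client(total_states, clients):
    """Plain average, e.g. 150000 states over 15 clients."""
    return total_states / clients


def states_per_source(lines):
    """Count states per source IPv4 address in simplified state lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if "->" in fields:
            src = fields[fields.index("->") - 1]  # outbound: source is on the left
        elif "<-" in fields:
            src = fields[fields.index("<-") + 1]  # inbound: source is on the right
        else:
            continue  # not a state line in the assumed format
        counts[src.rsplit(":", 1)[0]] += 1  # strip the port
    return counts


# Invented sample lines in the assumed format.
SAMPLE = """\
all tcp 192.168.1.10:51000 -> 203.0.113.5:443 ESTABLISHED:ESTABLISHED
all tcp 192.168.1.10:51001 -> 203.0.113.5:443 TIME_WAIT:TIME_WAIT
all udp 192.168.1.11:5353 -> 198.51.100.7:53 SINGLE:NO_TRAFFIC
all tcp 192.168.1.1:22 <- 192.168.1.12:40022 ESTABLISHED:ESTABLISHED
""".splitlines()

print(average_states_per_client(150000, 15))
for source, count in states_per_source(SAMPLE).most_common():
    print(source, count)
```

A single client holding most of the states would point at that client's traffic rather than at the firewall itself.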

