States being dropped when our cluster freezes!?
We've found an odd occurrence and we can't work out why it's happening.
We run a KVM/QEMU-based cluster and we've been having some I/O issues with it, causing nearly all of the VMs (about 30) to lock up temporarily and then continue after about a minute.
We've done the maths: approximately 3000-4000 of the states passing through the firewall come from cluster-based machines and activities, but we have a total of 25000-30000 states from all of the systems on the LAN/DMZ/others.
I'd understand if we only saw a small change in the state count - but we're losing almost every state!
Once the cluster starts working again and the VMs start ticking over, the states slowly creep back up and inbound network activity (even to non-cluster servers) returns to normal. I've attached a screenshot of the RRD state graphs.
Our pfSense firewall is NOT on the cluster and is running v2.0RC3. The logs don't show anything notable other than a few BOOTP/DHCP messages when the VMs start doing their thing.
What would cause the states to be cleared out?
States go away when:
- they're cleared, by the user or the system
- the connection is closed by the source or destination
- they time out
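For reference, on a pfSense/FreeBSD box you can watch each of those three mechanisms from the shell with standard pfctl commands (a sketch; run these from the firewall's console or an SSH session):

```shell
# List the current state table, one line per state
pfctl -s states

# Show state-table counters (current entries, inserts, removals, etc.)
pfctl -s info

# Flush the entire state table by hand (case #1 above) -- disruptive,
# every active connection will have to be re-established
pfctl -F states
```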
#1 doesn't match because the count doesn't drop to 0 immediately. If whatever was happening triggered something that reset all the states - for example, a gateway going offline will wipe the state table unless that option is disabled - there would be an immediate drop to 0.
#3 doesn't match because, assuming TCP, the stall is nowhere near long enough for the states to time out. The default timeout for established TCP connections is 24 hours.
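You can confirm the timeouts in force on your own box; pfctl can print the configured state timeouts directly (values are in seconds):

```shell
# Show the currently configured pf state timeouts, in seconds.
# tcp.established defaults to 86400 (24 hours), so an established TCP
# state should easily survive a one-minute cluster stall on its own.
pfctl -s timeouts
```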
#2 seems very likely. Even when the server side disappears, the client side is still out there trying to communicate. The client will give up on the connection after some number of seconds of no response (varying by OS) and close the TCP session. That closes out the state, which would produce the semi-gradual decline (versus dropping to 0 all at once) you're seeing there.
If connections to things other than the known-problematic cluster are also being lost, you may have a more significant network problem of some sort. Or something on that cluster you're not aware of affects the things off the cluster (database connections, many possibilities).
We do have a couple of our VPN gateways on the cluster, and these are known to the firewall... We don't have the gateway monitoring "kill states" option ticked either! I'll give it a try and see if that has any effect.
Like I said, the number of states from/to the cluster is only in the thousands, not the tens of thousands. But the gateways going away could well be the issue!
Ticking that hasn't made a difference. I'll untick it and see how that goes.
Can you enable net.inet.tcp.log_debug and syslog it off somewhere (it will generate a lot of logs in your case)? Check the logs at the time of, and just before, the drops - there should be a lot of activity in there and a reason why.
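Enabling that knob is a one-line FreeBSD sysctl (a sketch; the tunable exists on the FreeBSD base pfSense runs on):

```shell
# Turn on TCP debug logging; messages go to the kernel log / syslog
sysctl net.inet.tcp.log_debug=1

# To make it persist across reboots, add the line
#   net.inet.tcp.log_debug=1
# to /etc/sysctl.conf (or, on pfSense, add it as a System Tunable
# under System > Advanced)
```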
That sysctl has no relevance for traffic passing through the firewall; it only applies to connections initiated by or terminated on the firewall itself. You can manually hack in "set debug" at a higher level in /tmp/rules.debug and load that to log state removals.
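Roughly, that hack looks like this (a sketch for a pf.conf-era ruleset; note pfSense regenerates /tmp/rules.debug on every filter reload, so the edit is temporary):

```shell
# Add a line like this near the top of /tmp/rules.debug
# (pf debug levels on this pf version: none, urgent, misc, loud):
#   set debug loud

# Then load the modified ruleset so pf starts logging state activity:
pfctl -f /tmp/rules.debug

# pf's debug messages land in the kernel log; watch them with e.g.:
tail -f /var/log/system.log
```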