2.2.2 sudden instability, TCP sessions: "Operation not permitted: write failed"

andydills

We upgraded our cluster of 2 pfsense firewalls to 2.2.2 over the weekend. Everything was fine for about 48 hours, and then suddenly this morning TCP sessions would only work for a few seconds before locking up.

I manually failed over to the secondary, and everything works great through the secondary.

When I log into the console on the primary and ssh out to another server (on ANY interface…wan, lan, failover), ssh works fine for about 5 seconds at most, and then the session drops with an error of "Operation not permitted: write failed". I also verified that telnet sessions (for example) time out and die after initially working for a second, just without a helpful error.

We're currently running fine on the secondary thankfully, but I can't seem to find any indicators of what could be causing this.

Suggestions?

stephenw10

What did you upgrade from?
Are you seeing any errors in the system log?
Check the dashboard. If you came from 2.1.X make sure it's not reporting FreeBSD 8.3 there still. If so it hasn't rebooted correctly.

Steve

andydills

Upgraded from 2.0.3.

After the upgrade to 2.2.2, it worked for about 48 hours, until 4 AM this morning, when it suddenly stopped passing much traffic.

Nothing notable in any of the logs that I can see, aside from "sshd[41070]: fatal: Write failed: Operation not permitted"

The same config is currently working fine on the secondary (which also went 2.0.3->2.2.2).

Edit: Yes, I can confirm it's fully upgraded to 2.2.2…this is a datacenter environment and the upgrade was done on-site, and also the box has been rebooted a couple of times since with no improvement.

mer

du -sh /
make sure you have diskspace.
snort running?
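
If it helps, here is a minimal sketch of that disk check in Python. Note that `du -sh /` reports how much space is used, while free space is what matters here; `shutil.disk_usage` from the standard library reports both. The function name `free_space_report` is just something made up for illustration.

```python
import shutil


def free_space_report(path: str = "/") -> str:
    """Summarise free versus total space for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    gib = 2 ** 30
    # Guard against pseudo-filesystems that report a total size of zero.
    percent_free = (usage.free / usage.total * 100) if usage.total else 0.0
    return (
        f"{path}: {usage.free / gib:.1f} GiB free of "
        f"{usage.total / gib:.1f} GiB ({percent_free:.0f}% free)"
    )


print(free_space_report("/"))
```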

Supermule

Do a backup of the config on 2.0.3 and reinstall a vanilla 2.2.2

I had to do it that way since the upgrade from 2.1.5 was not working.

andydills

@mer:

du -sh /
make sure you have diskspace.
snort running?

No snort, plenty of diskspace.

I'm fairly certain the write error is relating to writing to the network socket descriptor, not writing to the disk.
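
To illustrate why I read it that way: "Operation not permitted" is the standard text for errno EPERM, and as far as I know a send that the packet filter refuses on FreeBSD is commonly reported to the application as EPERM, whereas a full disk would show up as ENOSPC. The sketch below is only an illustration; the helper name `describe_write_error` and the hint wording are my own assumptions, not anything from sshd or pfsense.

```python
import errno
import os


def describe_write_error(exc: OSError) -> str:
    """Return a rough hint about where a failed write most likely went wrong."""
    if exc.errno == errno.EPERM:
        # On a socket, EPERM usually means the packet was refused locally
        # (for example by a firewall rule or a missing state), not a disk issue.
        return "EPERM: write refused locally; check firewall rules and states"
    if exc.errno == errno.ENOSPC:
        return "ENOSPC: no space left on device; check disk usage"
    return f"errno {exc.errno}: {os.strerror(exc.errno)}"


# Simulate the kind of error sshd logged, without touching the network.
simulated = OSError(errno.EPERM, os.strerror(errno.EPERM))
print(describe_write_error(simulated))
```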

andydills

@Supermule:

Do a backup of the config on 2.0.3 and reinstall a vanilla 2.2.2

I had to do it that way since the upgrade from 2.1.5 was not working.

Hmm…I'm not against trying that, but why would it have worked fine on 2.2.2 for almost two full days?

Supermule

Good question, but I experienced a lot of issues when running the upgrade. Among those were missing .ko files, which made it into the full-install release but not the upgrade files.

I had to kill the 2.1.5 since it was messed up after the upgrade.

mer

@andydills:

@mer:

du -sh /
make sure you have diskspace.
snort running?

No snort, plenty of diskspace.

I'm fairly certain the write error is relating to writing to the network socket descriptor, not writing to the disk.

Ok, then it would likely be a queue not draining somewhere. There should be commands that let you inspect the network queues and buffers. There may be some information here: https://calomel.org/freebsd_network_tuning.html (look at the sysctl.conf section).

Have you been up 48 hours on the secondary yet? It would be an interesting data point if the secondary does not show the same issue after 48 hours.

andydills

Figured it out.

I have to say, what a huge letdown from the pfsense team for not mentioning this absolutely enormous change to pfsync in 2.2:

https://forum.pfsense.org/index.php?topic=93132.msg519077

The usual reason on 2.2.x for states to not sync is that the interfaces are mismatched. States in 2.2.x are interface-bound, meaning the interface is a part of the state. For example if the primary node has igb(4) NICs and the secondary has em(4), the states can't sync.

That can be worked around in a silly way by adding the NICs to single interface laggs so the states would be on lagg(4) interfaces on both.

This is why I'm having problems. The firewalls do not have consistent interface names. Why is this NOT at the top of the upgrade guide, in bold letters?

Once I disabled state table sync, the behavior of the firewall returned to normal. Tonight, I'll be implementing some workarounds, but seriously…this is just sloppy. For such a tremendous change, one which causes instability to the point of uselessness (try doing something when the state table resets every 5 seconds), this needs to be well documented and made clear.
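
For anyone else checking their own pair, the mismatch described in the quoted post can be sketched roughly like this. All interface assignments below are hypothetical examples (igb on one node, em on the other, as in the quote), and `find_interface_mismatches` is a made-up helper, not a pfsense tool; the real names would come from each node's interface assignments.

```python
from typing import Dict, Optional, Tuple


def find_interface_mismatches(
    primary: Dict[str, str], secondary: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return roles whose underlying interface name differs between two nodes."""
    mismatches = {}
    for role in sorted(set(primary) | set(secondary)):
        a, b = primary.get(role), secondary.get(role)
        if a != b:
            mismatches[role] = (a, b)
    return mismatches


# Hypothetical assignments: primary has igb(4) NICs, secondary has em(4).
primary_node = {"wan": "igb0", "lan": "igb1", "sync": "igb2"}
secondary_node = {"wan": "em0", "lan": "em1", "sync": "em2"}

for role, (a, b) in find_interface_mismatches(primary_node, secondary_node).items():
    # Interface-bound states created on `a` have no matching interface `b`.
    print(f"{role}: primary={a} secondary={b}")

# With the single-interface lagg workaround, both nodes expose the same names.
primary_lagg = {"wan": "lagg0", "lan": "lagg1", "sync": "lagg2"}
secondary_lagg = {"wan": "lagg0", "lan": "lagg1", "sync": "lagg2"}
print(find_interface_mismatches(primary_lagg, secondary_lagg))
```

An empty result means the interface names line up on both nodes, which is the condition the quoted post says state sync needs on 2.2.x.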

divsys

This is why I'm having problems. The firewalls do not have consistent interface names. Why is this NOT at the top of the upgrade guide, in bold letters?

Probably because not everyone implements pfSense in a HA or pfsync setup and there were other changes that might have been considered more pressing (that's just my guess).
I know there were some similar discussions in the CARP/VIPs section. You might want to review what's there for any more gotchas.

Glad you got it up and running.

andydills

I guess…I don't see anything else really on the upgrade guide that deals with potentially outage-causing issues like this, and they have whole sections on HA considerations.

You could also say that most people doing HA already have lagg groups configured. And while that is true, it doesn't excuse the omission of this critical piece of data.

Thanks for the followups and suggestions though, I don't mean to sound ungrateful, this is just a bit too sloppy for what I've come to expect from the pfsense team.