Pfsync failing in 2.2.3 (worked fine in 2.2.2)

andydills

This past weekend, we upgraded a customer's pfSense cluster to 2.2.3.

Previously, when we migrated them from 2.1.5 to 2.2.2, we had problems with pfsync, as we didn't have consistent interface names and order. We corrected that so that they're using lagg interfaces, and the order is consistent on the two firewalls.

We upgraded them on Friday night, and everything was fine until this morning. This morning, the state table on the primary firewall was continually getting reset. You couldn't maintain a connection for more than a few seconds. I managed to get in long enough to disable the state sync, and everything normalized.

So, to restate the current situation:

State sync was working fine for weeks under 2.2.2.
Interface names and order are consistent between primary and secondary.
Upgraded to 2.2.3.
Pfsync worked fine for about 48 hours.
Pfsync began to cause the primary to reset the state table every few moments.
Disabling state sync fixed the problem.

Any thoughts or suggestions? The instability in the HA feature set is really starting to become a black eye.
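
For anyone trying to confirm the same symptom, here is a minimal sketch of how the "state table keeps getting reset" pattern could be spotted from periodic samples of the state count. It assumes the count is scraped from the "current entries" line of `pfctl -s info`; the sample text, the sample counts, and the helper names `current_entries` and `find_resets` are made up for illustration and are not taken from the affected firewalls.

```python
import re

# Illustrative excerpt in the general shape of `pfctl -s info` output.
# This is an assumed sample, not output captured from the cluster in this thread.
SAMPLE = """\
State Table                          Total             Rate
  current entries                    48211
  searches                       903214455        10432.1/s
"""


def current_entries(pfctl_info_text):
    """Return the 'current entries' count from the text, or None if it is absent."""
    match = re.search(r"current entries\s+(\d+)", pfctl_info_text)
    return int(match.group(1)) if match else None


def find_resets(counts, drop_ratio=0.5):
    """Return the indexes where the count fell below drop_ratio times the previous sample."""
    resets = []
    for i in range(1, len(counts)):
        prev, cur = counts[i - 1], counts[i]
        if prev > 0 and cur < prev * drop_ratio:
            resets.append(i)
    return resets


# Hypothetical counts sampled a few seconds apart.
samples = [48211, 48350, 312, 2950, 140, 3100]

print(current_entries(SAMPLE))
print(find_resets(samples))
```

The 50% drop threshold is arbitrary; a busy firewall that normally holds tens of thousands of states and repeatedly falls to a few hundred would trip it, while ordinary churn should not.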

jimp

It works fine here on 2.2.3. How do you have "state killing on gateway failure" set on System > Advanced, Misc tab? If the secondary thinks a gateway is down it could be resetting the states on you, which would then sync to the primary.

andydills

@jimp:

It works fine here on 2.2.3. How do you have "state killing on gateway failure" set on System > Advanced, Misc tab? If the secondary thinks a gateway is down it could be resetting the states on you, which would then sync to the primary.

Thank you for the response, I do believe you may be on to something, as the setting for this option is not the same on the two servers. The box was checked on the secondary, but not on the primary.
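
A mismatch like this is easy to miss by eye, so here is a rough sketch of how one might compare which flag-style options are present in one node's exported configuration but not the other's. It assumes the configuration is an XML export with a `system` section in which a checked box shows up as an empty element; the tag name `kill_states` and both sample documents are hypothetical stand-ins, not confirmed pfSense config keys.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed config exports. The <kill_states/> tag name is an
# assumption used only to illustrate a checkbox that is set on one node only.
PRIMARY = "<pfsense><system><hostname>fw1</hostname></system></pfsense>"
SECONDARY = "<pfsense><system><hostname>fw2</hostname><kill_states/></system></pfsense>"


def section_tags(xml_text, section):
    """Return the set of child tag names directly under the given top-level section."""
    root = ET.fromstring(xml_text)
    node = root.find(section)
    return set() if node is None else {child.tag for child in node}


def presence_mismatches(primary_xml, secondary_xml, section="system"):
    """List the tags that exist in only one of the two configs for that section."""
    a = section_tags(primary_xml, section)
    b = section_tags(secondary_xml, section)
    return {"only_primary": sorted(a - b), "only_secondary": sorted(b - a)}


print(presence_mismatches(PRIMARY, SECONDARY))
```

This only compares presence, not values, which is enough for checkbox-style options but would not catch two nodes that both have a setting with different contents.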

Can you provide a little confirmation as to the behavior? The wording is a little odd:

"The monitoring process will flush states for a gateway that goes down if this box is not checked. Check this box to disable this behavior."

So, if the box is checked, "State Killing on Gateway Failure" is disabled? That would seem to run counter to the rest of the pfSense interface, in which you check something to enable it. Otherwise, the option should be titled "Disable State Killing on Gateway Failure".

I'm confused further because the primary had the option unchecked (meaning the state killing was, according to the text, enabled?), while the secondary had it checked (meaning it should have been disabled). Yet, I agree with your sentiment that the secondary was resetting the states on the primary. EDIT: The primary reason why I agree with you that the secondary was resetting the states is that the behavior ceased when I disabled state mirroring. If the primary were resetting its own states as a result of gateway monitoring, disabling sync would not have fixed it. This seems to imply the wording of the "State Killing on Gateway Failure" option is incorrect and backwards.

Note also that gateway monitoring is disabled for every gateway. If gateway monitoring is disabled, shouldn't all of its subfeatures be automatically disabled as well?

Edit: A little bit to add, there is absolutely nothing in any of the logs about any of this. If the states are being killed for a gateway failure, I would argue that should be in the gateway log, the system log, and it should probably be an alert that needs to be acknowledged in the GUI. Thoughts?

jimp

If a gateway goes down, it is logged in the gateways log and you'll see evidence in the main system log as well.

To stop the states from being cleared for that, uncheck the box from both sides, but that may not be the case here.

Worth unchecking on both to be certain though. Also check the gateway status on both, make sure they all show up. If a gateway is marked down and stays down it could do that as well, though it wouldn't necessarily always show a transition since it's down and staying down.

andydills

@jimp:

If a gateway goes down, it is logged in the gateways log and you'll see evidence in the main system log as well.

To stop the states from being cleared for that, uncheck the box from both sides, but that may not be the case here.

Worth unchecking on both to be certain though. Also check the gateway status on both, make sure they all show up. If a gateway is marked down and stays down it could do that as well, though it wouldn't necessarily always show a transition since it's down and staying down.

Ok, so are we in agreement that the boxes should be unchecked to disable the state clearing? If so, I guess I should submit a bug report to correct the wording?

jimp

The wording is correct, though confusing. Cleaning up those types of things is already on our list.

andydills

@jimp:

The wording is correct, though confusing. Cleaning up those types of things is already on our list.

Confusing indeed!

The wording states: "The monitoring process will flush states for a gateway that goes down if this box is not checked. Check this box to disable this behavior."

Am I crazy or does that state that checking the box disables the state flushing?
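
To spell out my literal reading of that help text, here it is as a tiny sketch. It encodes only the quoted sentence, not the actual pfSense implementation, and the function name is made up for illustration.

```python
def flushes_states_on_gateway_down(box_checked):
    """Literal reading of the help text: states are flushed only if the box is NOT checked."""
    return not box_checked


# Settings as found on the two firewalls, per the description earlier in the thread.
nodes = {"primary": False, "secondary": True}  # value = is the box checked?

for name, checked in nodes.items():
    flushing = flushes_states_on_gateway_down(checked)
    print(f"{name}: box checked={checked}, state flushing enabled={flushing}")
```

Read this way, the primary (unchecked) is the node with state flushing enabled and the secondary (checked) has it disabled, which is the opposite of what the advice to uncheck the box would suggest.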

andydills

@jimp:

If a gateway goes down, it is logged in the gateways log and you'll see evidence in the main system log as well.

To stop the states from being cleared for that, uncheck the box from both sides, but that may not be the case here.

Worth unchecking on both to be certain though. Also check the gateway status on both, make sure they all show up. If a gateway is marked down and stays down it could do that as well, though it wouldn't necessarily always show a transition since it's down and staying down.

Just to confirm, there were no such logged gateway failures (the only thing in the gateway log is about how apinger has no targets and is exiting), and both gateways show as up. There were no network events either. Looking at upstream traffic graphs, this must have happened almost right at 2 AM last night, but nothing corresponds to 2 AM in the logs of either firewall or the upstream switches.

So, I'm left to wonder:

Could the state resetting even get triggered if a gateway isn't being monitored? What is the mechanism here?
Is the mere fact that the options were different on the two servers enough to cause this problem, even if no gateways went down or were monitored to begin with? I would hope not, but my confidence is a bit lower at this point.
Is this possibly a memory corruption issue or other hardware related problem?

At this point I'm left with the unfortunate reality of advising that they are better off with state sync disabled, and the consequence of reset TCP sessions in the rare event of a failure, given that pfsync has caused several problems for them since the 2.2 release.