@jimp:
If a gateway goes down, it is logged in the gateways log and you'll see evidence in the main system log as well.
To stop the states from being cleared for that, uncheck the box on both sides, though that may not be what happened here.
It's worth unchecking on both to be certain, though. Also check the gateway status on both and make sure they all show up. If a gateway is marked down and stays down, it could do that as well, though it wouldn't necessarily always show a transition since it's down and staying down.
Just to confirm: there were no logged gateway failures (the only thing in the gateway log is apinger reporting that it has no targets and is exiting), and both gateways show up. There were no network events either. Looking at the upstream traffic graphs, this must have happened almost exactly at 2 AM last night, but nothing in the logs of either firewall or the upstream switches corresponds to 2 AM.
So, I'm left to wonder:
Could the state resetting even get triggered if a gateway isn't being monitored? What is the mechanism here?
Is the mere fact that the options were different on the two servers enough to cause this problem, even if no gateways went down or were monitored to begin with? I would hope not, but my confidence is a bit lower at this point.
Is this possibly a memory corruption issue or some other hardware-related problem?
At this point I'm left with the unfortunate reality of advising them that they are better off with state sync disabled, accepting reset TCP sessions in the rare event of a failure, given that pfsync has caused several problems for them since the 2.2 release.