No State Creator Host IDs visible
-
As this problem bit me several times now when upgrading CARP-clusters I tried to troubleshoot it a bit more. Here is what I found out:
I read somewhere that there were protocolchanges in pfsync between 2.7.0 and 2.7.2. So if you upgrade the nodes one by one there is the situation where one system is already on 2.7.2 and the other one still on 2.7.0. During this time 2.7.0 tries to sync states to 2.7.2 and vice versa. Due to the versionmismatch the statetable seems to have invalid records in the list that causes the too many creators error and causes the statesync to fail. Even after both nodes reach 2.7.2 those invalid records still remain in the statetables, which then still breaks the sync.
The resolution to this is:
Either reboot both nodes at the same time, so the broken records are not synced back and forth. Downtime about 1-2 minutes on fast systems.
Or disable the statesync via pfsync on both nodes (system>high availability), reset the statetables on both nodes (Diagnostics>States, Reset states). Then reenable the statesync again. Downtime a few seconds as only the dropped connections need to be reestablished.Unfortunately both procedures cause a disruption of traffic, though the second way of getting pfsync going again is rather short. An additional impacti is, that during the upgradeprocess statesync is broken from the time the first node reboots till the states get cleared on both nodes after both nodes are upgraded.
Hope this helps some people that run into the same issues.
Regards
Holger -
This seemed to work for one of our sets, so THANK YOU!
However, for anyone else that might be in the same boat, our state table was colsossal and therefore this may be a treatment, rather that a cure. *Or may be indicitave of an issue on one of the local networks
*I waited to hit submit until I found the issue - Camera system, wide open on separate VLAN in this case
We had the luxury (!) to run this remotely, in non peak times, with alternative, remote access, on the local intefaces - rather than WAN.
It took about 16 minutes for both on Xeon 3.2 physical, 8 vCpu, 8GB RAM 128GB fixed.
In fact, if you have alternative remote access to the local network, I'd recommend the exact of above with the states, wait for the states to clear, reboot each, then reenable 'System > High Availability > Synchronize states
I don't recommend doing this if you will have to travel several hours to complete on-site, and you don't have alternative remote access to the site. *just my 2¢