Pfsync not syncing states to backup (2.2.2)

podilarius

Hello Everyone,
I am trying to set up CARP again on a new cluster under a new version (2.2.2). I can get settings to sync, but the states are not syncing.
I have checked all rules and googled the crap out of this. No matter what I try, I cannot get the primary states to sync to the backup.
The same config worked under 2.1.5.
I have found a bug in Redmine (3876).
Has anyone else seen the same behavior and what is the fix?
Thanks

David Handelman

I have the same issue,
New install but no sync.
tcpdump shows traffic on the sync interface

podilarius

I am seeing traffic as well (pfsyncv5), but none of the states are making it into the state table on the secondary.
I am seeing "kernel: carp: demoted by 0 to 0 (pfsync bulk fail)" in the logs of both systems.
But I am also seeing:

Apr 27 18:47:57 	php-fpm[32692]: /system_hasync.php: pfsync done in 30 seconds.
Apr 27 18:47:25 	php-fpm[32692]: /system_hasync.php: waiting for pfsync...
Apr 27 18:47:24 	kernel: carp: demoted by 0 to 0 (pfsync bulk done)
Apr 27 18:47:24 	kernel: carp: demoted by 0 to 0 (pfsync bulk start)

According to the bug, it was a change in the kernel. Could that have been reverted? Scanning the code now.
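
For anyone who wants to scan their own logs for these messages, here is a minimal Python sketch. It is not part of pfSense; the function name `bulk_events` is made up for illustration, and the regular expression is modelled only on the kernel lines quoted above.

```python
import re

# Pattern modelled on the kernel lines quoted in this thread, e.g.
# "kernel: carp: demoted by 0 to 0 (pfsync bulk fail)"
BULK_RE = re.compile(
    r"carp: demoted by (-?\d+) to (-?\d+) \(pfsync bulk (start|done|fail)\)"
)


def bulk_events(lines):
    """Return the pfsync bulk-update phases found in the lines, in input order."""
    events = []
    for line in lines:
        match = BULK_RE.search(line)
        if match:
            events.append(match.group(3))
    return events


# Sample lines copied from the log excerpt above, oldest first.
sample = [
    "Apr 27 18:47:24 kernel: carp: demoted by 0 to 0 (pfsync bulk start)",
    "Apr 27 18:47:24 kernel: carp: demoted by 0 to 0 (pfsync bulk done)",
    "Apr 27 18:47:25 php-fpm[32692]: /system_hasync.php: waiting for pfsync...",
]
print(bulk_events(sample))  # ['start', 'done']
```

A `fail` entry in the result would correspond to the "pfsync bulk fail" message.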

David Handelman

I have downgraded to 2.2.1 and to 2.2-RELEASE and it still does not sync…
Does it make any sense that it has not been working since the 2.2 release?

edinburgh1874

Do you have the backup/master referencing each other under "pfsync Synchronize Peer IP" on "State Synchronization Settings"?

I was receiving "kernel: carp: demoted by 0 to 0 (pfsync bulk fail)" on 2.2.2 as I mistakenly had set these to the master on both FWs.
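
A rough way to sanity-check this is sketched below in Python. The dictionary keys, the function name and the addresses are all hypothetical and only illustrate the idea that each firewall's "pfsync Synchronize Peer IP" should be the other firewall's sync address.

```python
def peers_reference_each_other(primary, secondary):
    """Check that each node's configured pfsync peer is the other node's sync address.

    Each argument is a dict with two hypothetical keys:
      'sync_ip' - the node's own address on the sync interface
      'peer_ip' - the value entered as "pfsync Synchronize Peer IP"
    """
    return (
        primary["peer_ip"] == secondary["sync_ip"]
        and secondary["peer_ip"] == primary["sync_ip"]
    )


# Hypothetical addresses, for illustration only.
good_primary = {"sync_ip": "10.0.0.1", "peer_ip": "10.0.0.2"}
good_secondary = {"sync_ip": "10.0.0.2", "peer_ip": "10.0.0.1"}

# The mistake described above: both firewalls point at the master.
bad_primary = {"sync_ip": "10.0.0.1", "peer_ip": "10.0.0.1"}

print(peers_reference_each_other(good_primary, good_secondary))  # True
print(peers_reference_each_other(bad_primary, good_secondary))   # False
```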

podilarius

I had it set that way and I also set it up for multicast. Neither seems to work, but I am noticing that some of the states from the OpenVPN connections are in there. None of the LAN or WAN connection states are syncing.
Is yours currently working with the IPs correctly entered?

edinburgh1874

Yeah once I made this change it all looks good, haven't seen that error in the logs since rebooting.

Running 2.2.2 64-bit and 32-bit, in ESXi - I had to make sure the vSwitch was set to accept promiscuous traffic.

podilarius

I got that error to go away also once I went to multicast or directed. My states are still not syncing. I can fail over and fail back, but connections have to be re-established. Are you seeing the same, or are all your states syncing per normal?
I am going to set up a quick test with default settings to check on that as well.
Thank you for testing on your machine.

edinburgh1874

Everything is working as expected, states are synced and it fails over OK - no pings are dropped to 8.8.8.8 during the failover.

On my setup, I had mistakenly left the sync interface netmask set to /31 on both FWs - it didn't give any XMLRPC communication errors, so I didn't notice any issue until I checked states.

Moving both interfaces to /24 stopped the "pfsync bulk fail" error, and caused states to be synced.

Switching back to /31 causes the issue to reappear.
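
A small Python sketch of this kind of check, using the standard `ipaddress` module. The function name and the addresses are made up for illustration, and flagging /31 is based only on the behaviour observed above, not on any documented pfsync requirement.

```python
import ipaddress


def check_sync_link(addr_a, addr_b):
    """Return a list of possible problems with two sync interface addresses.

    Addresses are given in CIDR form, e.g. "10.0.0.1/24".
    """
    a = ipaddress.ip_interface(addr_a)
    b = ipaddress.ip_interface(addr_b)
    problems = []
    if a.network != b.network:
        problems.append("sync interfaces are not in the same subnet")
    # Assumption: only flags IPv4 /31 and /32, based on the report in this thread.
    if a.version == 4 and min(a.network.prefixlen, b.network.prefixlen) >= 31:
        problems.append("prefix is /31 or longer; a wider netmask may be worth trying")
    return problems


# Hypothetical addresses, for illustration only.
print(check_sync_link("10.0.0.1/24", "10.0.0.2/24"))  # []
print(check_sync_link("10.0.0.0/31", "10.0.0.1/31"))
```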

Maybe check your IP config? Could you describe it a bit more?

podilarius

I am running IPv4 and IPv6, but only on LAN and WAN. I am using a dedicated 1 Gb NIC on both machines for CARP pfsync (the cluster network). I have 3 static OpenVPN connections. I am using mostly 1:1 NAT, but I do have some port forwards. Both IPv6 and IPv4 are set up to use CARP. I am not getting any errors on either server.
I already have the network set up as a /24. This is an old cluster I am upgrading to 2.2.2 from 2.1.5. I am running Unbound as the DNS server, but I don't have anything but the OpenVPN client export utility as an extra package.
It does appear that the OpenVPN states are syncing, but the states on the LAN and WAN interfaces themselves are not.

Edit
Forgot to add that I am using the Traffic Shaper as well.

#Edit#
It appears that the states opened directly to the LAN and WAN interfaces are syncing along with the OpenVPN states. tcpdump has something about "act UNKNOWN id {128,248}" and status: UNKNOWN on some bulk updates. I can see the state updates in the tcpdumps, but nothing is getting added to the state table on the secondary.

podilarius

I have set up 2 VMs in the lab and they are working with as many VIPs, OpenVPN connections, skews, and just about everything else, except for the 1:1 NAT mappings.
Where can I look to report something back on the one that is failing?

dbennett

I found something that might be of interest.

When you send the MASTER into maintenance mode, the state table shows up on the BACKUP, minus the actual interface IPs of the MASTER.

Is there a value somewhere in sysctl that can be 'switched' to realtime sync instead of 'only when I need it'?

EDIT: In other words… is it possible that the packet the MASTER creates for the BACKUP should be sent sooner? Is it possible that the 'state table packet' to be sent is being gutted because it's waiting too long to be updated? Something... Thoughts?

podilarius

Interesting thought. I took my same config and put it behind a simulated router (pfSense simulating the provider's router) in a virtualized environment. There were no issues. So I am thinking it is a hardware issue. I have igbX on the main firewall and 2x reX and one em0 (cluster NIC).
The secondary is an Atom 330 based system and the main is a C2758 based system. Any known issue here?

podilarius

JimP provided the answer on forum post https://forum.pfsense.org/index.php?topic=93132.msg519077#msg519077.
I don't have matching NICs for WAN and LAN. This seems like a departure from how it used to work.
I am very curious why states now carry NIC info.
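
To spot this kind of mismatch quickly, something like the following Python sketch could be used. The function name and the interface assignments are hypothetical (the device names only loosely follow the igbX/reX hardware mentioned above); it simply compares which device backs each logical interface on the two nodes.

```python
def mismatched_interfaces(primary, secondary):
    """Return logical interfaces whose underlying device differs between nodes.

    Each argument maps a logical name (e.g. "WAN") to a device name (e.g. "igb0").
    """
    return {
        name: (primary[name], secondary[name])
        for name in sorted(primary.keys() & secondary.keys())
        if primary[name] != secondary[name]
    }


# Hypothetical assignments, for illustration only.
primary = {"WAN": "igb0", "LAN": "igb1"}
secondary = {"WAN": "re0", "LAN": "re1"}

print(mismatched_interfaces(primary, secondary))
# {'LAN': ('igb1', 're1'), 'WAN': ('igb0', 're0')}
```

If states really do carry the device name, every entry in the result is an interface whose states would not match up on the other node.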

sepp_huber

I am very curious why states now carry NIC info.

Me too… :(

It took me half a day to understand LAGGs and then to configure my HA setup with them. Because you cannot add interfaces that are already in use to a LAGG, you have to switch to temporary values in the interface assignment, and that also requires rebooting many times because of ARP problems.

The solution from https://forum.pfsense.org/index.php?topic=93132.msg519077#msg519077, stated more precisely:
You have to create a LAGG (Interface assignment -> LAGG) with Lag proto "FAILOVER" containing one interface, and then use this LAGG instead of the device you used before, i.e. rl0 => LAGG0 (LAN)
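
As a planning aid only, here is a minimal Python sketch of that mapping. The function name and the output structure are invented for illustration and are not a pfSense config format; the assumption is that both nodes should end up with the same laggN name for the same logical interface.

```python
def lagg_plan(assignments):
    """Map each assigned device to a single-member failover LAGG.

    `assignments` maps a logical name (e.g. "LAN") to a device (e.g. "rl0").
    LAGGs are numbered in sorted order of the logical names, so two nodes
    with the same logical names get the same laggN for each of them.
    """
    plan = {}
    for index, (name, device) in enumerate(sorted(assignments.items())):
        plan[name] = {
            "lagg": f"lagg{index}",
            "proto": "failover",
            "members": [device],
        }
    return plan


# The example from this post: rl0 => LAGG0 (LAN)
print(lagg_plan({"LAN": "rl0"}))
# {'LAN': {'lagg': 'lagg0', 'proto': 'failover', 'members': ['rl0']}}
```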

Everybody who does not use strictly identical hardware for master and backup will run into this problem.
Sorry, but this CARP change MUST BE listed in the Upgrade Guide!

Greetings

podilarius

LAGGs are good, but clustering should work without needing a LAGG. Each firewall is already plugged into a different switch. If the switch on the primary fails, the secondary takes over, even if there is no hardware failure on the primary.
Just saying.
Still, a change this big should have been listed in the upgrade guide, as sepp_huber says.
At least there is a workaround and we are not stuck.

podilarius

Just a confirmation.
I did the workaround and the LAGG setup is working as intended.