Pfsync and CARP failover sequence
I have two pfSense boxes in a failover setup: identically connected, syncing rules and connection states, and sharing CARP IPs for a bunch of stuff.
I think I have a sequencing problem.
Doing a bit of testing, I rebooted my primary firewall box. As expected, all states were synced to the backup, and everything continued without a hiccup: my SSH connections to the outside world didn't drop, nor did my RDP connections, etc. The CARP VIPs flipped over and the backup picked right up where the primary left off.
However, when the primary came back up, I did lose connection to some things. Some long-running connections stayed up (as shown in pfTop), but many of the more active ones went down.
I'm assuming this is a sequencing problem: my guess is that the CARP VIPs flip back to the primary before pfsync has finished syncing connection states.
Is this the experience that everyone has, and is there a way to correct it?
Make sure pfsync is enabled on both, using the right interface, and that you have entered the IP for the other unit into the pfsync peer box. Also make sure your rules on the sync interface will at least pass pfsync from the sync subnet to 'any'.
I did try explicitly setting the sync peer IP address on the High Avail Sync page (I had previously left it at the default, so it was using multicast).
I still see the same behavior, though. When the primary goes down and things fail over to the backup, my TCP connections stay up. However, when the primary finishes restarting and takes the CARP IPs back, the more chatty TCP connections drop, while those that aren't chatty recover. The state table looks synced between the two after the restart, including states that were added while the backup was acting as primary.
Any other ideas to try to tweak things so I can restart the primary without noticing?
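One way to go beyond "looks synced" is to diff the two state tables right after the primary comes back. The sketch below assumes you have saved a text dump of the state table from each node (for example the output of `pfctl -ss`) and simply reports entries that appear on one node but not the other. The function names are made up for illustration, and states that are changing at the moment of the dump may show up as differences even when sync is healthy.

```python
from __future__ import annotations


def parse_states(dump: str) -> set[str]:
    """Turn a state-table text dump into a set of normalized one-line entries."""
    states = set()
    for line in dump.splitlines():
        entry = " ".join(line.split())  # collapse runs of whitespace
        if entry:
            states.add(entry)
    return states


def missing_states(primary_dump: str, backup_dump: str) -> dict[str, set[str]]:
    """Report entries present on one node but absent from the other."""
    primary = parse_states(primary_dump)
    backup = parse_states(backup_dump)
    return {
        "only_on_primary": primary - backup,
        "only_on_backup": backup - primary,
    }
```

If the dropped connections show up under `only_on_backup` just after the primary takes the VIPs back, that would support the theory that CARP preempts before the state table has been fully transferred.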
I am experiencing the exact same problem. I have tried tweaking the Advertising Frequency with no luck. I have also tried adjusting the SSH daemon on the server to keep the SSH session alive (TCPKeepAlive, ClientAliveInterval, ClientAliveCountMax) - no luck.
Did you ever get this issue resolved?
If you have pfsync set up right, it transitions smoothly both ways. Make sure it's enabled on both the primary and secondary units, with the correct interface selected, using the opposing unit's IP for the sync peer, and that you have firewall rules on both to pass the pfsync traffic.
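The checklist above can be written down as a small sanity check. This is only a sketch: the `SyncSettings` fields are hypothetical names for values you would read off each unit's sync settings page by hand, not anything pfSense exposes under these names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SyncSettings:
    """Hand-entered settings for one unit (hypothetical field names)."""
    pfsync_enabled: bool
    sync_interface: str      # interface selected for state sync
    own_sync_ip: str         # this unit's address on the sync interface
    peer_ip: str             # value entered in the pfsync peer box
    passes_pfsync: bool      # a rule on the sync interface passes pfsync


def check_pair(primary: SyncSettings, secondary: SyncSettings) -> list[str]:
    """Return a list of problems; an empty list means the checklist passes."""
    problems = []
    for name, unit, other in (("primary", primary, secondary),
                              ("secondary", secondary, primary)):
        if not unit.pfsync_enabled:
            problems.append(f"{name}: pfsync is not enabled")
        if not unit.sync_interface:
            problems.append(f"{name}: no sync interface selected")
        if unit.peer_ip != other.own_sync_ip:
            problems.append(f"{name}: peer IP does not match the other unit's sync IP")
        if not unit.passes_pfsync:
            problems.append(f"{name}: no rule passing pfsync on the sync interface")
    return problems
```

An empty result only says the settings are consistent with each other. It does not prove that states actually arrive on the peer.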
I have a dedicated LAN for the CARP sync. The rules are synced correctly to the slave, as are the other Configuration Synchronization Settings that I have enabled.
I have an SSH connection from my computer to a server on another subnet, routed through the pfSense pair (two pfSense boxes set up with HA/CARP).
If I disable CARP on the master, the slave takes over quickly and without any problem, and if I re-enable CARP on the master, the master takes over again with no problem.
I can repeat this several times without a problem.
When I reboot the master, the slave takes over, but when the master is done rebooting and takes over again, the SSH connection to the server drops. This does not happen every time, but if it does not happen on the first reboot, it will happen on the second.
I have tried setting outbound NAT to the virtual IP and fiddling with other settings, but it is mostly guessing. So any suggestions as to what may be causing this are highly appreciated. The pfSense boxes are running on a VMware ESXi 5.5 server.
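Since the drop is intermittent, it may help to test with something more repeatable than an interactive SSH session. The sketch below holds a single TCP connection open through the firewall pair and sends one byte per interval, counting how many probes were echoed back before the connection broke. It assumes there is some echo-style service on the far side that returns each byte it receives. The host, port, and function name are placeholders.

```python
import socket
import time


def hold_connection(host: str, port: int, interval: float = 1.0,
                    max_probes: int = 60, timeout: float = 5.0) -> int:
    """Keep one TCP connection open and count probes echoed before it breaks."""
    ok = 0
    with socket.create_connection((host, port), timeout=timeout) as sock:
        for _ in range(max_probes):
            try:
                sock.sendall(b"x")
                if sock.recv(1) != b"x":
                    break  # peer closed the connection or sent something unexpected
            except OSError:
                break  # reset or timeout: the session did not survive
            ok += 1
            time.sleep(interval)
    return ok
```

Running it once with a short interval (a "chatty" connection) and once with a long one, while rebooting the master, would show whether only active connections are lost when the master takes the VIPs back.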
I agree. This is my test setup so far:
WAN-Carp: 172.16.0.1/23 LAN: 192.168.0.10/23
Ping tests show no issues and the failover is pretty seamless. However, if I have an SSH session running during a failover [Master -> Slave or vice versa], the SSH session fails.
Since the ping tests work, I am inclined to say the failover itself works, but the states are not being maintained across the failover, hence SSH fails. Any insight on what I might be missing? I followed instructions outlined here to get this up:
I am not sure if there is anything apart from the setup itself that would cause this behavior. Both nodes are running on dedicated hardware.