CARP VIP member recovery problems
I recently lost the master pfSense node in my CARP cluster. Failover went well, but I am having trouble with the VIPs on the restored master. I took a backup of the secondary node, restored it on the primary, and then changed the interface IPs back to those of the original master. According to the logs, CARP can communicate with the other node, the VIPs can slide back and forth, and it can even push the config over from secondary to primary. However, when the VIPs are on the restored master, they don't work: no ping reply, no VPN tunnels. As soon as I disable CARP on the master and the VIPs slide back to the secondary, everything works as expected. I have looked at the config on both units and made sure that the VIP weight (advskew) is different on each. I even temporarily changed the VIP address on the master node to see if editing that record would give it a good kick, but I am out of ideas.
Is there a known issue with recovering by restoring the config from one node onto the other? If so, can someone provide the correct procedure? Obviously I don't have independent backups of both, or I would have used them.
One thing to add: I did power the master up with the secondary completely unplugged, and I still could not get the VIPs to reply to pings or accept IPsec tunnels.
Please check the HA settings. pfSense should only sync settings from master to slave, not in a loop.
If the slave is turned off, does the master show MASTER for the VIP in Status > CARP?
@bepo Thanks for the quick reply. I should have mentioned this: I disabled config syncing completely, on the slave as well, before I booted the master for the first time. My first boot was with the slave powered off. The master indicated it held the VIPs (MASTER), but there were no pings, etc. I then powered up the slave and watched them sync states in the log. Next I forced a failover using the disable button on the CARP status screen. Once the VIPs moved to the slave, everything worked. I then enabled config sync from slave to master only and checked the logs to confirm it actually synced. I rebooted both, one at a time, and am still in the state I described.
You didn't have a good backup of the primary to restore? That would be the best way.
HASYNC should have NEVER been configured on the secondary.
If nothing else you should have turned the secondary into the primary (setting advskew to 0, then enabling HASYNC, etc) then built a new secondary and synced the configuration over.
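A quick way to see which role a restored config is carrying is to list the CARP VIPs and their advskew values from a backup file. This is only a minimal sketch: the element names (virtualip, vip, mode, vhid, advskew, subnet) are assumed from a typical pfSense config.xml, and the sample addresses are invented for illustration.

```python
import xml.etree.ElementTree as ET

def carp_vips(config_xml):
    """Return (interface, vhid, advskew, address) for each CARP VIP."""
    root = ET.fromstring(config_xml)
    vips = []
    for vip in root.findall("./virtualip/vip"):
        if vip.findtext("mode") != "carp":
            continue  # skip IP aliases, proxy ARP, etc.
        vips.append((
            vip.findtext("interface"),
            int(vip.findtext("vhid", "0")),
            int(vip.findtext("advskew", "0")),
            vip.findtext("subnet"),
        ))
    return vips

# Invented sample standing in for a backup taken from the secondary.
SAMPLE = """<pfsense>
  <virtualip>
    <vip><mode>carp</mode><interface>wan</interface><vhid>1</vhid>
         <advskew>100</advskew><subnet>198.51.100.10</subnet></vip>
    <vip><mode>ipalias</mode><interface>lan</interface>
         <subnet>192.0.2.5</subnet></vip>
  </virtualip>
</pfsense>"""

for iface, vhid, skew, addr in carp_vips(SAMPLE):
    # advskew 0 is what the primary is expected to advertise.
    role = "primary-style" if skew == 0 else "secondary-style"
    print(f"{iface} vhid={vhid} advskew={skew} {addr} ({role})")
```

If a config restored onto the intended primary still shows a non-zero advskew, it is still carrying the secondary's CARP settings.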
That's exactly what I did. What I was really asking in my OP is whether I should configure the unit I want to sync to with just enough IP info to perform the sync, rather than trying to restore the whole config on top of a fresh install.
For the record, I did not have sync enabled from secondary to primary until I attempted to restore the primary. They were never syncing to each other in a loop. I am troubleshooting now since it's after hours here. I restored the secondary's config to the primary again and unplugged the secondary. The primary is up, but it is not replying to pings on the VIPs. They seem non-functional, since the VPN tunnels are not coming up either. I think I actually have some type of VIP issue. Has something changed with VIPs since v2.3.4? I have a pretty complex config. I have VLANs on top of LAGG ports all connected to a Cisco 6509. I didn't really want to go through the effort of rebuilding all of that just to do a sync.
No. The VIPs are generally the same as they were back then.
You have to configure all of the LAGGs, interfaces and VLANs. Those do not sync.
I've given up for tonight and disabled CARP on the recovered unit. I did notice the VIP very briefly replying to pings after I used the reboot button in the GUI. About 8 pings got replies on the VIP before all of the IPs went dark. I remember troubleshooting an issue years ago with m0n0wall that had the same behavior. I'm assuming the OS was unloading a filter, which allowed things to work briefly. What is really bugging me is that I have a working config in v2.3.4 on the exact same hardware, yet when I restore it to v2.4.4 it doesn't behave the same. It doesn't even work as a standalone unit. I've been avoiding this architecture jump for a while now. At this point I don't know if I can upgrade without building the whole cluster from scratch. I guess my next move is to pull v2.3.4 and try to restore onto that. I'm pretty confident that will work, but it will leave me on EOL software.
There is nothing systemically wrong with moving an HA pair from 2.3 to 2.4.
There hasn't been enough information posted to even hazard a guess as to what is wrong.
You said you wanted to sync the config so you wouldn't have to:
I have a pretty complex config. I have VLANs on top of LAGG ports all connected to a Cisco 6509. I didn't really want to go through the effort of rebuilding all of that just to do a sync.
As I said, none of that is synced. You have to rebuild all of the interfaces, laggs, and VLANs in exactly the same order.
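Since the LAGGs, VLANs and interface assignments have to match by hand, it can help to diff just those sections of the two nodes' backups. The following is a rough sketch, not an official tool: the element names (laggs, vlans, interfaces and their children) are assumed from a typical pfSense config.xml, and the sample interface names are made up.

```python
import xml.etree.ElementTree as ET

def layout(config_xml):
    """Extract the LAGG, VLAN and assignment sections as ordered lists."""
    root = ET.fromstring(config_xml)
    ifaces = root.find("interfaces")
    return {
        "laggs": [(l.findtext("laggif"), l.findtext("members"))
                  for l in root.findall("./laggs/lagg")],
        "vlans": [(v.findtext("if"), v.findtext("tag"))
                  for v in root.findall("./vlans/vlan")],
        "assignments": [] if ifaces is None else
                       [(n.tag, n.findtext("if")) for n in ifaces],
    }

def layout_diff(a_xml, b_xml):
    """Name the sections that differ; list order counts as a difference."""
    a, b = layout(a_xml), layout(b_xml)
    return [section for section in a if a[section] != b[section]]

# Invented samples: same LAGG and VLANs, but WAN and LAN are swapped.
TEMPLATE = """<pfsense>
  <laggs><lagg><laggif>lagg0</laggif><members>igb0,igb1</members></lagg></laggs>
  <vlans>
    <vlan><if>lagg0</if><tag>10</tag></vlan>
    <vlan><if>lagg0</if><tag>20</tag></vlan>
  </vlans>
  <interfaces>
    <wan><if>{wan}</if></wan>
    <lan><if>{lan}</if></lan>
  </interfaces>
</pfsense>"""
PRIMARY = TEMPLATE.format(wan="lagg0.10", lan="lagg0.20")
SECONDARY = TEMPLATE.format(wan="lagg0.20", lan="lagg0.10")

print(layout_diff(PRIMARY, SECONDARY))  # sections that do not match
```

An empty result only means these three sections line up; it says nothing about rules, VIPs or anything else in the config.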
Can anyone tell me where the old versions are being stored these days?
I just put v2.3.5 on it and the VIPs work. There are a handful of other errors about curl and some other shared libraries, and the GUI seems pretty unstable. I find myself going to the console and restarting PHP after almost every change. I can't run with this version, but I wanted to step through each version until I find the one that breaks. However, I am having trouble locating older versions. It appears they have been moved or removed. I want to get onto a 64-bit version that works until things are fixed. What happened to all of the previous versions? It's been almost a decade since an upgrade broke something I was doing, but not having older versions in an archive is a real disadvantage. I have a copy of every 32-bit version I've ever installed, but I am trying to move into the 64-bit world.
So it looks like v2.4.1 is where my VIPs stop working. I was able to upgrade from 2.3.5 to 2.4.1 from the console, which is how I figured it out. I'm still bummed that I can't find all of the previous installers. I reinstalled 2.3.5 again and prevented the packages from downloading after I restored my config. It looks like the package install for pfBlocker was causing all of the curl and shared object errors.
At this point I think I am going to build the whole thing from scratch on the latest version and put everything back one piece at a time. Maybe I will start with traditional interfaces, then add the IPsec tunnels and all of the firewall rules, then the VLANs and LAGG, and after that CARP and the VIPs. Hopefully by then pfBlocker will be working. Its installer seemed broken when I was testing with it on the latest version.
If anyone has any more thoughts on how I can get to the bottom of this quickly, I would really appreciate it.
CARP VIPs work fine in 2.4.4 and 2.4.4-1. You are probably going to have to define what you are seeing more carefully/completely.
When I was troubleshooting, I was able to ping the VIPs from the ping screen within the web GUI, but I am unable to ping them from outside the firewall. I've tried pinging from the same subnet on both the LAN side and the WAN side, no dice. I've dropped most of the firewall rules to reduce the complexity and created specific echo reply rules. I've run packet captures and can see my ICMP requests come into the interface, but no reply goes out.
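The capture check described above (requests arrive, nothing goes back) can be sketched as a small matcher over decoded ICMP records. This is illustrative only: the Icmp record type and the sample addresses are hypothetical, and getting real capture output into this shape is left out.

```python
from typing import NamedTuple

class Icmp(NamedTuple):
    kind: str   # "request" or "reply"
    src: str
    dst: str
    ident: int  # ICMP identifier
    seq: int    # ICMP sequence number

def unanswered(packets):
    """Return echo requests that have no matching reply in the capture."""
    replies = {(p.src, p.dst, p.ident, p.seq)
               for p in packets if p.kind == "reply"}
    # A reply matches when source and destination are swapped and the
    # identifier and sequence number are the same.
    return [p for p in packets
            if p.kind == "request"
            and (p.dst, p.src, p.ident, p.seq) not in replies]

# Invented capture: the interface address answers, the VIP does not.
CAPTURE = [
    Icmp("request", "192.0.2.50", "192.0.2.1", 7, 1),
    Icmp("reply", "192.0.2.1", "192.0.2.50", 7, 1),
    Icmp("request", "192.0.2.50", "192.0.2.10", 8, 1),
    Icmp("request", "192.0.2.50", "192.0.2.10", 8, 2),
]

for p in unanswered(CAPTURE):
    print(f"no reply from {p.dst} for seq {p.seq}")
```

Requests that show up here for the VIP but not for the interface's own address would point at the VIP itself rather than at the path to the box.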
I've also tried changing the type of VIP I'm using, just to see if it would reply. Removing and recreating the VIPs did not work either.
I've solved the problem. It's very similar to bridge behavior I encountered in another installation. I only had VLANs defined on my LAGG. Once I created another interface that would be untagged on the LAGG, it picked up my native VLAN as expected, and all of the VIPs on the tagged interfaces started working.
Just for my own curiosity, I deleted the native interface I had created and rebooted. Everything still works. All in all, I must have just jiggled the handle.