[SOLVED] VIP fails over to slave but does not go back to master
-
I've searched the forum and the internet for 2 days already and I still cannot find the solution to my problem. This setup was working in 2.2; it was later upgraded to 2.3 and is now on 2.3.4.
I have two VMs (master and slave), each with 3 interfaces (WAN/SYNC/LAN). This was a working setup but it recently broke and I do not know why. The order of all interfaces matches, all settings on each interface match, and net.inet.carp.demotion = 0.
When both VMs are up, the master holds all VIPs and everything goes through the master. When I do "tcpdump -i <ifname> -ttt -n proto CARP", I can clearly see the master VIPs are broadcasting.
When I hit the reboot button on the master, everything fails over to the slave perfectly fine. However, when the master comes back online, there is no internet connectivity. I can see the master is broadcasting in tcpdump, and there are no errors in the system log. In the system log, I see the slave changing status back to Backup and the master's CARP status changing back to Master with this: carp: VHID 130@hn0: BACKUP -> MASTER (preempting a slower master).
The only way to get everything back in order is to turn off BOTH master and slave, then turn on the master first, followed by the slave, so the master has enough time to claim the master status.
Most of the cases I see here do not have failover working at all. In my case, the failing master does pass the baton to the slave. However, when the master returns, it claims to be master and the slave returns to backup, but the VIP is not reachable.
I think this only happens to my LAN VIP and my WAN VIPs are not affected, because I can connect to the VPN. It's just that I can't get anywhere because the LAN VIP (which acts as the gateway for the whole internal network) is not responding.
-
I just want to provide an update. I did a "wait" test and there seems to be a 15 minute timer somewhere: after I reboot the master and lose connectivity to the internet, connectivity resumes after about 15 minutes.
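For anyone who wants to repeat the "wait" test and put a number on the outage, here is a minimal Python sketch. It is only an illustration: the function names, the 5 second probe interval and the 30 minute watch window are my own assumptions, and it assumes a Unix-like client on the LAN where "ping -c 1" sends a single echo request.

```python
import subprocess
import time


def is_reachable(host):
    """Send one ICMP echo request; True if the host answered.

    Assumes a Unix-like `ping` that accepts `-c 1`.
    """
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def longest_outage(samples):
    """Longest span in seconds between a first failed probe and the
    next successful one.

    `samples` is an iterable of (timestamp_seconds, reachable) pairs in
    time order. An outage that never recovers is not counted.
    """
    longest = 0.0
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts  # outage starts at the first failed probe
        elif ok and down_since is not None:
            longest = max(longest, ts - down_since)
            down_since = None
    return longest


def watch(host, interval=5.0, duration=1800.0, probe=is_reachable):
    """Probe `host` every `interval` seconds for `duration` seconds."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), probe(host)))
        time.sleep(interval)
    return samples
```

Start watch() against the LAN VIP from an internal client just before rebooting the master, then pass the result to longest_outage(). If the figure keeps coming out near 900 seconds, that supports the idea of a fixed timer rather than a random recovery.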
-
Another thing I did: after connectivity resumed 15 minutes later, I tried to change something on the master and hit Apply. This caused the master VM to freeze. On the slave VM, I looked at tcpdump and nothing was being broadcast.
-
Here are the system logs when I applied new settings and caused both systems to panic:
Master VM:
06/01/2017 8:19 php-fpm 99885 /rc.filter_synchronize: Beginning XMLRPC sync to https://"SYNC.INTERFACE":443.
06/01/2017 8:19 check_reload_status Syncing firewall
06/01/2017 8:20 check_reload_status Syncing firewall
06/01/2017 8:20 check_reload_status Syncing firewall
06/01/2017 8:20 php-fpm 7014 /rc.filter_synchronize: Beginning XMLRPC sync to https://"SYNC.INTERFACE":443.
06/01/2017 8:20 php-fpm 15977 /rc.filter_synchronize: Beginning XMLRPC sync to https://"SYNC.INTERFACE":443.
06/01/2017 8:20 php-fpm 16151 /rc.filter_synchronize: Beginning XMLRPC sync to https://"SYNC.INTERFACE":443.
06/01/2017 8:20 php-fpm 99885 /rc.filter_synchronize: XMLRPC sync successfully completed with https://"SYNC.INTERFACE":443.
06/01/2017 8:20 php-fpm 7014 /rc.filter_synchronize: XMLRPC sync successfully completed with https://"SYNC.INTERFACE":443.
06/01/2017 8:21 php-fpm 15977 /rc.filter_synchronize: XML_RPC_Client: RPC server did not send response before timeout. 103
06/01/2017 8:21 php-fpm 15977 /rc.filter_synchronize: A communications error occurred while attempting XMLRPC sync with username admin https://"SYNC.INTERFACE":443.
06/01/2017 8:21 php-fpm 15977 /rc.filter_synchronize: New alert found: A communications error occurred while attempting XMLRPC sync with username admin https://"SYNC.INTERFACE":443.
06/01/2017 8:21 php-fpm 15977 /rc.filter_synchronize: Beginning XMLRPC sync to https://"SYNC.INTERFACE":443.
06/01/2017 8:21 php-fpm 16151 /rc.filter_synchronize: XML_RPC_Client: RPC server did not send response before timeout. 103
06/01/2017 8:21 php-fpm 16151 /rc.filter_synchronize: A communications error occurred while attempting XMLRPC sync with username admin https://"SYNC.INTERFACE":443.
06/01/2017 8:21 php-fpm 16151 /rc.filter_synchronize: New alert found: A communications error occurred while attempting XMLRPC sync with username admin https://"SYNC.INTERFACE":443.
06/01/2017 8:21 php-fpm 16151 /rc.filter_synchronize: Beginning XMLRPC sync to https://"SYNC.INTERFACE":443.
06/01/2017 8:21 kernel hn1: promiscuous mode disabled
06/01/2017 8:21 check_reload_status Carp backup event
06/01/2017 8:21 php-fpm 99885 /rc.filter_synchronize: XML_RPC_Client: RPC server did not send response before timeout. 103
06/01/2017 8:21 php-fpm 99885 /rc.filter_synchronize: A communications error occurred while attempting Filter sync with username admin https://"SYNC.INTERFACE":443.
06/01/2017 8:21 php-fpm 99885 /rc.filter_synchronize: New alert found: A communications error occurred while attempting Filter sync with username admin https://"SYNC.INTERFACE":443.
Slave VM:
06/01/2017 8:20 check_reload_status Syncing firewall
06/01/2017 8:20 check_reload_status Carp backup event
06/01/2017 8:20 kernel ifa_del_loopback_route: deletion failed: 3
06/01/2017 8:20 kernel carp: VHID 1@hn1: INIT -> BACKUP
06/01/2017 8:20 check_reload_status Carp backup event
06/01/2017 8:20 check_reload_status Carp backup event
06/01/2017 8:20 kernel ifa_del_loopback_route: deletion failed: 3
06/01/2017 8:20 kernel hn1: promiscuous mode disabled
06/01/2017 8:20 php-fpm 44358 /rc.carpbackup: HA cluster member "("LAN CARP VIP"@hn1): (LAN)" has resumed CARP state "BACKUP" for vhid 1
06/01/2017 8:20 php-fpm 44358 /rc.carpbackup: HA cluster member "("LAN CARP VIP"@hn1): (LAN)" has resumed CARP state "BACKUP" for vhid 1
06/01/2017 8:20 php-fpm 48909 /xmlrpc.php: waiting for pfsync...
06/01/2017 8:21 php-fpm 48909 /xmlrpc.php: pfsync done in 30 seconds.
06/01/2017 8:21 php-fpm 48909 /xmlrpc.php: Configuring CARP settings finalize...
06/01/2017 8:21 check_reload_status Syncing firewall
06/01/2017 8:21 kernel ifa_del_loopback_route: deletion failed: 3
06/01/2017 8:21 kernel carp: VHID 130@hn0: INIT -> BACKUP
06/01/2017 8:21 kernel hn1: promiscuous mode enabled
06/01/2017 8:21 kernel carp: VHID 1@hn1: INIT -> BACKUP
06/01/2017 8:21 kernel ifa_del_loopback_route: deletion failed: 3
06/01/2017 8:21 kernel ifa_del_loopback_route: deletion failed: 3
06/01/2017 8:21 kernel hn1: promiscuous mode disabled
06/01/2017 8:21 check_reload_status Carp backup event
06/01/2017 8:21 check_reload_status Carp backup event
06/01/2017 8:21 check_reload_status Carp backup event
06/01/2017 8:21 check_reload_status Carp backup event
06/01/2017 8:21 check_reload_status Carp backup event
06/01/2017 8:21 php-fpm 55680 /xmlrpc.php: waiting for pfsync...
06/01/2017 8:21 php-fpm 55680 /xmlrpc.php: pfsync done in 30 seconds.
06/01/2017 8:21 php-fpm 55680 /xmlrpc.php: Configuring CARP settings finalize...
06/01/2017 8:21 check_reload_status Syncing firewall
06/01/2017 8:21 kernel carp: VHID 130@hn0: INIT -> BACKUP
06/01/2017 8:21 kernel hn1: promiscuous mode enabled
06/01/2017 8:21 kernel carp: VHID 1@hn1: INIT -> BACKUP
06/01/2017 8:21 kernel ifa_del_loopback_route: deletion failed: 3
06/01/2017 8:21 kernel carp: VHID 135@hn0: INIT -> BACKUP
06/01/2017 8:21 check_reload_status Carp backup event
06/01/2017 8:21 kernel ifa_del_loopback_route: deletion failed: 3
06/01/2017 8:21 check_reload_status Carp backup event
06/01/2017 8:21 kernel ifa_del_loopback_route: deletion failed: 3
06/01/2017 8:21 check_reload_status Carp backup event
06/01/2017 8:21 kernel hn1: promiscuous mode disabled
06/01/2017 8:21 kernel ifa_del_loopback_route: deletion failed: 3
06/01/2017 8:21 check_reload_status Carp backup event
06/01/2017 8:21 check_reload_status Carp backup event
06/01/2017 8:21 check_reload_status Carp backup event
06/01/2017 8:21 check_reload_status Carp backup event
06/01/2017 8:21 php-fpm 55680 /rc.carpbackup: HA cluster member "(xxx.xxx.xxx.130@hn0): (WAN)" has resumed CARP state "BACKUP" for vhid 130
06/01/2017 8:21 php-fpm 55680 /rc.carpbackup: HA cluster member "("LAN CARP VIP"@hn1): (LAN)" has resumed CARP state "BACKUP" for vhid 1
06/01/2017 8:21 php-fpm 55680 /rc.carpbackup: HA cluster member "(xxx.xxx.xxx.130@hn0): (WAN)" has resumed CARP state "BACKUP" for vhid 130
06/01/2017 8:21 php-fpm 55680 /rc.carpbackup: HA cluster member "("LAN CARP VIP"@hn1): (LAN)" has resumed CARP state "BACKUP" for vhid 1
06/01/2017 8:22 php-fpm 63143 /xmlrpc.php: waiting for pfsync...
06/01/2017 8:22 kernel carp: VHID 97@hn0: BACKUP -> MASTER (master down)
06/01/2017 8:22 check_reload_status Carp master event
06/01/2017 8:22 php-fpm 63143 /xmlrpc.php: pfsync done in 30 seconds.
06/01/2017 8:22 php-fpm 63143 /xmlrpc.php: Configuring CARP settings finalize...
06/01/2017 8:22 check_reload_status Syncing firewall
06/01/2017 8:22 kernel carp: VHID 130@hn0: INIT -> BACKUP
06/01/2017 8:22 kernel hn1: promiscuous mode enabled
06/01/2017 8:22 kernel carp: VHID 1@hn1: INIT -> BACKUP
06/01/2017 8:22 kernel carp: VHID 135@hn0: INIT -> BACKUP
06/01/2017 8:22 check_reload_status Carp backup event
06/01/2017 8:22 check_reload_status Carp backup event
06/01/2017 8:22 check_reload_status Carp backup event
06/01/2017 8:22 kernel carp: VHID 140@hn0: INIT -> BACKUP
06/01/2017 8:22 check_reload_status Carp backup event
06/01/2017 8:22 check_reload_status Carp backup event
06/01/2017 8:22 check_reload_status Carp backup event
06/01/2017 8:22 kernel ifa_del_loopback_route: deletion failed: 3
06/01/2017 8:22 check_reload_status Carp backup event
06/01/2017 8:22 kernel ifa_del_loopback_route: deletion failed: 3
06/01/2017 8:22 check_reload_status Carp backup event
06/01/2017 8:22 kernel hn1: promiscuous mode disabled
06/01/2017 8:22 kernel ifa_del_loopback_route: deletion failed: 3
06/01/2017 8:22 check_reload_status Carp backup event
06/01/2017 8:22 kernel ifa_del_loopback_route: deletion failed: 3
06/01/2017 8:22 php-fpm 63143 /rc.carpbackup: HA cluster member "(xxx.xxx.xxx.130@hn0): (WAN)" has resumed CARP state "BACKUP" for vhid 130
06/01/2017 8:22 kernel hn0: promiscuous mode disabled
06/01/2017 8:22 php-fpm 63143 /rc.carpbackup: HA cluster member "("LAN CARP VIP"@hn1): (LAN)" has resumed CARP state "BACKUP" for vhid 1
06/01/2017 8:22 php-fpm 63143 /rc.carpbackup: HA cluster member "(xxx.xxx.xxx.135@hn0): (WAN)" has resumed CARP state "BACKUP" for vhid 135
06/01/2017 8:22 php-fpm 63143 /rc.carpbackup: HA cluster member "(xxx.xxx.xxx.130@hn0): (WAN)" has resumed CARP state "BACKUP" for vhid 130
06/01/2017 8:22 php-fpm 63143 /rc.carpbackup: HA cluster member "("LAN CARP VIP"@hn1): (LAN)" has resumed CARP state "BACKUP" for vhid 1
06/01/2017 8:22 php-fpm 63143 /rc.carpbackup: HA cluster member "(xxx.xxx.xxx.135@hn0): (WAN)" has resumed CARP state "BACKUP" for vhid 135
06/01/2017 8:22 php-fpm 63143 /rc.carpbackup: HA cluster member "(xxx.xxx.xxx.130@hn0): (WAN)" has resumed CARP state "BACKUP" for vhid 130
06/01/2017 8:22 php-fpm 63143 /rc.carpbackup: HA cluster member "("LAN CARP VIP"@hn1): (LAN)" has resumed CARP state "BACKUP" for vhid 1
06/01/2017 8:22 php-fpm 63143 /rc.carpbackup: HA cluster member "(xxx.xxx.xxx.135@hn0): (WAN)" has resumed CARP state "BACKUP" for vhid 135
06/01/2017 8:22 php-fpm 63143 /rc.carpbackup: HA cluster member "(xxx.xxx.xxx.140@hn0): (WAN)" has resumed CARP state "BACKUP" for vhid 140
06/01/2017 8:22 php-fpm 63143 /rc.carpbackup: HA cluster member "(xxx.xxx.xxx.130@hn0): (WAN)" has resumed CARP state "BACKUP" for vhid 130
06/01/2017 8:22 php-fpm 63143 /rc.carpbackup: HA cluster member "("LAN CARP VIP"@hn1): (LAN)" has resumed CARP state "BACKUP" for vhid 1
06/01/2017 8:22 php-fpm 63143 /rc.carpbackup: HA cluster member "(xxx.xxx.xxx.135@hn0): (WAN)" has resumed CARP state "BACKUP" for vhid 135
06/01/2017 8:22 php-fpm 63143 /rc.carpbackup: HA cluster member "(xxx.xxx.xxx.140@hn0): (WAN)" has resumed CARP state "BACKUP" for vhid 140
06/01/2017 8:22 php-fpm 44358 /xmlrpc.php: waiting for pfsync...
06/01/2017 8:22 php-fpm 44358 /xmlrpc.php: pfsync done in 0 seconds.
06/01/2017 8:22 php-fpm 44358 /xmlrpc.php: Configuring CARP settings finalize...
06/01/2017 8:22 php-fpm 48909 /xmlrpc.php: ROUTING: setting default route to xxx.xxx.xxx.129
06/01/2017 8:22 php-fpm 48909 /xmlrpc.php: Resyncing OpenVPN instances.
06/01/2017 8:22 check_reload_status Reloading filter
06/01/2017 8:23 kernel hn1: promiscuous mode enabled
06/01/2017 8:26 kernel hn1: promiscuous mode disabled
06/01/2017 8:26 php-cgi rc.initial.halt: Stopping all packages.
06/01/2017 8:26 shutdown power-down by root:
hn0 = WAN
hn1 = LAN
hn2 = SYNC
I noticed the interfaces had promiscuous mode disabled/enabled.
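With this much log noise it can help to pull out just the CARP state changes. Below is a rough Python sketch that does this. It assumes the kernel lines look like the ones pasted above ("carp: VHID 130@hn0: INIT -> BACKUP", optionally followed by a reason in parentheses). The function names are made up for this example and are not part of pfSense.

```python
import re

# Matches kernel lines such as:
#   carp: VHID 97@hn0: BACKUP -> MASTER (master down)
# The trailing "(reason)" part is optional.
CARP_RE = re.compile(
    r"carp: VHID (?P<vhid>\d+)@(?P<iface>\w+): "
    r"(?P<old>[A-Z]+) -> (?P<new>[A-Z]+)(?: \((?P<reason>[^)]*)\))?"
)


def carp_transitions(log_text):
    """Return every CARP state change found in the text, in order,
    as (vhid, interface, old_state, new_state, reason_or_None)."""
    return [
        (int(m["vhid"]), m["iface"], m["old"], m["new"], m["reason"])
        for m in CARP_RE.finditer(log_text)
    ]


def last_state(log_text):
    """Map (vhid, interface) to the most recent state seen in the log."""
    state = {}
    for vhid, iface, _old, new, _reason in carp_transitions(log_text):
        state[(vhid, iface)] = new
    return state


if __name__ == "__main__":
    sample = (
        "06/01/2017 8:22 kernel carp: VHID 97@hn0: BACKUP -> MASTER (master down)\n"
        "06/01/2017 8:22 kernel carp: VHID 1@hn1: INIT -> BACKUP\n"
    )
    for row in carp_transitions(sample):
        print(row)
```

Because it searches the whole text rather than going line by line, it works whether the log is pasted one entry per line or run together. Comparing last_state() from the master log with the slave log shows quickly whether both nodes think they are BACKUP (or both MASTER) for the same VHID.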
-
You may want to check in the Virtualization forum to confirm your hypervisor settings are correct.
-
I am just updating the status of the issue here. From what I remember, when it was working, these VMs were on Hyper-V hosts of the same version. However, I moved the slave to a new host a few months ago, so the two VMs ended up on different versions of Hyper-V (one 2016 and one 2012R2). I have now moved the master to the new host as well, and failover and failback work fine.
I am still not sure whether it is working because both VMs now reside on the same host or because the host versions now match. I will upgrade one of my older nodes to 2016 and repeat the test.
I will report back once I have an update.
-
Just reporting back my success.
I successfully brought up a temporary node on the same version as my other node, moved the slave VM over and tested the failover. Both ways worked.
I later had a look at the event logs on my host and saw that the integration software on the VM was incompatible.
All this trouble, and it was because the VM on that node didn't have the right version of the integration software.
I hope this can help others too. If you are running pfSense on a VM, make sure you check the integration software and have the correct version installed. When you migrate back and forth, you can lose track of the software version, and it may not be compatible with the host's version!
Thank you.