HA Cluster secondary fw primary interface in master status after upgrade
-
Twice now, after upgrading the secondary firewall, I have seen the secondary come up as MASTER for the primary interface (all other interfaces are in BACKUP status on the secondary). The primary firewall shows all interfaces as MASTER. The primary firewall is still on the previous 2.1.1 snapshot from Feb 19th. The last time I upgraded the secondary, I just disabled and re-enabled CARP on the secondary; it corrected itself, and failovers back and forth worked fine after that.
Anyone else seeing similar behavior? I am using ET2 gigabit cards on a Dell R320. This is a test environment using Dell PowerConnect switches for the interfaces.
Primary:
2.1.1-PRERELEASE (amd64)
built on Wed Feb 19 00:11:10 EST 2014
FreeBSD 8.3-RELEASE-p14
Secondary:
2.1.1-PRERELEASE (amd64)
built on Sat Mar 15 12:10:39 EDT 2014
FreeBSD 8.3-RELEASE-p14
-
More detailed info…
Both firewalls appear to be seeing each other's CARP traffic on the WAN interface, which is the only interface in MASTER status on both firewalls. The clocks are off by about 16 seconds between the firewalls, but from my limited understanding of the protocol that should not matter for CARP.
I am not using vlans on any interface and this is a single WAN gateway setup. Not using ipv6.
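A rough way to confirm the "both nodes are advertising" symptom from a capture: tcpdump decodes CARP as VRRPv2, with vrid being the CARP vhid and prio the advskew, and in a healthy pair normally only the master advertises a given vhid. The sketch below is only an illustration; the function names are made up here, and the two sample lines are copied from the captures in this post.

```python
import re
from collections import defaultdict

# Matches tcpdump lines such as:
# 12:08:41.114040 IP x.x.x.38 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, ...
ADVERT_RE = re.compile(
    r"^\S+ IP (?P<src>\S+) > 224\.0\.0\.18: .*?vrid (?P<vrid>\d+), prio (?P<prio>\d+)"
)


def advertisers_by_vrid(capture_text):
    """Map each vrid to the set of (source, prio) pairs seen advertising it."""
    seen = defaultdict(set)
    for line in capture_text.splitlines():
        match = ADVERT_RE.match(line.strip())
        if match:
            seen[int(match.group("vrid"))].add(
                (match.group("src"), int(match.group("prio")))
            )
    return dict(seen)


def contested_vrids(capture_text):
    """Return the vrids that more than one source address is advertising."""
    contested = {}
    for vrid, pairs in advertisers_by_vrid(capture_text).items():
        if len({src for src, _prio in pairs}) > 1:
            contested[vrid] = sorted(pairs)
    return contested


# Two lines taken from the primary firewall's capture in this post.
SAMPLE = """\
12:08:40.281153 IP x.x.x.37 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
12:08:41.114040 IP x.x.x.38 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
"""

if __name__ == "__main__":
    for vrid, pairs in contested_vrids(SAMPLE).items():
        print("vrid", vrid, "is advertised by", pairs)
```

Any vrid this reports has two nodes that both believe they are master, which matches what the captures show for vrid 1.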
x.x.x.37 is primary firewall external ip
x.x.x.38 is secondary firewall external ip
Primary firewall:
12:08:40.281153 IP x.x.x.37 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
12:08:41.114040 IP x.x.x.38 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
12:08:41.282143 IP x.x.x.37 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
12:08:42.283144 IP x.x.x.37 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
12:08:42.506052 IP x.x.x.38 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
12:08:43.284143 IP x.x.x.37 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
12:08:43.898062 IP x.x.x.38 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
igb0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO>
ether xxx:x
inet x.x.x.37 netmask 0xffffffe0 broadcast x.x.x.63
inet6 fe80::92e2:baff:fe3b:c190%igb0 prefixlen 64 scopeid 0x1
nd6 options=1<PERFORMNUD>
media: Ethernet autoselect (1000baseT <full-duplex>)
status: active
net.inet.ip.same_prefix_carp_only: 0
net.inet.carp.allow: 1
net.inet.carp.preempt: 1
net.inet.carp.log: 1
net.inet.carp.arpbalance: 0
net.inet.carp.suppress_preempt: 0
net.link.ether.inet.carp_mac: 0
Secondary firewall:
12:08:57.980790 IP x.x.x.37 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
12:08:58.203671 IP x.x.x.38 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
12:08:58.981795 IP x.x.x.37 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
12:08:59.595691 IP x.x.x.38 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
12:08:59.982802 IP x.x.x.37 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
12:09:00.983814 IP x.x.x.37 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, authtype none, intvl 1s, length 36
12:09:00.987714 IP x.x.x.38 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 100, authtype none, intvl 1s, length 36
igb0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO>
ether xxx:x
inet x.x.x.38 netmask 0xffffffe0 broadcast x.x.x.63
inet6 fe80::92e2:baff:fe39:6398%igb0 prefixlen 64 scopeid 0x1
nd6 options=1<PERFORMNUD>
media: Ethernet autoselect (1000baseT <full-duplex>)
status: active
net.inet.ip.same_prefix_carp_only: 0
net.inet.carp.allow: 1
net.inet.carp.preempt: 1
net.inet.carp.log: 1
net.inet.carp.arpbalance: 0
net.inet.carp.suppress_preempt: 0
net.link.ether.inet.carp_mac: 0
-
Actually, the time does matter, since it is used when calculating how far apart the two nodes are.
It is usually not significant enough to affect a scenario like this, though.
-
I synced the time on the servers and, as expected, that didn't change anything.
I am leaving it stuck in this state in the hope of figuring out the issue, rather than disabling CARP and enabling it again (which fixed it in the past) and then worrying that it will happen in a production environment. I don't see any logs showing CARP being denied. There seem to be some built-in rules that allow it. I added my own as well, to be safe in case the built-in ones disappear at some point in the future.
Built in rules:
block in log quick proto carp from (self) to any
pass quick proto carp
Custom rules:
pass in quick on $SYNCIF proto carp from y.y.y.2/24 to 224.0.0.0/24 keep state label "USER_RULE"
Some more info, though it probably doesn't help. I have some IP Aliases on the CARP interface that I use for services on the WAN.
Primary firewall: (36 is the main virtual IP of the wan cluster interface. The others are for services that get forwarded to the back end).
wan_vip1: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
inet x.x.x.40 netmask 0xffffffe0
inet x.x.x.41 netmask 0xffffffe0
inet x.x.x.42 netmask 0xffffffe0
inet x.x.x.43 netmask 0xffffffe0
inet x.x.x.44 netmask 0xffffffe0
inet x.x.x.36 netmask 0xffffffe0
carp: MASTER vhid 1 advbase 1 advskew 0
Secondary firewall: (36 is the main virtual IP of the wan cluster interface. The others are for services that get forwarded to the back end).
wan_vip1: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
inet x.x.x.36 netmask 0xffffffe0
inet x.x.x.39 netmask 0xffffffe0
inet x.x.x.40 netmask 0xffffffe0
inet x.x.x.41 netmask 0xffffffe0
inet x.x.x.42 netmask 0xffffffe0
inet x.x.x.43 netmask 0xffffffe0
inet x.x.x.44 netmask 0xffffffe0
carp: MASTER vhid 1 advbase 1 advskew 100
EDIT: made it clear that the other IPs are IP Aliases.
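For what it's worth, the advbase and advskew values above line up with the spacing of the advertisements in the earlier captures. As I understand CARP, a node advertises roughly every advbase + advskew/256 seconds, and tcpdump shows advskew as "prio" and advbase as "intvl". The small sketch below checks that arithmetic against timestamps copied from the primary firewall's capture; the helper names are just for illustration.

```python
def carp_advert_interval(advbase, advskew):
    """Nominal seconds between CARP advertisements: advbase + advskew/256."""
    return advbase + advskew / 256.0


def to_seconds(stamp):
    """Convert a tcpdump HH:MM:SS.ffffff timestamp to seconds since midnight."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def gaps(stamps):
    """Differences in seconds between consecutive timestamps."""
    times = [to_seconds(s) for s in stamps]
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Timestamps copied from the primary firewall's capture in this thread.
PRIMARY_STAMPS = [    # x.x.x.37, advskew 0
    "12:08:40.281153", "12:08:41.282143", "12:08:42.283144", "12:08:43.284143",
]
SECONDARY_STAMPS = [  # x.x.x.38, advskew 100
    "12:08:41.114040", "12:08:42.506052", "12:08:43.898062",
]

if __name__ == "__main__":
    for name, skew, stamps in (
        ("primary", 0, PRIMARY_STAMPS),
        ("secondary", 100, SECONDARY_STAMPS),
    ):
        nominal = carp_advert_interval(1, skew)
        observed = ", ".join("%.3f" % g for g in gaps(stamps))
        print("%s: nominal %.3fs, observed %s" % (name, nominal, observed))
```

Both nodes are advertising at their own configured rate, which fits each one independently believing it is master rather than one of them misreading the other's settings.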
-
I just noticed that 39 is missing from the primary's ifconfig output for wan_vip1, even though it is in the primary firewall's config. I am using pfsync to sync the VIPs to the secondary, and it is of course present on the secondary. I wonder if that is related somehow.
-
I had to disable CARP on the primary and enable it again. Once I did that, the 39 address appeared on the wan_vip1 interface. CARP on the primary went to MASTER for all interfaces as it did before, but now the secondary has all interfaces in BACKUP status as it is supposed to.
It seems the issue is related to the primary not getting the 39 IP Alias assigned to the wan_vip1 interface for some reason. It was apparently not the secondary that was at fault.
This brings up a few things to worry about. If an IP Alias is missing on the CARP interface of the primary server, then both primary and secondary will be MASTER on that one interface, and the rest of the interfaces will be MASTER on the server missing the IP Alias. I thought one of the sysctl settings forced them all to be MASTER on one server rather than mixed up as they were on the secondary.
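A quick way to spot this kind of mismatch is to diff the address lists on the two nodes. My understanding (not something verified in this thread) is that CARP advertisements carry a hash that covers the addresses configured on the carp interface, so when the lists differ each node disregards the other's advertisements and both stay MASTER. The sketch below is only an illustration with made-up function names; the sample text is the pre-fix ifconfig output posted earlier.

```python
import re

# Matches ifconfig address lines such as "inet x.x.x.36 netmask 0xffffffe0".
INET_RE = re.compile(r"^\s*inet (\S+) netmask")


def inet_addresses(ifconfig_text):
    """Collect the IPv4 addresses listed in one interface's ifconfig output."""
    addresses = set()
    for line in ifconfig_text.splitlines():
        match = INET_RE.match(line)
        if match:
            addresses.add(match.group(1))
    return addresses


def alias_diff(primary_text, secondary_text):
    """Return (only_on_primary, only_on_secondary) address sets."""
    primary = inet_addresses(primary_text)
    secondary = inet_addresses(secondary_text)
    return primary - secondary, secondary - primary


# wan_vip1 as posted earlier, before CARP was disabled and re-enabled.
PRIMARY = """\
wan_vip1: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
inet x.x.x.40 netmask 0xffffffe0
inet x.x.x.41 netmask 0xffffffe0
inet x.x.x.42 netmask 0xffffffe0
inet x.x.x.43 netmask 0xffffffe0
inet x.x.x.44 netmask 0xffffffe0
inet x.x.x.36 netmask 0xffffffe0
carp: MASTER vhid 1 advbase 1 advskew 0
"""
SECONDARY = """\
wan_vip1: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
inet x.x.x.36 netmask 0xffffffe0
inet x.x.x.39 netmask 0xffffffe0
inet x.x.x.40 netmask 0xffffffe0
inet x.x.x.41 netmask 0xffffffe0
inet x.x.x.42 netmask 0xffffffe0
inet x.x.x.43 netmask 0xffffffe0
inet x.x.x.44 netmask 0xffffffe0
carp: MASTER vhid 1 advbase 1 advskew 100
"""

if __name__ == "__main__":
    only_primary, only_secondary = alias_diff(PRIMARY, SECONDARY)
    print("only on primary:", sorted(only_primary))
    print("only on secondary:", sorted(only_secondary))
```

Comparing sets rather than raw text means the different ordering of the addresses on the two nodes does not show up as a false difference.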
The real question is what caused the primary to not get the IP Alias assigned. The config is obviously correct because disabling carp and re-enabling it fixed it.
I will reboot the systems a few times and force a reinstall of pfsense to see if I can duplicate the behavior.
Current status now that it is working…
Primary server: (Carp interfaces all in MASTER status)
wan_vip1: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
inet x.x.x.36 netmask 0xffffffe0
inet x.x.x.39 netmask 0xffffffe0
inet x.x.x.40 netmask 0xffffffe0
inet x.x.x.41 netmask 0xffffffe0
inet x.x.x.42 netmask 0xffffffe0
inet x.x.x.43 netmask 0xffffffe0
inet x.x.x.44 netmask 0xffffffe0
carp: MASTER vhid 1 advbase 1 advskew 0
Secondary server: (carp interfaces all in BACKUP status)
wan_vip1: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
inet x.x.x.36 netmask 0xffffffe0
inet x.x.x.39 netmask 0xffffffe0
inet x.x.x.40 netmask 0xffffffe0
inet x.x.x.41 netmask 0xffffffe0
inet x.x.x.42 netmask 0xffffffe0
inet x.x.x.43 netmask 0xffffffe0
inet x.x.x.44 netmask 0xffffffe0
carp: BACKUP vhid 1 advbase 1 advskew 100
-
I remember fixing a VIP-related config problem through the GUI a few weeks ago (I haven't looked at the firewall since then). I wonder if changing the VIP to a different interface (the correct one) caused the issue. This config was copied from another site, and I missed changing the interface for one (maybe two?) of the VIPs, probably the 39 one. (The previous config had IP Aliases on a CARP IP that did not exist on the new firewalls, because I had changed it to the new site's IPs.) I wonder if changing the interface didn't work well with how pfSense removes the CARP IP from the primary interface and moves it to the VIP alias, and whether that has issues under some circumstances. Bringing CARP down and up fixed it because everything was brought up from scratch.
What makes me think this is likely is that, if you look at the ifconfig output from before I made the change, you will notice that some of the addresses on the primary are out of order and at the end. After a fresh reboot they are in order.
The bad part is that if you are in production and change CARP settings (from the wrong interface to the correct one), you can break CARP until you disable and re-enable it. This assumes, of course, that my logic is correct and that this is what really happened.
I will be testing this theory. It will be easy to do.
-
I think this is related to changing a CARP IP that has IP Aliases assigned to it. If you change a CARP IP, all IP Aliases using that CARP IP stay pointed at the old interface according to the GUI, and the IP Aliases are obviously broken after that. If you then change the CARP IP back to the original IP, the IP Aliases still don't work until you edit each one and set its assigned interface to the same CARP interface again, even though the GUI shows the IP Aliases assigned to the correct CARP IP (which I had changed to a different IP and then changed back). Until you do this, ifconfig shows no IP Aliases on the CARP interface.
If you change the CARP IP to a different IP, apply changes, change it back to the original IP and save changes, you will see that CARP is MASTER for all interfaces on the primary server, while on the secondary the interface whose CARP IP you changed is MASTER and the rest are BACKUP. You end up with a GUI that looks properly set up, but the IP Alias is somehow pointed at something invalid after the CARP IP was changed and then changed back.
I think it would be great if, when you edit a CARP IP that has IP Aliases assigned to it, the GUI also updated those IP Aliases. Just looking at the GUI you wouldn't know anything was wrong, except that CARP would be in the mixed-up state I described.
Most people will probably not change to a different CARP IP and then back to the original one, so they would notice that the interface IP listed on the IP Alias differs from the new CARP IP. If they do change it back, though, they wouldn't know that they have to edit each IP Alias and reassign it to the same interface again, even though the IPs are the same.
-
Looking at the IP Aliases in the raw XML file, it appears that they would probably work if I rebooted the firewall or disabled and re-enabled CARP on the primary. They are assigned to the CARP IP as wan_vip1. I just tested this and it does work after doing that, so…
I guess the problem in the scenario above is that changing the CARP IP does not force the aliases assigned to it to be added back until you manually edit each alias or disable and re-enable CARP (or reboot).
There is no need to edit each alias when you change a CARP IP, because the config goes by the interface name and not by IP.
To sum up: we need IP Aliases assigned to a CARP IP to be added back when you change that CARP IP. Otherwise you can cause the secondary server to get MASTER status for the interface you changed. You end up with the primary in MASTER status for all interfaces, and the secondary server as MASTER for the interface whose CARP IP you changed and BACKUP for all the rest.
EDIT: I am sure this is easier said than done, and I am not sure whether there are still other issues. I looked at the secondary after making a bunch of changes on the primary, and the secondary still had an old IP that had been changed on the primary. The secondary had one more IP on wan_vip1 than the primary. Disabling and re-enabling CARP on the secondary fixed it.
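Since the GUI gives no hint that anything is wrong, the mixed state summed up above could be caught by comparing the carp status lines from ifconfig on both nodes. The following is only a sketch: the function names are invented, and lan_vip2 is a hypothetical second interface added to the sample to show the "MASTER on one, BACKUP on the rest" pattern.

```python
import re

IFACE_RE = re.compile(r"^(\S+): flags=")          # interface header line
CARP_RE = re.compile(r"^\s*carp: (\w+) vhid \d+")  # e.g. "carp: MASTER vhid 1 ..."


def carp_states(ifconfig_text):
    """Map each carp interface name to its reported state (MASTER, BACKUP, ...)."""
    states = {}
    current = None
    for line in ifconfig_text.splitlines():
        header = IFACE_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = CARP_RE.match(line)
        if status and current is not None:
            states[current] = status.group(1)
    return states


def dual_masters(primary_text, secondary_text):
    """Names of carp interfaces that report MASTER on both nodes."""
    primary = carp_states(primary_text)
    secondary = carp_states(secondary_text)
    return sorted(
        name
        for name, state in primary.items()
        if state == "MASTER" and secondary.get(name) == "MASTER"
    )


# wan_vip1 mirrors the broken state from this thread; lan_vip2 is hypothetical.
PRIMARY_IFCONFIG = """\
wan_vip1: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
inet x.x.x.36 netmask 0xffffffe0
carp: MASTER vhid 1 advbase 1 advskew 0
lan_vip2: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
carp: MASTER vhid 2 advbase 1 advskew 0
"""
SECONDARY_IFCONFIG = """\
wan_vip1: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
inet x.x.x.36 netmask 0xffffffe0
carp: MASTER vhid 1 advbase 1 advskew 100
lan_vip2: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
carp: BACKUP vhid 2 advbase 1 advskew 100
"""

if __name__ == "__main__":
    print("MASTER on both nodes:", dual_masters(PRIMARY_IFCONFIG, SECONDARY_IFCONFIG))
```

Run after any VIP change, a check like this would flag the affected interface without having to notice it on the status page.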