CARP broken after upgrading pfSense to 2.4.5-RELEASE (please help)
-
Greetings!
On 5 September 2019 I put two identical hardware units into production as a pfSense HA pair.
Both systems were clean installs of pfSense 2.4.4-RELEASE-p3.
I configured CARP and everything went well.
One WAN on igb3
lagg0 configured as failover on ix0 and ix1
Many VLANs configured as well
No plugins installed
Everything worked perfectly for 7 months, until the upgrade.
If the master went down, the failover node became master, and when the master came back, the failover node returned to backup. A day ago, on 2 May 2020, I upgraded the master to pfSense 2.4.5-RELEASE first, and this issue appeared.
The update completed without any errors, but the failover node is now master.
After the master node's upgrade, the LAN and WAN interfaces show normal CARP status, but all VLAN interfaces are in INIT state. After a while I upgraded the failover node to pfSense 2.4.5-RELEASE as well, and that upgrade also succeeded.
Even so, the master node's VLANs remain in INIT, while the backup node shows all interfaces as MASTER and the network is working. What went wrong? Why are all of the master node's VLANs stuck in INIT and unable to resume their MASTER state? How can I fix it?
I had a configuration backup from before the upgrade, so I also clean-installed pfSense 2.4.5-RELEASE on the master node and restored that configuration, but nothing changed: all VLANs are in INIT, while LAN and WAN are MASTER.
I have put the master node into persistent CARP maintenance mode.
The failover node is now master and the network is working perfectly. If I disable and re-enable a VLAN interface while in maintenance mode, the CARP state for that VLAN changes from INIT to BACKUP. So I did the same for all VLANs and exited maintenance mode. The network stopped working, so I had to force a reset.
After the reset everything was back to normal, and the backup node resumed the MASTER state.
Note: if I disable and re-enable a VLAN interface in maintenance mode, its CARP state changes from INIT to BACKUP, but after a restart it is back to INIT.
I faced no such issues in 7 months; this started only after the upgrade.
Any guidance would be really appreciated.
Thank you.
*** Welcome to pfSense 2.4.5-RELEASE (amd64) on DCGL_FIR_001 ***
SOFTCALL (wan) -> igb0 -> v4: 192.168.0.2/24
LAN (lan) -> lagg0 -> v4: 172.16.1.2/24
SYNC (opt1) -> re0 -> v4: 10.1.10.2/24
VLAN2 (opt3) -> lagg0.2 -> v4: 172.16.2.252/24
VLAN3 (opt4) -> lagg0.3 -> v4: 172.16.3.252/24
VLAN4 (opt5) -> lagg0.4 -> v4: 172.16.4.252/24
VLAN5 (opt6) -> lagg0.5 -> v4: 172.16.5.252/24
VLAN9 (opt7) -> lagg0.9 -> v4: 172.16.9.252/24
VLAN10 (opt8) -> lagg0.10 -> v4: 172.16.10.252/24
VLAN11 (opt9) -> lagg0.11 -> v4: 172.16.11.252/24
VLAN12 (opt10) -> lagg0.12 -> v4: 172.16.12.252/24
VLAN13 (opt11) -> lagg0.13 -> v4: 172.16.13.252/24
VLAN14 (opt12) -> lagg0.14 -> v4: 172.16.14.252/24
VLAN15 (opt13) -> lagg0.15 -> v4: 172.16.15.252/24
VLAN16 (opt14) -> lagg0.16 -> v4: 172.16.16.252/24
VLAN17 (opt15) -> lagg0.17 -> v4: 172.16.17.252/24
VLAN18 (opt16) -> lagg0.18 -> v4: 172.16.18.252/24
VLAN19 (opt17) -> lagg0.19 -> v4: 172.16.19.252/24
VLAN20 (opt18) -> lagg0.20 -> v4: 172.16.20.252/24
VLAN21 (opt19) -> lagg0.21 -> v4: 172.16.21.252/24
VLAN22 (opt20) -> lagg0.22 -> v4: 172.16.22.252/24
VLAN50 (opt21) -> lagg0.50 -> v4: 192.168.50.252/24
VLAN51 (opt22) -> lagg0.51 -> v4: 192.168.51.252/24
VLAN52 (opt23) -> lagg0.52 -> v4: 192.168.52.252/24
VLAN53 (opt25) -> lagg0.53 -> v4: 192.168.53.252/24
VLAN6 (opt26) -> lagg0.6 -> v4: 172.16.6.252/24
WAN (opt27) -> igb3 -> v4: 192.168.10.2/24

[2.4.5-RELEASE][root@DCGL_FIR_001.datarpgx.firewall]/root: sysctl -a | grep carp
device carp
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 0
net.inet.carp.demotion: 0
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.dscp: 56
net.inet.carp.allow: 1
net.pfsync.carp_demotion_factor: 0

*** Welcome to pfSense 2.4.5-RELEASE (amd64) on DCGL_FIR_002 ***
SOFTCALL (wan) -> igb0 -> v4: 192.168.0.3/24
LAN (lan) -> lagg0 -> v4: 172.16.1.3/24
SYNC (opt1) -> re0 -> v4: 10.1.10.3/24
VLAN2 (opt3) -> lagg0.2 -> v4: 172.16.2.253/24
VLAN3 (opt4) -> lagg0.3 -> v4: 172.16.3.253/24
VLAN4 (opt5) -> lagg0.4 -> v4: 172.16.4.253/24
VLAN5 (opt6) -> lagg0.5 -> v4: 172.16.5.253/24
VLAN9 (opt7) -> lagg0.9 -> v4: 172.16.9.253/24
VLAN10 (opt8) -> lagg0.10 -> v4: 172.16.10.253/24
VLAN11 (opt9) -> lagg0.11 -> v4: 172.16.11.253/24
VLAN12 (opt10) -> lagg0.12 -> v4: 172.16.12.253/24
VLAN13 (opt11) -> lagg0.13 -> v4: 172.16.13.253/24
VLAN14 (opt12) -> lagg0.14 -> v4: 172.16.14.253/24
VLAN15 (opt13) -> lagg0.15 -> v4: 172.16.15.253/24
VLAN16 (opt14) -> lagg0.16 -> v4: 172.16.16.253/24
VLAN17 (opt15) -> lagg0.17 -> v4: 172.16.17.253/24
VLAN18 (opt16) -> lagg0.18 -> v4: 172.16.18.253/24
VLAN19 (opt17) -> lagg0.19 -> v4: 172.16.19.253/24
VLAN20 (opt18) -> lagg0.20 -> v4: 172.16.20.253/24
VLAN21 (opt19) -> lagg0.21 -> v4: 172.16.21.253/24
VLAN22 (opt20) -> lagg0.22 -> v4: 172.16.22.253/24
VLAN50 (opt21) -> lagg0.50 -> v4: 192.168.50.253/24
VLAN51 (opt22) -> lagg0.51 -> v4: 192.168.51.253/24
VLAN52 (opt23) -> lagg0.52 -> v4: 192.168.52.253/24
VLAN53 (opt25) -> lagg0.53 -> v4: 192.168.53.253/24
VLAN6 (opt26) -> lagg0.6 -> v4: 172.16.6.253/24
WAN (opt27) -> igb3 -> v4: 192.168.10.3/24

[2.4.5-RELEASE][root@DCGL_FIR_002.datarpgx.firewall]/root: sysctl -a | grep carp
device carp
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 0
net.inet.carp.demotion: 0
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.dscp: 56
net.inet.carp.allow: 1
net.pfsync.carp_demotion_factor: 0

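The two `sysctl` dumps above are identical. When comparing longer dumps between HA nodes, a short script is less error-prone than reading them side by side. This is a minimal sketch in Python, assuming each node's output was saved as plain text; `parse_sysctl` and `diff_sysctl` are hypothetical helpers, not part of pfSense.

```python
from __future__ import annotations

import re

# sysctl names look like "net.inet.carp.preempt"; this pattern skips shell
# prompts and lines such as "device carp" that are not name/value pairs.
NAME = re.compile(r"[\w.%]+")


def parse_sysctl(text: str) -> dict[str, str]:
    """Turn 'name: value' lines from `sysctl -a` into a dict."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(": ")
        if sep and NAME.fullmatch(name):
            values[name] = value.strip()
    return values


def diff_sysctl(a: str, b: str) -> dict[str, tuple[str | None, str | None]]:
    """Return {name: (value_on_a, value_on_b)} for every name that differs."""
    left, right = parse_sysctl(a), parse_sysctl(b)
    return {
        name: (left.get(name), right.get(name))
        for name in sorted(left.keys() | right.keys())
        if left.get(name) != right.get(name)
    }


MASTER = """\
device carp
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 0
net.inet.carp.demotion: 0
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.dscp: 56
net.inet.carp.allow: 1
net.pfsync.carp_demotion_factor: 0
"""
BACKUP = MASTER  # the two dumps in this post are identical

if __name__ == "__main__":
    print(diff_sysctl(MASTER, BACKUP) or "no differences")
```

An empty result means the global CARP tunables match on both nodes, so any difference has to be looked for in the per-interface state instead.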
Update:
Any change made on the master node is synced to the backup node.

Master node (ifconfig):
lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1280
options=8400b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO>
ether 90:e2:ba:89:b3:3c
inet 172.16.1.2 netmask 0xffffff00 broadcast 172.16.1.255
inet 172.16.1.1 netmask 0xffffff00 broadcast 172.16.1.255 vhid 2
inet6 fe80::1:1%lagg0 prefixlen 64 scopeid 0xc
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect
status: active
carp: MASTER vhid 2 advbase 1 advskew 0
groups: lagg
laggproto failover lagghash l2,l3,l4
laggport: ix0 flags=5<MASTER,ACTIVE>
laggport: ix1 flags=0<>

lagg0.3: flags=8903<UP,BROADCAST,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
ether 90:e2:ba:89:b3:3c
inet6 fe80::92e2:baff:fe89:b33c%lagg0.3 prefixlen 64 tentative scopeid 0xe
inet 172.16.3.252 netmask 0xffffff00 broadcast 172.16.3.255
inet 172.16.3.254 netmask 0xffffff00 broadcast 172.16.3.255 vhid 4
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
vlan: 0 vlanpcp: 0 parent interface: <none>
carp: INIT vhid 4 advbase 1 advskew 0
groups: vlan

lagg0.6: flags=8903<UP,BROADCAST,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
ether 90:e2:ba:89:b3:3c
inet6 fe80::92e2:baff:fe89:b33c%lagg0.6 prefixlen 64 tentative scopeid 0x23
inet 172.16.6.252 netmask 0xffffff00 broadcast 172.16.6.255
inet 172.16.6.254 netmask 0xffffff00 broadcast 172.16.6.255 vhid 29
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
vlan: 0 vlanpcp: 0 parent interface: <none>
carp: INIT vhid 29 advbase 1 advskew 0
groups: vlan

Backup node (ifconfig):
lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=8500b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO>
ether 90:e2:ba:89:ad:20
inet 172.16.1.3 netmask 0xffffff00 broadcast 172.16.1.255
inet 172.16.1.1 netmask 0xffffff00 broadcast 172.16.1.255 vhid 2
inet6 fe80::1:1%lagg0 prefixlen 64 scopeid 0xc
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect
status: active
carp: MASTER vhid 2 advbase 1 advskew 100
groups: lagg
laggproto failover lagghash l2,l3,l4
laggport: ix0 flags=5<MASTER,ACTIVE>
laggport: ix1 flags=0<>

lagg0.3: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
ether 90:e2:ba:89:ad:20
inet6 fe80::92e2:baff:fe89:ad20%lagg0.3 prefixlen 64 scopeid 0xe
inet 172.16.3.253 netmask 0xffffff00 broadcast 172.16.3.255
inet 172.16.3.254 netmask 0xffffff00 broadcast 172.16.3.255 vhid 4
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect
status: active
vlan: 3 vlanpcp: 0 parent interface: lagg0
carp: MASTER vhid 4 advbase 1 advskew 100
groups: vlan

lagg0.6: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
ether 90:e2:ba:89:ad:20
inet6 fe80::92e2:baff:fe89:ad20%lagg0.6 prefixlen 64 scopeid 0x23
inet 172.16.6.253 netmask 0xffffff00 broadcast 172.16.6.255
inet 172.16.6.254 netmask 0xffffff00 broadcast 172.16.6.255 vhid 29
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect
status: active
vlan: 6 vlanpcp: 0 parent interface: lagg0
carp: MASTER vhid 29 advbase 1 advskew 100
groups: vlan

The comparison shows that on the master node, every VLAN has this same setting:
vlan: 0 vlanpcp: 0 parent interface: <none>
whereas on the backup node each VLAN has its own tag and lagg0 as parent, which looks correct:
vlan: 3 vlanpcp: 0 parent interface: lagg0
vlan: 6 vlanpcp: 0 parent interface: lagg0

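With more than 20 VLANs per node, this check can be run over the full output instead of by eye. This is a minimal sketch in Python, assuming the `ifconfig` output of a node has been copied into a text string or file; `parse_ifconfig` and `find_broken_vlans` are hypothetical helpers, and the regular expressions only cover the FreeBSD `ifconfig` line formats quoted in this post.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# A line such as "lagg0.3: flags=8903<...>" starts a new interface block.
HEADER = re.compile(r"^([\w.]+): flags=")
# e.g. "vlan: 3 vlanpcp: 0 parent interface: lagg0"
VLAN = re.compile(r"vlan: (\d+) vlanpcp: \d+ parent interface: (\S+)")
# e.g. "carp: INIT vhid 4 advbase 1 advskew 0"
CARP = re.compile(r"carp: (\w+) vhid (\d+)")


@dataclass
class Iface:
    name: str
    vlan_tag: int | None = None
    parent: str | None = None
    carp: list[tuple[str, int]] = field(default_factory=list)


def parse_ifconfig(text: str) -> dict[str, Iface]:
    """Group ifconfig lines by interface (leading indentation is optional)."""
    ifaces: dict[str, Iface] = {}
    current: Iface | None = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = Iface(header.group(1))
            ifaces[current.name] = current
            continue
        if current is None:
            continue
        vlan = VLAN.search(line)
        if vlan:
            current.vlan_tag = int(vlan.group(1))
            current.parent = vlan.group(2)
        carp = CARP.search(line)
        if carp:
            current.carp.append((carp.group(1), int(carp.group(2))))
    return ifaces


def find_broken_vlans(ifaces: dict[str, Iface]) -> list[str]:
    """Report VLANs with no tag/parent and CARP VHIDs stuck in INIT."""
    problems = []
    for iface in ifaces.values():
        if iface.vlan_tag is None:
            continue  # no "vlan:" line, so not a VLAN interface
        if iface.vlan_tag == 0 or iface.parent == "<none>":
            problems.append(f"{iface.name}: tag {iface.vlan_tag}, parent {iface.parent}")
        for state, vhid in iface.carp:
            if state == "INIT":
                problems.append(f"{iface.name}: vhid {vhid} in INIT")
    return problems


SAMPLE = """\
lagg0.3: flags=8903<UP,BROADCAST,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tvlan: 0 vlanpcp: 0 parent interface: <none>
\tcarp: INIT vhid 4 advbase 1 advskew 0
lagg0.6: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tvlan: 6 vlanpcp: 0 parent interface: lagg0
\tcarp: MASTER vhid 29 advbase 1 advskew 100
"""

if __name__ == "__main__":
    for problem in find_broken_vlans(parse_ifconfig(SAMPLE)):
        print(problem)
```

The sample mixes the master's `lagg0.3` with the backup's `lagg0.6`; only `lagg0.3` is reported, once for the missing tag and parent and once for vhid 4 in INIT.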
After re-enabling VLAN 2 on the master:
lagg0.2: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1280
ether 90:e2:ba:89:b3:3c
inet6 fe80::92e2:baff:fe89:b33c%lagg0.2 prefixlen 64 scopeid 0xd
inet 172.16.2.252 netmask 0xffffff00 broadcast 172.16.2.255
inet 172.16.2.254 netmask 0xffffff00 broadcast 172.16.2.255 vhid 3
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect
status: active
vlan: 2 vlanpcp: 0 parent interface: lagg0
carp: BACKUP vhid 3 advbase 1 advskew 254
groups: vlan

So the line
vlan: 0 vlanpcp: 0 parent interface: <none>
changed to
vlan: 2 vlanpcp: 0 parent interface: lagg0

The steps I used to resolve this issue:
- Created a maintenance interface with a class 3 address and DHCP enabled on the master (it got synced to the slave node)
- Backed up the full configuration of the slave node
- Unplugged all interfaces (LAN, WAN, SYNC)
- Restored the slave's configuration to the master node through the maintenance interface
- Changed the IP addresses of all WAN, LAN and VLAN interfaces to the previous master node's addresses
- Changed the skew of all virtual IPs from 100 to 0
- Changed the failover peer IP address on every DHCP-enabled interface
- Rebooted
- Entered persistent CARP maintenance mode
- Plugged in the LAN (lagg) interface to check
  Note: it worked, and every CARP interface changed from INIT to BACKUP.
- Plugged in the remaining cables (WAN, SYNC)
- Enabled and configured HA on the master node
  Note: sync wasn't working, not even with the admin account, possibly because I had changed the password of the sync account. So I changed the sync account password on both master and slave.
- Rebooted the master node, and it worked.

Tested: all is good now.
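Several of the steps above (moving interface addresses back to the master's values, changing every CARP skew from 100 to 0, and pointing each DHCP failover peer back at the secondary) were done by hand in the GUI. With this many VLANs they could also be applied to the downloaded backup before restoring it. This is a minimal sketch in Python, assuming the usual pfSense `config.xml` layout (`interfaces/*/ipaddr`, `virtualip/vip/advskew`, `dhcpd/*/failover_peerip`); those element names are assumptions to verify against your own backup, and `convert_backup_to_primary` is a hypothetical helper.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def convert_backup_to_primary(xml_text: str, ip_pairs: dict[str, str], skew: str = "0") -> str:
    """Rewrite a secondary node's config so it can be restored on the primary.

    ip_pairs maps each secondary address to the matching primary address,
    e.g. {"172.16.1.3": "172.16.1.2"}. Element names are assumptions.
    """
    root = ET.fromstring(xml_text)
    primary_to_secondary = {p: s for s, p in ip_pairs.items()}

    # Interface addresses: secondary -> primary.
    for ipaddr in root.findall("./interfaces/*/ipaddr"):
        if ipaddr.text in ip_pairs:
            ipaddr.text = ip_pairs[ipaddr.text]

    # CARP virtual IPs: lower advskew (100 -> 0) so this node is preferred.
    for vip in root.findall("./virtualip/vip"):
        advskew = vip.find("advskew")
        if vip.findtext("mode") == "carp" and advskew is not None:
            advskew.text = skew

    # DHCP failover peer: on the primary it must point at the secondary.
    for peer in root.findall("./dhcpd/*/failover_peerip"):
        if peer.text in primary_to_secondary:
            peer.text = primary_to_secondary[peer.text]

    return ET.tostring(root, encoding="unicode")


SAMPLE = """\
<pfsense>
  <interfaces>
    <lan><if>lagg0</if><ipaddr>172.16.1.3</ipaddr><subnet>24</subnet></lan>
    <opt3><if>lagg0.2</if><ipaddr>172.16.2.253</ipaddr><subnet>24</subnet></opt3>
  </interfaces>
  <virtualip>
    <vip><mode>carp</mode><interface>lan</interface><vhid>2</vhid>
      <advskew>100</advskew><subnet>172.16.1.1</subnet></vip>
  </virtualip>
  <dhcpd>
    <lan><failover_peerip>172.16.1.2</failover_peerip></lan>
  </dhcpd>
</pfsense>
"""

if __name__ == "__main__":
    pairs = {"172.16.1.3": "172.16.1.2", "172.16.2.253": "172.16.2.252"}
    print(convert_backup_to_primary(SAMPLE, pairs))
```

Compare the output with the original in a diff before restoring it. ElementTree drops the XML declaration and any comments, and the sketch leaves the High Availability Sync settings alone, since those were reconfigured by hand in the steps above.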
I'm still not sure what went wrong when upgrading the master node, while the slave node worked perfectly after its upgrade. A clean installation with the previously saved configuration restored had also failed.
Anyway.
Thank you very much to the pfSense and Netgate team for making such a wonderful firewall and keeping it open source.