RESOLVED: CARP not failing back and other weird behaviour on pfSense 2.2
-
Abbreviated packet capture of attempt to force failback to fw1 using tcpdump -i lagg0 -n proto CARP
fw2 is master, though it shouldn't be:
16:05:46.084051 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36 16:05:46.084058 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 6, prio 240, authtype none, intvl 1s, length 36 16:05:46.084065 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 7, prio 240, authtype none, intvl 1s, length 36
Carp disabled on fw2, fw1 begins advertising:
16:05:47.145275 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36 16:05:48.039994 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 7, prio 240, authtype none, intvl 1s, length 36 16:05:48.040009 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 6, prio 240, authtype none, intvl 1s, length 36 16:05:48.040016 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36 16:05:48.040055 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 9, prio 240, authtype none, intvl 1s, length 36 16:05:48.040061 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 8, prio 240, authtype none, intvl 1s, length 36 16:05:48.169258 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36 16:05:49.978992 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 8, prio 240, authtype none, intvl 1s, length 36 16:05:49.979005 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 9, prio 240, authtype none, intvl 1s, length 36 16:05:49.979012 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 6, prio 240, authtype none, intvl 1s, length 36 16:05:49.979018 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 7, prio 240, authtype none, intvl 1s, length 36 16:05:50.108254 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36 16:05:51.071274 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36 16:05:51.917993 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 9, prio 240, authtype none, intvl 1s, length 36 16:05:51.918007 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 8, prio 240, authtype none, intvl 1s, length 36
fw1 becomes master:
16:05:52.050284 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36 16:05:52.085261 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36 16:05:52.994273 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 7, prio 240, authtype none, intvl 1s, length 36
CARP re-enabled on fw2:
16:06:17.389273 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 7, prio 240, authtype none, intvl 1s, length 36 16:06:17.401252 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36 16:06:17.596993 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36 16:06:18.326009 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36 16:06:19.232261 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 9, prio 240, authtype none, intvl 1s, length 36 16:06:19.232275 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 8, prio 240, authtype none, intvl 1s, length 36 16:06:19.328260 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 7, prio 240, authtype none, intvl 1s, length 36 16:06:19.328273 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 6, prio 240, authtype none, intvl 1s, length 36
fw2 becomes master again:
16:06:19.382004 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 6, prio 240, authtype none, intvl 1s, length 36 16:06:19.556993 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36 16:06:19.717988 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36
The above sequence of CARP packets is visible on both firewalls.
-
Log of the failback to fw2 after CARP was re-enabled:
fw1:
Feb 19 16:06:14 kernel: carp: VHID 1@lagg0: MASTER -> BACKUP (more frequent advertisement received) Feb 19 16:06:15 kernel: carp: VHID 2@lagg1_vlan2: MASTER -> BACKUP (more frequent advertisement received) Feb 19 16:06:15 php-fpm[56729]: /rc.carpbackup: Carp cluster member "213.x.x.115 - CARP0 WAN (1@lagg0)" has resumed the state "BACKUP" for vhid 1@lagg0 Feb 19 16:06:16 kernel: carp: VHID 3@lagg1_vlan3: MASTER -> BACKUP (more frequent advertisement received) Feb 19 16:06:16 php-fpm[56729]: /rc.carpbackup: Carp cluster member "192.168.2.254 - PROD LAN CARP (2@lagg1_vlan2)" has resumed the state "BACKUP" for vhid 2@lagg1_vlan2 Feb 19 16:06:17 kernel: carp: VHID 4@lagg1: MASTER -> BACKUP (more frequent advertisement received) Feb 19 16:06:17 php-fpm[56729]: /rc.carpbackup: Carp cluster member "192.168.3.254 - DEV VLAN CARP (3@lagg1_vlan3)" has resumed the state "BACKUP" for vhid 3@lagg1_vlan3 Feb 19 16:06:18 check_reload_status: Carp backup event Feb 19 16:06:18 kernel: carp: VHID 5@lagg0: MASTER -> BACKUP (more frequent advertisement received) Feb 19 16:06:18 php-fpm[56729]: /rc.carpbackup: Carp cluster member "192.168.5.254 - MANAGEMENT CARP (4@lagg1)" has resumed the state "BACKUP" for vhid 4@lagg1 Feb 19 16:06:19 check_reload_status: Carp backup event Feb 19 16:06:19 kernel: carp: VHID 6@lagg0: MASTER -> BACKUP (more frequent advertisement received) Feb 19 16:06:19 php-fpm[56729]: /rc.carpbackup: Carp cluster member "213.x.x.116 - (5@lagg0)" has resumed the state "BACKUP" for vhid 5@lagg0 Feb 19 16:06:20 check_reload_status: Carp backup event Feb 19 16:06:20 kernel: carp: VHID 7@lagg0: MASTER -> BACKUP (more frequent advertisement received) Feb 19 16:06:20 php-fpm[56729]: /rc.carpbackup: Carp cluster member "213.x.x.117 - (6@lagg0)" has resumed the state "BACKUP" for vhid 6@lagg0 Feb 19 16:06:21 kernel: carp: VHID 8@lagg0: MASTER -> BACKUP (more frequent advertisement received) Feb 19 16:06:21 php-fpm[56729]: /rc.carpbackup: Carp cluster member "213.x.x.118 - (7@lagg0)" has resumed the state "BACKUP" for vhid 7@lagg0 Feb 19 16:06:22 check_reload_status: Carp backup event Feb 19 16:06:22 kernel: carp: VHID 9@lagg0: MASTER -> BACKUP (more frequent advertisement received) Feb 19 16:06:22 php-fpm[56729]: /rc.carpbackup: Carp cluster member "213.x.x.119 - (8@lagg0)" has resumed the state "BACKUP" for vhid 8@lagg0 Feb 19 16:06:23 php-fpm[56729]: /rc.carpbackup: Carp cluster member "213.x.x.120 - (9@lagg0)" has resumed the state "BACKUP" for vhid 9@lagg0
fw2:
Feb 19 16:06:17 check_reload_status: Carp master event Feb 19 16:06:17 kernel: carp: VHID 4@lagg1: BACKUP -> MASTER (master down) Feb 19 16:06:17 php-fpm[90842]: /rc.carpmaster: Carp cluster member "192.168.3.254 - DEV VLAN CARP (3@lagg1_vlan3)" has resumed the state "MASTER" for vhid 3@lagg1_vlan3 Feb 19 16:06:17 check_reload_status: Carp backup event Feb 19 16:06:17 kernel: carp: VHID 8@lagg0: INIT -> BACKUP Feb 19 16:06:18 check_reload_status: Carp master event Feb 19 16:06:18 kernel: carp: VHID 5@lagg0: BACKUP -> MASTER (master down) Feb 19 16:06:18 php-fpm[91664]: /rc.carpbackup: Carp cluster member "213.x.x.118 - (7@lagg0)" has resumed the state "BACKUP" for vhid 7@lagg0 Feb 19 16:06:18 php-fpm[90842]: /rc.carpmaster: Carp cluster member "192.168.5.254 - MANAGEMENT CARP (4@lagg1)" has resumed the state "MASTER" for vhid 4@lagg1 Feb 19 16:06:18 check_reload_status: Carp backup event Feb 19 16:06:18 kernel: carp: VHID 9@lagg0: INIT -> BACKUP Feb 19 16:06:19 php-fpm[91664]: /rc.carpbackup: Carp cluster member "213.x.x.119 - (8@lagg0)" has resumed the state "BACKUP" for vhid 8@lagg0 Feb 19 16:06:19 check_reload_status: Carp master event Feb 19 16:06:19 kernel: carp: VHID 6@lagg0: BACKUP -> MASTER (master down) Feb 19 16:06:19 php-fpm[90842]: /rc.carpmaster: Carp cluster member "213.x.x.116 - (5@lagg0)" has resumed the state "MASTER" for vhid 5@lagg0 Feb 19 16:06:19 kernel: carp: demoted by 240 to 480 (pfsync bulk start) Feb 19 16:06:20 kernel: carp: demoted by -240 to 240 (pfsync bulk done) Feb 19 16:06:20 php-fpm[91664]: /rc.carpbackup: Carp cluster member "213.x.x.120 - (9@lagg0)" has resumed the state "BACKUP" for vhid 9@lagg0 Feb 19 16:06:20 check_reload_status: Carp master event Feb 19 16:06:20 kernel: carp: VHID 7@lagg0: BACKUP -> MASTER (master down) Feb 19 16:06:20 php-fpm[90842]: /rc.carpmaster: Carp cluster member "213.x.x.117 - (6@lagg0)" has resumed the state "MASTER" for vhid 6@lagg0 Feb 19 16:06:20 php-fpm[80124]: /carp_status.php: waiting for pfsync... Feb 19 16:06:21 check_reload_status: Carp master event Feb 19 16:06:21 kernel: carp: VHID 8@lagg0: BACKUP -> MASTER (master down) Feb 19 16:06:21 php-fpm[91664]: /rc.carpmaster: Carp cluster member "213.x.x.118 - (7@lagg0)" has resumed the state "MASTER" for vhid 7@lagg0 Feb 19 16:06:22 kernel: carp: VHID 9@lagg0: BACKUP -> MASTER (master down) Feb 19 16:06:22 check_reload_status: Carp master event Feb 19 16:06:22 php-fpm[91664]: /rc.carpmaster: Carp cluster member "213.x.x.119 - (8@lagg0)" has resumed the state "MASTER" for vhid 8@lagg0 Feb 19 16:06:23 php-fpm[91664]: /rc.carpmaster: Carp cluster member "213.x.x.120 - (9@lagg0)" has resumed the state "MASTER" for vhid 9@lagg0 Feb 19 16:06:52 php-fpm[80124]: /carp_status.php: pfsync done in 30 seconds. Feb 19 16:06:52 php-fpm[80124]: /carp_status.php: Configuring CARP settings finalize...
Why would fw2 be advertising more frequently when advskew is 0 on fw1 and 100 on fw2?
-
Hi!
I've a CARP system running on nearly an equal or equal hardware, DELL R210 II with a quad port Intel igb NIC card. I've upgraded them from 2.1.4 to 2.2 three weeks ago, and it works without an issue since that time.
I have just modified two boot parameters for the interfaces:
kern.ipc.nmbclusters="131072" hw.bce1.fc_setting=0
However, I have no lagg configuration.
I've all NICs on the Intel card and one onboard (for sync) in use. All of the four interfaces are connected to switches.
I tested failover by pulling out a network cable from master, and all VIPs moved to backup on FW1 and these on FW2 became master. After plug in the cable FW1 became master again.Maybe CARP in 2.2 have trouble in combination with lagg interfaces. There are other unsolved threads here describing similar problems.
-
I also have a weird issue involving CARP + LAGG where my system ends up in a split-brain mode. The backup unit has MASTER on the LAN while the master unit has MASTER on WAN and DMZ.
Found some messages I'd never seen before in dmesg output:
carp: demoted by 240 to 480 (pfsync bulk start) carp: demoted by -240 to 240 (pfsync bulk fail) carp: demoted by 0 to 240 (sysctl) carp: demoted by 0 to 240 (sysctl) carp: demoted by 240 to 480 (pfsync bulk start) carp: demoted by -240 to 240 (pfsync bulk fail) carp: demoted by 240 to 480 (pfsync bulk start) carp: demoted by -240 to 240 (pfsync bulk fail) carp: demoted by 240 to 480 (pfsync bulk start) carp: demoted by -240 to 240 (pfsync bulk fail)
and similar stuff in system.log
[2.2-RELEASE][root@pfsense1]/root: tail /var/log/system.log Feb 18 15:50:11 pfsense1 check_reload_status: Syncing firewall Feb 18 15:50:11 pfsense1 kernel: carp: demoted by 240 to 480 (pfsync bulk start) Feb 18 15:50:12 pfsense1 php-fpm[29759]: /system_hasync.php: waiting for pfsync... Feb 18 15:50:12 pfsense1 php-fpm[29759]: /system_hasync.php: pfsync done in 0 seconds. Feb 18 15:50:12 pfsense1 php-fpm[29759]: /system_hasync.php: Configuring CARP settings finalize... Feb 18 15:50:12 pfsense1 php-fpm[39031]: /rc.filter_synchronize: Beginning XMLRPC sync to https://10.21.2.252:443. Feb 18 15:50:12-pfsense1 php-fpm[39031]: /rc.filter_synchronize: XMLRPC sync successfully completed with https://10.21.2.252:443. Feb 18 15:50:16 pfsense1 php-fpm[39031]: /rc.filter_synchronize: Filter sync successfully completed with https://10.21.2.252:443. Feb 18 15:51:16 pfsense1 kernel: carp: demoted by -240 to 240 (pfsync bulk fail)
this setup did work fine before the upgrade to 2.2 and I have identical hardware working at 2 sites (also running 2.2) while this site refuses to pysync correctly.
I use a 4x Intel NIC with em driver and 2x onboard using igb just in case. pfsync NIC is em2.
Currently the xml sync is working so my config is propagating to the backup but the CARP state sync is failing so my MASTER always says:
CARP has detected a problem and this unit has been demoted to BACKUP status.
Check link status on all interfaces with configured CARP VIPs.Which I gather it knows because this sysctl is not zero.
net.inet.carp.demotion: 240
This is a remote site and some switch may have been replaced without my knowledge so I'm looking for the magic tcpdump commands to run to verify the CARP traffic now.
-
My testing with tcpdump -i $NIC -ttt -n proto CARP seems to show the backup sees the VRRPv2 Advertisement on all 3 CARP interface I use.
Is there a way I can force clear the net.inet.carp.demotion value back to zero??
It seems pretty convinced that it's in a bad state:
[2.2-RELEASE][root@la-pfsense1]/root: sysctl net.inet.carp.demotion=0 net.inet.carp.demotion: 240 -> 240
-
In case anyone else is looking…
sysctl net.inet.carp.demotion=-240
net.inet.carp.demotion: 240 -> 0
Now I'm just waiting for someone to be on site before I restore the backup.
-
So my problem seems to be due to the crossover cable I used which was supposed to just be a temporary measure until something actually got put in the DMZ.
After setting the backup to maintenance mode I applied the previous sysctl -240. I then turned off maintenance mode on the backup and things were in a good state.
As a test I rebooted the backup unit but almost immediately saw the message on the master unit's Status -> CARP web UI:
CARP has detected a problem and this unit has been demoted to BACKUP status.
Check link status on all interfaces with configured CARP VIPs.then found this in master unit's dmesg:
carp: demoted by 240 to 240 (interface down) em3: link state changed to DOWN carp: VHID 222@em3: INIT -> BACKUP carp: demoted by -240 to 0 (interface up) em3: link state changed to UP carp: demoted by 240 to 240 (pfsync bulk start) carp: VHID 222@em3: BACKUP -> MASTER (master down) ifa_add_loopback_route: insertion failed: 17
So I quickly reapplied the sysctl -240 and the master stayed in charge after the backup finished it's reboot and now I know I need to put a switch in there.
Sorry to hijack your thread a bit there viragomann… I hope this helps somebody!
-
FWIW the problem is ongoing and I managed to capture the following errors at boot time
sysctl: oid 'hw.bce.tso_enable' is a read only tunable
sysctl: Tunable values are set in /boot/loader.conf
sysctl: oid 'hw.igb.num_queues' is a read only tunable
sysctl: Tunable values are set in /boot/loader.confI noticed that https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards was changed on 16/02/15 so I have changed the above settings by dropping hw.igb.num_queues entirely and putting hw.bce.tso_enable in /boot/loader.conf.local. That cleaned up the errors but the CARP VIPs are still failing over to fw2 and fw1 is still logging that more frequent advertisements are being received :-\
carp: VHID 1@lagg0: MASTER -> BACKUP (more frequent advertisement received)
carp: VHID 2@lagg1_vlan2: MASTER -> BACKUP (more frequent advertisement received)
arp: 192.168.2.254 moved from xxxx to xxxx on lagg1_vlan2
carp: VHID 3@lagg1_vlan3: MASTER -> BACKUP (more frequent advertisement received)
carp: VHID 4@lagg1: MASTER -> BACKUP (more frequent advertisement received)
carp: VHID 5@lagg0: MASTER -> BACKUP (more frequent advertisement received)
carp: VHID 6@lagg0: MASTER -> BACKUP (more frequent advertisement received)
carp: VHID 7@lagg0: MASTER -> BACKUP (more frequent advertisement received)
carp: VHID 8@lagg0: MASTER -> BACKUP (more frequent advertisement received)
carp: VHID 9@lagg0: MASTER -> BACKUP (more frequent advertisement received)Any thoughts on why that might be if I can see the advertisements on both sides as in the packet capture above?
-
Dear All,
similar issues did appear with igb NICs on my two sets of duplicated servers Intel(R) Atom(TM) CPU C2758 @ 2.40GHz wit 16 GB RAM (reported https://forum.pfsense.org/index.php?topic=89132.0).
My /boot/loader.conf is:
autoboot_delay="3"
vm.kmem_size="536870912"
vm.kmem_size_max="1073741824"
kern.ipc.nmbclusters="1000000"
comconsole_speed="9600"
hw.usb.no_pf="1"
hw.igb.fc_setting=0Regards,
Michael Schefczyk
-
To add closure to this issue, the problem went away by resetting sysctl net.inet.carp.demotion from 240 to 0 with:
sysctl net.inet.carp.demotion=-240
sysctl net.inet.carp.demotion is essentially a penalty against the advskew settings. Returning this to 0 made the VIPs stable and removed the warning from the CARP status page, though it would recur following a reboot.
According to https://forum.pfsense.org/index.php?topic=89132.msg496865#msg496865 the problem is caused when using CARP on a LAGG. When the LAGG is initialised it loses some CARP advertisements and causes net.inet.carp.demotion to be increased by the value of net.inet.carp.senderr_demotion_factor (240).
Setting:
net.inet.carp.senderr_demotion_factor=0
means that this issue no longer occurs when at boot time and is therefore resolved permanently.