RESOLVED: CARP not failing back and other weird behaviour on pfSense 2.2



  • Hi everyone

    I have some weird issues with pfSense 2.2, originally installed as 2.1 and upgraded via 2.1.4 and 2.1.5. Everything appears to have been working normally before the upgrade to 2.2. It was tested and working under 2.1, then handed over to the owner, and I've not touched it much since. I'm not sure whether failover has been tested since then, but there were no reports of any problems prior to the upgrade to 2.2.

    There are two firewalls running on Dell R210 IIs, with two onboard bce NICs and a four port igb card. Interfaces on both are set up as:

    WAN: lagg0 on bce0 and igb0
    LAN: lagg1 on bce1 and igb1
    VLAN: lagg1_vlan2 on lagg1
    VLAN: lagg1_vlan3 on lagg1
    pfsync: lagg2 on igb2 and igb3

    IPs are

    fw1:

    WAN (wan)      -> lagg0      -> v4: 213.x.x.113/28
    LAN (lan)      -> lagg1      -> v4: 192.168.5.253/24
    PFSYNC (opt1)  -> lagg2      -> v4: 172.16.254.254/29
    PRODUCTIONVLAN (opt2) -> lagg1_vlan2 -> v4: 192.168.2.253/24
    DEVELOPMENTVLAN (opt3) -> lagg1_vlan3 -> v4: 192.168.3.253/24

    fw2:
    WAN (wan)      -> lagg0      -> v4: 213.x.x.114/28
    LAN (lan)      -> lagg1      -> v4: 192.168.5.252/24
    PFSYNC (opt1)  -> lagg2      -> v4: 172.16.254.253/29
    PRODUCTIONVLAN (opt2) -> lagg1_vlan2 -> v4: 192.168.2.252/24
    DEVELOPMENTVLAN (opt3) -> lagg1_vlan3 -> v4: 192.168.3.252/24

    IPv6 configuration is set to none on all interfaces and allow IPv6 traffic is unticked under System > Advanced > Networking.

    The primary issue appears to be CARP VIP instability. VIPs on the WAN side would slowly move from fw1 to fw2 over a period of around 10-15 mins, but the LAN side would remain on fw1. I checked sysctl and found preempt is enabled on both sides (net.inet.carp.preempt: 1). The colo provider only supplied one uplink per firewall, whereas the firewalls were built with two in mind, so only the bce0 interfaces were connected in lagg0. Investigating, I found that the bce0 interfaces had autonegotiated down to half-duplex, so I worked with the colo provider to force the upstream switches into full-duplex mode. I haven't been able to set this in a persistent way on the pfSense end (more on that later perhaps), but I can at least set it manually, though it doesn't survive a reboot. In any case, this did not solve the problem. I didn't keep a packet capture from before this was set so I can't compare, though I could see CARP advertisements being sent by the master and received by the backup both before and after.

    I set the following on both firewalls as recommended at https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards and elsewhere:

    kern.ipc.nmbclusters 131072
    hw.bce.tso_enable 0
    hw.pci.enable_msix 0
    hw.igb.num_queues 1

    MBUF usage is 20% or less on both.

    hw.bce.tso_enable 0 did appear to stabilise things, so after 24 hours I decided to move the uplinks to igb0, as Intel cards are reputed to be less problematic and I wanted to guarantee stability using either interface. Things remained stable afterwards, so I asked for the forced speed and duplex to be removed on the switch ports, since I couldn't set it in a way that would survive a reboot, and I removed it on both firewalls as well. Again the interfaces became half-duplex. I have asked the colo provider twice if they can see any reason for this, but they have not responded either time, so we hurriedly returned to forcing 100baseTX full-duplex at each end. However, the VIPs have not been stable since.

    Again, the WAN VIPs moved from fw1 to fw2, though the LAN VIPs remained on fw1. If I disable CARP on fw1, the LAN VIPs move to fw2. When enabling CARP again on fw1, none of the VIPs fail back even though advbase is 1 on both, advskew is 0 on fw1 and 100 on fw2.

    When the VIPs have been unstable, and currently while they're active on fw2, the CARP status page on both firewalls reports "CARP has detected a problem and this unit has been demoted to BACKUP status. Check link status on all interfaces with configured CARP VIPs." This appears even when the VIPs are on fw1. In the periods while the VIPs were stable on fw1, this message was not visible on fw1's CARP status page.

    To summarise, the issues are as follows:

    1. The CARP VIPs are unstable on fw1 and migrate to fw2. Disabling CARP on fw2 results in them migrating back to fw1, but they move to fw2 again when CARP is enabled.
    2. This sometimes results in LAN VIPs sticking to fw1 when the WAN VIPs have moved to fw2 even though preempt is enabled. I have to disable CARP on fw1 to get them to move to fw2.
    3. The VIPs are not failing back to fw1 when it is up and available.

    The fact that the LAN VIPs don't migrate and there are no packet errors on the LAN interfaces suggests to me that this is a problem specific to the WAN side, where packet errors and collisions were observed prior to forcing the speed and duplex settings.

    There are other minor issues to deal with (how to force full-duplex permanently, persistent maintenance mode not doing what I expect, occasional firewall blocks for allowed traffic), but I think it's better to address those another time.

    At this point I'm inclined to think it's one of the following:

    1. A non-obvious config setting.
    2. Problems communicating between the WAN interfaces over the colo provider's uplink switch, though I can see CARP advertisements.
    3. A bug introduced by the upgrade to 2.2. I'm not aware that these problems were occurring prior to this.

    I'd be grateful if somebody could suggest further diagnostics and avenues for investigation. I'll post the VIP interface info in a comment.



  • I should add that these were built according to https://doc.pfsense.org/index.php/Configuring_pfSense_Hardware_Redundancy_(CARP) and set up as suggested at https://doc.pfsense.org/index.php/CARP_Configuration_Sync_Troubleshooting.

    Anyway, interface info:

    fw1:

    igb0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
            options=400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO>
            ether xxxx
            nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
            media: Ethernet 100baseTX <full-duplex>
            status: active

    lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
            options=400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO>
            ether xxxx
            inet6 xxxx%lagg0 prefixlen 64 scopeid 0xb
            inet 213.x.x.113 netmask 0xfffffff0 broadcast 213.x.x.127
            inet 213.x.x.115 netmask 0xfffffff0 broadcast 213.x.x.127 vhid 1
            inet 213.x.x.116 netmask 0xfffffff0 broadcast 213.x.x.127 vhid 5
            inet 213.x.x.117 netmask 0xfffffff0 broadcast 213.x.x.127 vhid 6
            inet 213.x.x.118 netmask 0xfffffff0 broadcast 213.x.x.127 vhid 7
            inet 213.x.x.119 netmask 0xfffffff0 broadcast 213.x.x.127 vhid 8
            inet 213.x.x.120 netmask 0xfffffff0 broadcast 213.x.x.127 vhid 9
            nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
            media: Ethernet autoselect
            status: active
            carp: BACKUP vhid 1 advbase 1 advskew 0
            carp: BACKUP vhid 5 advbase 1 advskew 0
            carp: BACKUP vhid 6 advbase 1 advskew 0
            carp: BACKUP vhid 7 advbase 1 advskew 0
            carp: BACKUP vhid 8 advbase 1 advskew 0
            carp: BACKUP vhid 9 advbase 1 advskew 0
            laggproto failover lagghash l2,l3,l4
            laggport: bce0 flags=0<>
            laggport: igb0 flags=5<MASTER,ACTIVE>
    

    fw2:

    igb0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
            options=400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO>
            ether xxxx
            nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
            media: Ethernet 100baseTX <full-duplex>
            status: active

    lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
            options=400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO>
            ether xxxx
            inet6 xxxx%lagg0 prefixlen 64 scopeid 0xb
            inet 213.x.x.114 netmask 0xfffffff0 broadcast 213.x.x.127
            inet 213.x.x.115 netmask 0xfffffff0 broadcast 213.x.x.127 vhid 1
            inet 213.x.x.116 netmask 0xfffffff0 broadcast 213.x.x.127 vhid 5
            inet 213.x.x.117 netmask 0xfffffff0 broadcast 213.x.x.127 vhid 6
            inet 213.x.x.118 netmask 0xfffffff0 broadcast 213.x.x.127 vhid 7
            inet 213.x.x.119 netmask 0xfffffff0 broadcast 213.x.x.127 vhid 8
            inet 213.x.x.120 netmask 0xfffffff0 broadcast 213.x.x.127 vhid 9
            nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
            media: Ethernet autoselect
            status: active
            carp: MASTER vhid 1 advbase 1 advskew 100
            carp: MASTER vhid 5 advbase 1 advskew 100
            carp: MASTER vhid 6 advbase 1 advskew 100
            carp: MASTER vhid 7 advbase 1 advskew 100
            carp: MASTER vhid 8 advbase 1 advskew 100
            carp: MASTER vhid 9 advbase 1 advskew 100
            laggproto failover lagghash l2,l3,l4
            laggport: bce0 flags=0<>
            laggport: igb0 flags=5<MASTER,ACTIVE>
    

    I can provide stats for the other interfaces, laggs and vlans if necessary.



  • Abbreviated packet capture of attempt to force failback to fw1 using tcpdump -i lagg0 -n proto CARP

    fw2 is master, though it shouldn't be:

    16:05:46.084051 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36
    16:05:46.084058 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 6, prio 240, authtype none, intvl 1s, length 36
    16:05:46.084065 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 7, prio 240, authtype none, intvl 1s, length 36
    

    CARP disabled on fw2, fw1 begins advertising:

    16:05:47.145275 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36
    16:05:48.039994 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 7, prio 240, authtype none, intvl 1s, length 36
    16:05:48.040009 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 6, prio 240, authtype none, intvl 1s, length 36
    16:05:48.040016 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36
    16:05:48.040055 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 9, prio 240, authtype none, intvl 1s, length 36
    16:05:48.040061 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 8, prio 240, authtype none, intvl 1s, length 36
    16:05:48.169258 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36
    16:05:49.978992 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 8, prio 240, authtype none, intvl 1s, length 36
    16:05:49.979005 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 9, prio 240, authtype none, intvl 1s, length 36
    16:05:49.979012 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 6, prio 240, authtype none, intvl 1s, length 36
    16:05:49.979018 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 7, prio 240, authtype none, intvl 1s, length 36
    16:05:50.108254 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36
    16:05:51.071274 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36
    16:05:51.917993 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 9, prio 240, authtype none, intvl 1s, length 36
    16:05:51.918007 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 8, prio 240, authtype none, intvl 1s, length 36
    

    fw1 becomes master:

    16:05:52.050284 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36
    16:05:52.085261 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36
    16:05:52.994273 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 7, prio 240, authtype none, intvl 1s, length 36
    

    CARP re-enabled on fw2:

    16:06:17.389273 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 7, prio 240, authtype none, intvl 1s, length 36
    16:06:17.401252 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36
    16:06:17.596993 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36
    16:06:18.326009 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36
    16:06:19.232261 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 9, prio 240, authtype none, intvl 1s, length 36
    16:06:19.232275 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 8, prio 240, authtype none, intvl 1s, length 36
    16:06:19.328260 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 7, prio 240, authtype none, intvl 1s, length 36
    16:06:19.328273 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 6, prio 240, authtype none, intvl 1s, length 36
    

    fw2 becomes master again:

    16:06:19.382004 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 6, prio 240, authtype none, intvl 1s, length 36
    16:06:19.556993 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36
    16:06:19.717988 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36
    

    The above sequence of CARP packets is visible on both firewalls.
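
    A quick way to read a capture like this is to group the advertisements by sender and vrid and look at the prio field. What follows is a minimal Python sketch, not part of the original capture workflow. The function name summarise and the sample list are illustrative, and it assumes that tcpdump's VRRPv2 decoding of CARP puts the advertised skew in prio.

```python
import re
from collections import defaultdict

# Matches the tcpdump line format shown in the capture above.
ADVERT_RE = re.compile(
    r"IP (\S+) > 224\.0\.0\.18: VRRPv2, Advertisement, vrid (\d+), prio (\d+)"
)

def summarise(lines):
    """Map (source, vrid) -> set of prio values seen for that pair."""
    seen = defaultdict(set)
    for line in lines:
        match = ADVERT_RE.search(line)
        if match:
            src = match.group(1)
            vrid = int(match.group(2))
            prio = int(match.group(3))
            seen[(src, vrid)].add(prio)
    return dict(seen)

# Three lines copied from the capture above.
SAMPLE = [
    "16:05:46.084051 IP 213.x.x.114 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36",
    "16:05:47.145275 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36",
    "16:05:51.071274 IP 213.x.x.113 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36",
]

for (src, vrid), prios in sorted(summarise(SAMPLE).items()):
    print(f"{src} vrid {vrid}: prio {sorted(prios)}")
```

    In this capture both senders show prio 240 for every vrid, rather than the configured 0 and 100.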



  • Log of the failback to fw2 after CARP was re-enabled:

    fw1:

    Feb 19 16:06:14	kernel: carp: VHID 1@lagg0: MASTER -> BACKUP (more frequent advertisement received)
    Feb 19 16:06:15	kernel: carp: VHID 2@lagg1_vlan2: MASTER -> BACKUP (more frequent advertisement received)
    Feb 19 16:06:15	php-fpm[56729]: /rc.carpbackup: Carp cluster member "213.x.x.115 - CARP0 WAN (1@lagg0)" has resumed the state "BACKUP" for vhid 1@lagg0
    Feb 19 16:06:16	kernel: carp: VHID 3@lagg1_vlan3: MASTER -> BACKUP (more frequent advertisement received)
    Feb 19 16:06:16	php-fpm[56729]: /rc.carpbackup: Carp cluster member "192.168.2.254 - PROD LAN CARP (2@lagg1_vlan2)" has resumed the state "BACKUP" for vhid 2@lagg1_vlan2
    Feb 19 16:06:17	kernel: carp: VHID 4@lagg1: MASTER -> BACKUP (more frequent advertisement received)
    Feb 19 16:06:17	php-fpm[56729]: /rc.carpbackup: Carp cluster member "192.168.3.254 - DEV VLAN CARP (3@lagg1_vlan3)" has resumed the state "BACKUP" for vhid 3@lagg1_vlan3
    Feb 19 16:06:18	check_reload_status: Carp backup event
    Feb 19 16:06:18	kernel: carp: VHID 5@lagg0: MASTER -> BACKUP (more frequent advertisement received)
    Feb 19 16:06:18	php-fpm[56729]: /rc.carpbackup: Carp cluster member "192.168.5.254 - MANAGEMENT CARP (4@lagg1)" has resumed the state "BACKUP" for vhid 4@lagg1
    Feb 19 16:06:19	check_reload_status: Carp backup event
    Feb 19 16:06:19	kernel: carp: VHID 6@lagg0: MASTER -> BACKUP (more frequent advertisement received)
    Feb 19 16:06:19	php-fpm[56729]: /rc.carpbackup: Carp cluster member "213.x.x.116 - (5@lagg0)" has resumed the state "BACKUP" for vhid 5@lagg0
    Feb 19 16:06:20	check_reload_status: Carp backup event
    Feb 19 16:06:20	kernel: carp: VHID 7@lagg0: MASTER -> BACKUP (more frequent advertisement received)
    Feb 19 16:06:20	php-fpm[56729]: /rc.carpbackup: Carp cluster member "213.x.x.117 - (6@lagg0)" has resumed the state "BACKUP" for vhid 6@lagg0
    Feb 19 16:06:21	kernel: carp: VHID 8@lagg0: MASTER -> BACKUP (more frequent advertisement received)
    Feb 19 16:06:21	php-fpm[56729]: /rc.carpbackup: Carp cluster member "213.x.x.118 - (7@lagg0)" has resumed the state "BACKUP" for vhid 7@lagg0
    Feb 19 16:06:22	check_reload_status: Carp backup event
    Feb 19 16:06:22	kernel: carp: VHID 9@lagg0: MASTER -> BACKUP (more frequent advertisement received)
    Feb 19 16:06:22	php-fpm[56729]: /rc.carpbackup: Carp cluster member "213.x.x.119 - (8@lagg0)" has resumed the state "BACKUP" for vhid 8@lagg0
    Feb 19 16:06:23	php-fpm[56729]: /rc.carpbackup: Carp cluster member "213.x.x.120 - (9@lagg0)" has resumed the state "BACKUP" for vhid 9@lagg0
    

    fw2:

    Feb 19 16:06:17	check_reload_status: Carp master event
    Feb 19 16:06:17	kernel: carp: VHID 4@lagg1: BACKUP -> MASTER (master down)
    Feb 19 16:06:17	php-fpm[90842]: /rc.carpmaster: Carp cluster member "192.168.3.254 - DEV VLAN CARP (3@lagg1_vlan3)" has resumed the state "MASTER" for vhid 3@lagg1_vlan3
    Feb 19 16:06:17	check_reload_status: Carp backup event
    Feb 19 16:06:17	kernel: carp: VHID 8@lagg0: INIT -> BACKUP
    Feb 19 16:06:18	check_reload_status: Carp master event
    Feb 19 16:06:18	kernel: carp: VHID 5@lagg0: BACKUP -> MASTER (master down)
    Feb 19 16:06:18	php-fpm[91664]: /rc.carpbackup: Carp cluster member "213.x.x.118 - (7@lagg0)" has resumed the state "BACKUP" for vhid 7@lagg0
    Feb 19 16:06:18	php-fpm[90842]: /rc.carpmaster: Carp cluster member "192.168.5.254 - MANAGEMENT CARP (4@lagg1)" has resumed the state "MASTER" for vhid 4@lagg1
    Feb 19 16:06:18	check_reload_status: Carp backup event
    Feb 19 16:06:18	kernel: carp: VHID 9@lagg0: INIT -> BACKUP
    Feb 19 16:06:19	php-fpm[91664]: /rc.carpbackup: Carp cluster member "213.x.x.119 - (8@lagg0)" has resumed the state "BACKUP" for vhid 8@lagg0
    Feb 19 16:06:19	check_reload_status: Carp master event
    Feb 19 16:06:19	kernel: carp: VHID 6@lagg0: BACKUP -> MASTER (master down)
    Feb 19 16:06:19	php-fpm[90842]: /rc.carpmaster: Carp cluster member "213.x.x.116 - (5@lagg0)" has resumed the state "MASTER" for vhid 5@lagg0
    Feb 19 16:06:19	kernel: carp: demoted by 240 to 480 (pfsync bulk start)
    Feb 19 16:06:20	kernel: carp: demoted by -240 to 240 (pfsync bulk done)
    Feb 19 16:06:20	php-fpm[91664]: /rc.carpbackup: Carp cluster member "213.x.x.120 - (9@lagg0)" has resumed the state "BACKUP" for vhid 9@lagg0
    Feb 19 16:06:20	check_reload_status: Carp master event
    Feb 19 16:06:20	kernel: carp: VHID 7@lagg0: BACKUP -> MASTER (master down)
    Feb 19 16:06:20	php-fpm[90842]: /rc.carpmaster: Carp cluster member "213.x.x.117 - (6@lagg0)" has resumed the state "MASTER" for vhid 6@lagg0
    Feb 19 16:06:20	php-fpm[80124]: /carp_status.php: waiting for pfsync...
    Feb 19 16:06:21	check_reload_status: Carp master event
    Feb 19 16:06:21	kernel: carp: VHID 8@lagg0: BACKUP -> MASTER (master down)
    Feb 19 16:06:21	php-fpm[91664]: /rc.carpmaster: Carp cluster member "213.x.x.118 - (7@lagg0)" has resumed the state "MASTER" for vhid 7@lagg0
    Feb 19 16:06:22	kernel: carp: VHID 9@lagg0: BACKUP -> MASTER (master down)
    Feb 19 16:06:22	check_reload_status: Carp master event
    Feb 19 16:06:22	php-fpm[91664]: /rc.carpmaster: Carp cluster member "213.x.x.119 - (8@lagg0)" has resumed the state "MASTER" for vhid 8@lagg0
    Feb 19 16:06:23	php-fpm[91664]: /rc.carpmaster: Carp cluster member "213.x.x.120 - (9@lagg0)" has resumed the state "MASTER" for vhid 9@lagg0
    Feb 19 16:06:52	php-fpm[80124]: /carp_status.php: pfsync done in 30 seconds.
    Feb 19 16:06:52	php-fpm[80124]: /carp_status.php: Configuring CARP settings finalize...
    

    Why would fw2 be advertising more frequently when advskew is 0 on fw1 and 100 on fw2?



  • Hi!

    I have a CARP system running on identical or nearly identical hardware: Dell R210 IIs with a quad-port Intel igb NIC. I upgraded them from 2.1.4 to 2.2 three weeks ago, and they have worked without an issue since then.

    I have just modified two boot parameters for the interfaces:

    kern.ipc.nmbclusters="131072"
    hw.bce1.fc_setting=0
    

    However, I have no lagg configuration.
    I have all the NICs on the Intel card and one onboard NIC (for sync) in use. All four of the interfaces are connected to switches.
    I tested failover by pulling a network cable from the master: all VIPs moved to BACKUP on FW1 and those on FW2 became MASTER. After I plugged the cable back in, FW1 became master again.

    Maybe CARP in 2.2 has trouble in combination with lagg interfaces. There are other unsolved threads here describing similar problems.



  • I also have a weird issue involving CARP + LAGG where my system ends up in a split-brain mode.  The backup unit has MASTER on the LAN while the master unit has MASTER on WAN and DMZ.

    Found some messages I'd never seen before in dmesg output:

    
    carp: demoted by 240 to 480 (pfsync bulk start)
    carp: demoted by -240 to 240 (pfsync bulk fail)
    carp: demoted by 0 to 240 (sysctl)
    carp: demoted by 0 to 240 (sysctl)
    carp: demoted by 240 to 480 (pfsync bulk start)
    carp: demoted by -240 to 240 (pfsync bulk fail)
    carp: demoted by 240 to 480 (pfsync bulk start)
    carp: demoted by -240 to 240 (pfsync bulk fail)
    carp: demoted by 240 to 480 (pfsync bulk start)
    carp: demoted by -240 to 240 (pfsync bulk fail)
    
    

    and similar stuff in system.log

    
    [2.2-RELEASE][root@pfsense1]/root: tail /var/log/system.log 
    Feb 18 15:50:11 pfsense1 check_reload_status: Syncing firewall
    Feb 18 15:50:11 pfsense1 kernel: carp: demoted by 240 to 480 (pfsync bulk start)
    Feb 18 15:50:12 pfsense1 php-fpm[29759]: /system_hasync.php: waiting for pfsync...
    Feb 18 15:50:12 pfsense1 php-fpm[29759]: /system_hasync.php: pfsync done in 0 seconds.
    Feb 18 15:50:12 pfsense1 php-fpm[29759]: /system_hasync.php: Configuring CARP settings finalize...
    Feb 18 15:50:12 pfsense1 php-fpm[39031]: /rc.filter_synchronize: Beginning XMLRPC sync to https://10.21.2.252:443.
    Feb 18 15:50:12 pfsense1 php-fpm[39031]: /rc.filter_synchronize: XMLRPC sync successfully completed with https://10.21.2.252:443.
    Feb 18 15:50:16 pfsense1 php-fpm[39031]: /rc.filter_synchronize: Filter sync successfully completed with https://10.21.2.252:443.
    Feb 18 15:51:16 pfsense1 kernel: carp: demoted by -240 to 240 (pfsync bulk fail)
    
    

    This setup worked fine before the upgrade to 2.2, and I have identical hardware working at 2 sites (also running 2.2), while this site refuses to pfsync correctly.

    I use a 4x Intel NIC with em driver and 2x onboard using igb just in case.  pfsync NIC is em2.

    Currently the xml sync is working so my config is propagating to the backup but the CARP state sync is failing so my MASTER always says:

    CARP has detected a problem and this unit has been demoted to BACKUP status.
    Check link status on all interfaces with configured CARP VIPs.

    Which I gather it knows because this sysctl is not zero.

    net.inet.carp.demotion: 240
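
    One way to see where the 240 comes from is to replay the "demoted by" kernel messages. The sketch below is illustrative Python, not a pfSense tool. The function name replay_demotions is made up, and it assumes the message format seen in the dmesg output above: "demoted by DELTA to TOTAL (REASON)".

```python
import re

DEMOTE_RE = re.compile(r"carp: demoted by (-?\d+) to (-?\d+) \((.+)\)")

def replay_demotions(lines):
    """Return (baseline, final, net_by_reason) for the matching messages.

    baseline is the counter value implied before the first message,
    final is the total reported by the last message, and net_by_reason
    sums the deltas logged under each reason.
    """
    baseline = None
    final = None
    net_by_reason = {}
    for line in lines:
        match = DEMOTE_RE.search(line)
        if not match:
            continue
        delta = int(match.group(1))
        total = int(match.group(2))
        reason = match.group(3)
        if baseline is None:
            baseline = total - delta
        final = total
        net_by_reason[reason] = net_by_reason.get(reason, 0) + delta
    return baseline, final, net_by_reason

# First three messages from the dmesg excerpt above.
SAMPLE = [
    "carp: demoted by 240 to 480 (pfsync bulk start)",
    "carp: demoted by -240 to 240 (pfsync bulk fail)",
    "carp: demoted by 0 to 240 (sysctl)",
]

baseline, final, net_by_reason = replay_demotions(SAMPLE)
print(f"baseline={baseline} final={final} net_by_reason={net_by_reason}")
```

    For the excerpt above the implied baseline is already 240, and each pfsync bulk start is cancelled by the matching bulk fail, so the leftover penalty predates these messages.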
    

    This is a remote site and some switch may have been replaced without my knowledge so I'm looking for the magic tcpdump commands to run to verify the CARP traffic now.



  • My testing with tcpdump -i $NIC -ttt -n proto CARP seems to show the backup sees the VRRPv2 Advertisement on all 3 CARP interfaces I use.

    Is there a way I can force clear the net.inet.carp.demotion value back to zero?

    It seems pretty convinced that it's in a bad state:

    
    [2.2-RELEASE][root@la-pfsense1]/root: sysctl net.inet.carp.demotion=0
    net.inet.carp.demotion: 240 -> 240
    
    


  • In case anyone else is looking…

    sysctl net.inet.carp.demotion=-240

    net.inet.carp.demotion: 240 -> 0

    Now I'm just waiting for someone to be on site before I restore the backup.
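
    The two outputs above (writing 0 left the value at 240, writing -240 took it to 0) suggest that the value written is added to the counter rather than replacing it. A minimal Python sketch of that reading follows. The helper names are hypothetical, and the additive behaviour is inferred from the outputs in this thread, not from documentation.

```python
SYSCTL_NAME = "net.inet.carp.demotion"

def demotion_after_write(current, written):
    """Model a write as an adjustment: the written value is added."""
    return current + written

def reset_command(current):
    """Build the sysctl command that should bring the counter back to 0."""
    return f"sysctl {SYSCTL_NAME}={-current}"

print(demotion_after_write(240, 0))     # writing 0 leaves the counter alone
print(demotion_after_write(240, -240))  # writing -240 clears the penalty
print(reset_command(240))
```

    On the firewall itself, the current value can be read first with sysctl -n net.inet.carp.demotion, so that the right offset is written.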



  • So my problem seems to be due to the crossover cable I used which was supposed to just be a temporary measure until something actually got put in the DMZ.

    After setting the backup to maintenance mode I applied the previous sysctl -240.  I then turned off maintenance mode on the backup and things were in a good state.

    As a test I rebooted the backup unit but almost immediately saw the message on the master unit's Status -> CARP web UI:

    CARP has detected a problem and this unit has been demoted to BACKUP status.
    Check link status on all interfaces with configured CARP VIPs.

    then found this in master unit's dmesg:

    
    carp: demoted by 240 to 240 (interface down)
    em3: link state changed to DOWN
    carp: VHID 222@em3: INIT -> BACKUP
    carp: demoted by -240 to 0 (interface up)
    em3: link state changed to UP
    carp: demoted by 240 to 240 (pfsync bulk start)
    carp: VHID 222@em3: BACKUP -> MASTER (master down)
    ifa_add_loopback_route: insertion failed: 17
    
    

    So I quickly reapplied the sysctl -240 and the master stayed in charge after the backup finished its reboot. Now I know I need to put a switch in there.

    Sorry to hijack your thread a bit there viragomann… I hope this helps somebody!



  • FWIW the problem is ongoing and I managed to capture the following errors at boot time

    sysctl: oid 'hw.bce.tso_enable' is a read only tunable
    sysctl: Tunable values are set in /boot/loader.conf
    sysctl: oid 'hw.igb.num_queues' is a read only tunable
    sysctl: Tunable values are set in /boot/loader.conf

    I noticed that https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards was changed on 16/02/15 so I have changed the above settings by dropping hw.igb.num_queues entirely and putting hw.bce.tso_enable in /boot/loader.conf.local. That cleaned up the errors but the CARP VIPs are still failing over to fw2 and fw1 is still logging that more frequent advertisements are being received  :-\

    carp: VHID 1@lagg0: MASTER -> BACKUP (more frequent advertisement received)
    carp: VHID 2@lagg1_vlan2: MASTER -> BACKUP (more frequent advertisement received)
    arp: 192.168.2.254 moved from xxxx to xxxx on lagg1_vlan2
    carp: VHID 3@lagg1_vlan3: MASTER -> BACKUP (more frequent advertisement received)
    carp: VHID 4@lagg1: MASTER -> BACKUP (more frequent advertisement received)
    carp: VHID 5@lagg0: MASTER -> BACKUP (more frequent advertisement received)
    carp: VHID 6@lagg0: MASTER -> BACKUP (more frequent advertisement received)
    carp: VHID 7@lagg0: MASTER -> BACKUP (more frequent advertisement received)
    carp: VHID 8@lagg0: MASTER -> BACKUP (more frequent advertisement received)
    carp: VHID 9@lagg0: MASTER -> BACKUP (more frequent advertisement received)

    Any thoughts on why that might be if I can see the advertisements on both sides as in the packet capture above?



  • Dear All,

    Similar issues appeared with igb NICs on my two sets of duplicated servers, Intel(R) Atom(TM) CPU C2758 @ 2.40GHz with 16 GB RAM (reported at https://forum.pfsense.org/index.php?topic=89132.0).

    My /boot/loader.conf is:

    autoboot_delay="3"
    vm.kmem_size="536870912"
    vm.kmem_size_max="1073741824"
    kern.ipc.nmbclusters="1000000"
    comconsole_speed="9600"
    hw.usb.no_pf="1"
    hw.igb.fc_setting=0

    Regards,

    Michael Schefczyk



  • To add closure to this issue, the problem went away by resetting sysctl net.inet.carp.demotion from 240 to 0 with:

    sysctl net.inet.carp.demotion=-240

    sysctl net.inet.carp.demotion is essentially a penalty against the advskew settings. Returning this to 0 made the VIPs stable and removed the warning from the CARP status page, though it would recur following a reboot.
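
    To make the "penalty" concrete, the sketch below assumes that the kernel adds the demotion counter to the configured advskew and caps the sum at 240, and that the advertisement interval is advbase + advskew/256 seconds. The addition and the cap are assumptions that happen to match the "prio 240" seen from both firewalls in the packet capture. The function names are illustrative.

```python
CARP_MAXSKEW = 240  # assumed upper bound for the advertised skew

def effective_advskew(advskew, demotion):
    """Configured skew plus the demotion penalty, capped at CARP_MAXSKEW."""
    return min(advskew + demotion, CARP_MAXSKEW)

def advert_interval(advbase, advskew, demotion=0):
    """Seconds between advertisements: advbase + effective skew / 256."""
    return advbase + effective_advskew(advskew, demotion) / 256

# advbase 1 on both units; advskew 0 on fw1 and 100 on fw2, as configured.
for name, skew in (("fw1", 0), ("fw2", 100)):
    for demotion in (0, 240):
        eff = effective_advskew(skew, demotion)
        secs = advert_interval(1, skew, demotion)
        print(f"{name} demotion={demotion}: advskew {eff}, interval {secs:.3f}s")
```

    Under these assumptions, with demotion at 0 fw1 advertises more often than fw2 and keeps the VIPs. With demotion at 240 on both units, both land on the cap and fw1 loses its advantage.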

    According to https://forum.pfsense.org/index.php?topic=89132.msg496865#msg496865 the problem is caused when using CARP on a LAGG. When the LAGG is initialised it loses some CARP advertisements and causes net.inet.carp.demotion to be increased by the value of net.inet.carp.senderr_demotion_factor (240).

    Setting:

    net.inet.carp.senderr_demotion_factor=0

    means that this issue no longer occurs at boot time and is therefore resolved permanently.

