CARP not working after upgrade from 2.1.5 to 2.2 II


  • Dear All,

    After posting https://forum.pfsense.org/index.php?topic=87633.0 some weeks ago, I upgraded the other end of my dual-SOHO VPN setup from pfSense 2.1.5 to 2.2. Most things work, for example increasing the OpenVPN auth digest quality (thanks!).

    The machines upgraded now are also two Supermicro Intel(R) Atom(TM) CPU C2758 systems with LAGG interfaces. Before upgrading, everything worked. After upgrading, I am facing issues similar to those in my own post above and in ads76's post https://forum.pfsense.org/index.php?topic=89085.0

    The issues are:

    • While one machine reliably becomes master and the other one backup on each interface, I can no longer control which machine is master and which is backup. This used to work via the skew. It is a problem because HA sync is directed from master to backup.

    • Furthermore, split-brain situations can occur in which one machine is master for the two WAN interfaces and the other machine is master for all LAN interfaces, or maybe for just one of them. This goes away after rebooting the machines, but that is not the idea of high availability.
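
    One way to spot the split described above is to count the CARP roles on each firewall. The following is a minimal sketch, not an official pfSense tool: the function name carp_role_summary is made up for illustration, and it assumes that ifconfig prints one line per VIP of the form "carp: MASTER vhid 11 advbase 1 advskew 0", as FreeBSD 10 appears to do.

```sh
# Hypothetical helper: summarise CARP roles from ifconfig output on stdin.
# Assumes lines such as "carp: MASTER vhid 11 advbase 1 advskew 0".
carp_role_summary() {
    awk '
        # awk strips the leading tab, so the first field is "carp:".
        $1 == "carp:" { count[$2]++ }
        END {
            m = count["MASTER"] + 0
            b = count["BACKUP"] + 0
            # A node holding both roles at once is the split case.
            verdict = (m > 0 && b > 0) ? "mixed" : "consistent"
            printf "MASTER=%d BACKUP=%d %s\n", m, b, verdict
        }
    '
}
```

    Run it as "ifconfig | carp_role_summary" on both machines. A "mixed" verdict on a node means it is master for some VIPs and backup for others.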

    In my case, it does not matter whether the sync interfaces are connected with a straight cable or with a crossed cable. I seem to be getting the same results regardless. MBUF usage looks normal (27 % used). Syncing also seems to work in general: pinging via the sync interface works, configuration changes are synced, and the state table size looks similar on both machines (I cannot compare exactly, as the browser does not refresh instantly).

    Is there any advice, please?

    Regards,

    Michael


  • Hi Michael

    It does sound like we have the same issue. Yours is a 2.2 upgrade too. Are you also using LAGG interfaces?

    As I haven't solved my issue, I can't help you with yours, other than to suggest what I've already done to try to diagnose it. Hopefully, you'll resolve yours using some of the things I've tried. Note that you'll need to be logged in via SSH to do most of this. Be very careful if you're not familiar with the command line.

    Take a look here to tune your NICs. Note that many of these settings go in /boot/loader.conf.local, which you have to create if it doesn't exist. Put any edits in there rather than in /boot/loader.conf.

    https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards

    Check that CARP preempt is enabled; it should be 1 by default. If it's 0, that would explain why your addresses are getting split across the firewalls when a problem is detected, though mine is 1 and it still happens sometimes.

    sysctl net.inet.carp.preempt

    Secondly, what mode are your interfaces connected in? Autoselect? What about the other end? You can see this info in the web GUI too.

    ifconfig igb0

    Do that for all of your NICs. Take a look at your upstream and downstream devices too. Ask your colo provider if your upstream devices aren't managed by you.

    Can you see any packet errors on your interfaces?

    netstat -idb -I igb0

    Take a look on all of your NICs. Packet errors and collisions are a sign that your network links are not OK. It might be a loose cable, it might be a duplex mismatch between your firewalls and the switch you're connected to.

    Take a look at dmesg to see what happens to your CARP VIPs. This might highlight cabling issues if you see interfaces going up or down.

    dmesg | less
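
    Since the full dmesg output is long, it can help to keep only the CARP and link-state lines. A minimal sketch, assuming the kernel messages look like "carp: demoted by 240 to 240 (interface down)" and "igb3: link state changed to UP"; the function name carp_events is made up for illustration:

```sh
# Hypothetical helper: keep only CARP messages and link-state changes
# from dmesg output supplied on stdin.
carp_events() {
    grep -E 'carp: |link state changed to'
}
```

    Usage would be "dmesg | carp_events | less".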

    Run tcpdump on both firewalls simultaneously, listening for CARP packets, while you disable CARP on your current master, re-enable it, then disable it on the other firewall and re-enable it. You will see whether each side sends and receives advertisements, and writing the output to a text file lets you review it afterwards:

    tcpdump -i lagg0 -n proto CARP | tee -a /root/carp.txt

    (CTRL-c to stop.) Output to the screen seems slow while also writing to a file. Use less /root/carp.txt to read your output file (q to quit).
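
    To compare what each firewall advertises, the capture file can be condensed to one line per sender, VHID and skew. This is only a sketch: the function name carp_adv_summary is hypothetical, and it assumes tcpdump prints advertisements in the form "IP 192.0.2.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 ...". If your tcpdump decodes the packets differently, the field matching needs adjusting.

```sh
# Hypothetical helper: count CARP advertisements per source, vhid and advskew.
# Reads tcpdump text output (for example /root/carp.txt) on stdin.
carp_adv_summary() {
    awk '
        /vhid=/ {
            src = ""; vhid = ""; skew = ""
            for (i = 1; i <= NF; i++) {
                if ($i == "IP" && i < NF) src = $(i + 1)    # sender address
                if ($i ~ /^vhid=/)    vhid = substr($i, 6)  # text after "vhid="
                if ($i ~ /^advskew=/) skew = substr($i, 9)  # text after "advskew="
            }
            seen[src " vhid=" vhid " advskew=" skew]++
        }
        END { for (k in seen) print seen[k], k }
    ' | sort -k2
}
```

    Usage would be "carp_adv_summary < /root/carp.txt". Two senders advertising the same VHID at the same time, or a skew that differs from what you configured, would be worth a closer look.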

    Review dmesg again to see what happened.

    Hope that helps you a little, though if we have the same problem I suspect it won't :-\


  • Additional thought. I've noticed a great variation in ping response times between the firewalls over the WAN link (ours communicate on the WAN side via the colo provider's upstream switch ports). The colo provider uses HSRP to provide a VIP gateway address, but their switch ports have addresses of their own too, much like CARP.

    I noticed that there's great variation in response times between our firewalls over the WAN and also between the firewalls and the upstream gateway address. On average they are around 100 times slower than on our LAN side. Additionally, one of the upstream ports has the same response time range and variation as the gateway address, while the other is much faster and more consistent, not too far off our LAN response times. I have asked our colo provider to investigate. You may wish to try the same investigation yourself:

    1. Ping fw2 from fw1 and fw1 from fw2 over the WAN address
    2. Ping from fw1 and fw2 to your upstream gateway address
    3. If your upstream device provides individual addresses plus a gateway VIP, ping each of them from both firewalls and note any differences.
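
    To make the ping comparisons above easier to note down, the average round-trip time can be pulled out of ping's summary line. A minimal sketch, assuming the summary looks like "round-trip min/avg/max/stddev = 0.046/0.060/0.077/0.011 ms" as on FreeBSD; the function name ping_avg is made up for illustration:

```sh
# Hypothetical helper: print the average round-trip time (in ms)
# from ping output supplied on stdin.
ping_avg() {
    awk -F'=' '
        /min\/avg\/max/ {
            split($2, part, "/")   # part[1]=min, part[2]=avg, part[3]=max
            gsub(/ /, "", part[2])
            print part[2]
        }
    '
}
```

    Usage would be "ping -c 20 <address> | ping_avg", repeated from both firewalls for each target in the list.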

    You may find that the issue lies with slow and unreliable response times from your upstream device. I could be wrong but I have yet to hear from the colo provider.


  • Dear Ads76,

    Thank you very much for your suggestions. I could not resolve my issue either. My findings are:

    1. LAGG0 is used for the LAN interface, with all VLANs running tagged on that interface.

    2. I have IPv6 allowed under System > Advanced in general, but it is not configured on any interface.

    3. net.inet.carp.preempt=1 is active according to sysctl. To be on the safe side, I also added it to /etc/sysctl.conf, but with no effect: split brains still occur from time to time.

    4. boot/loader.conf as stated here: https://forum.pfsense.org/index.php?topic=89085.msg493521#msg493521    For igb NICs, hw.pci.enable_msix=0 is not mentioned in the tuning guide. Therefore, I did not use that parameter.

    5. MBUF usage constantly at 5 %

    6. Interfaces autoselect, but all at 1000baseT full-duplex

    7. netstat errors and collisions at zero on all interfaces

    8. dmesg | less for one of the servers appended below.

    9. tcpdump of CARP packets shows a number of multicast packets, for example on the LAN interface, when disabling and enabling CARP. Status > CARP frequently shows "CARP has detected a problem and this unit has been demoted to BACKUP status. Check link status on all interfaces with configured CARP VIPs." Nevertheless, the interfaces may well be shown as master.

    10. Pinging upstream is no problem. This is probably because the servers have dual WAN, and in front of each server there is an AVM Fritzbox modem/router connecting the pfSense server to DSL or to CATV. Thus, pinging these devices right in front of the server is quick.

    Any further advice would be highly welcome.

    Regards,

    Michael

    $ dmesg | less
    Copyright (c) 1992-2014 The FreeBSD Project.
    Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    The Regents of the University of California. All rights reserved.
    FreeBSD is a registered trademark of The FreeBSD Foundation.
    FreeBSD 10.1-RELEASE-p4 #0 36d7dec(releng/10.1)-dirty: Thu Jan 22 15:12:35 CST 2015
        root@pfsense-22-amd64-builder:/usr/obj.amd64/usr/pfSensesrc/src/sys/pfSense_SMP.10 amd64
    FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
    CPU: Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz (2400.06-MHz K8-class CPU)
      Origin = "GenuineIntel"  Id = 0x406d8  Family = 0x6  Model = 0x4d  Stepping = 8
      Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
      Features2=0x43d8e3bf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,MOVBE,POPCNT,TSCDLT,AESNI,RDRAND>
      AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
      AMD Features2=0x101<LAHF,Prefetch>
      Structured Extended Features=0x2282<TSCADJ,SMEP,ERMS>
      VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
      TSC: P-state invariant, performance statistics
    real memory  = 17179869184 (16384 MB)
    avail memory = 16567734272 (15800 MB)
    Event timer "LAPIC" quality 600
    ACPI APIC Table: <INTEL TIANO>
    FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
    FreeBSD/SMP: 1 package(s) x 8 core(s)
    cpu0 (BSP): APIC ID:  0
    cpu1 (AP): APIC ID:  2
    cpu2 (AP): APIC ID:  4
    cpu3 (AP): APIC ID:  6
    cpu4 (AP): APIC ID:  8
    cpu5 (AP): APIC ID: 10
    cpu6 (AP): APIC ID: 12
    cpu7 (AP): APIC ID: 14
    ACPI BIOS Warning (bug): Invalid length for FADT/Pm1aControlBlock: 32, using default 16 (20130823/tbfadt-682)
    ioapic0 <Version 2.0> irqs 0-23 on motherboard
    wlan: mac acl policy registered
    ipw_bss: You need to read the LICENSE file in /usr/share/doc/legal/intel_ipw/.
    ipw_bss: If you agree with the license, set legal.intel_ipw.license_ack=1 in /boot/loader.conf.
    module_register_init: MOD_LOAD (ipw_bss_fw, 0xffffffff80606c30, 0) error 1
    ipw_ibss: You need to read the LICENSE file in /usr/share/doc/legal/intel_ipw/.
    ipw_ibss: If you agree with the license, set legal.intel_ipw.license_ack=1 in /boot/loader.conf.
    module_register_init: MOD_LOAD (ipw_ibss_fw, 0xffffffff80606ce0, 0) error 1
    ipw_monitor: You need to read the LICENSE file in /usr/share/doc/legal/intel_ipw/.
    ipw_monitor: If you agree with the license, set legal.intel_ipw.license_ack=1 in /boot/loader.conf.
    module_register_init: MOD_LOAD (ipw_monitor_fw, 0xffffffff80606d90, 0) error 1
    iwi_bss: You need to read the LICENSE file in /usr/share/doc/legal/intel_iwi/.
    iwi_bss: If you agree with the license, set legal.intel_iwi.license_ack=1 in /boot/loader.conf.
    module_register_init: MOD_LOAD (iwi_bss_fw, 0xffffffff8062e400, 0) error 1
    iwi_ibss: You need to read the LICENSE file in /usr/share/doc/legal/intel_iwi/.
    iwi_ibss: If you agree with the license, set legal.intel_iwi.license_ack=1 in /boot/loader.conf.
    module_register_init: MOD_LOAD (iwi_ibss_fw, 0xffffffff8062e4b0, 0) error 1
    iwi_monitor: You need to read the LICENSE file in /usr/share/doc/legal/intel_iwi/.
    iwi_monitor: If you agree with the license, set legal.intel_iwi.license_ack=1 in /boot/loader.conf.
    module_register_init: MOD_LOAD (iwi_monitor_fw, 0xffffffff8062e560, 0) error 1
    random: <Software, Yarrow> initialized
    module_register_init: MOD_LOAD (vesa, 0xffffffff80fb8b00, 0) error 19
    kbd0 at kbdmux0
    cryptosoft0: <software crypto> on motherboard
    padlock0: No ACE support.
    acpi0: <ALASKA A M I> on motherboard
    acpi0: Power Button (fixed)
    cpu0: <ACPI CPU> on acpi0
    cpu1: <ACPI CPU> on acpi0
    cpu2: <ACPI CPU> on acpi0
    cpu3: <ACPI CPU> on acpi0
    cpu4: <ACPI CPU> on acpi0
    cpu5: <ACPI CPU> on acpi0
    cpu6: <ACPI CPU> on acpi0
    cpu7: <ACPI CPU> on acpi0
    hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
    Timecounter "HPET" frequency 14318180 Hz quality 950
    Event timer "HPET" frequency 14318180 Hz quality 350
    Event timer "HPET1" frequency 14318180 Hz quality 340
    Event timer "HPET2" frequency 14318180 Hz quality 340
    atrtc0: <AT realtime clock> port 0x70-0x77 irq 8 on acpi0
    atrtc0: Warning: Couldn't map I/O.
    Event timer "RTC" frequency 32768 Hz quality 0
    attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
    Timecounter "i8254" frequency 1193182 Hz quality 0
    Event timer "i8254" frequency 1193182 Hz quality 100
    Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
    acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
    pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
    pci0: <ACPI PCI bus> on pcib0
    pcib1: <ACPI PCI-PCI bridge> mem 0xdf6e0000-0xdf6fffff irq 16 at device 1.0 on pci0
    pci1: <ACPI PCI bus> on pcib1
    pcib2: <ACPI PCI-PCI bridge> at device 0.0 on pci1
    pci2: <ACPI PCI bus> on pcib2
    vgapci0: <VGA-compatible display> port 0xd000-0xd07f mem 0xde000000-0xdeffffff,0xdf000000-0xdf01ffff irq 16 at device 0.0 on pci2
    vgapci0: Boot video device
    pcib3: <ACPI PCI-PCI bridge> mem 0xdf6c0000-0xdf6dffff irq 16 at device 2.0 on pci0
    pci3: <ACPI PCI bus> on pcib3
    xhci0: <XHCI (generic) USB 3.0 controller> mem 0xdf500000-0xdf501fff irq 17 at device 0.0 on pci3
    xhci0: 64 byte context size.
    usbus0 on xhci0
    pcib4: <ACPI PCI-PCI bridge> mem 0xdf6a0000-0xdf6bffff irq 20 at device 3.0 on pci0
    pci4: <ACPI PCI bus> on pcib4
    igb0: <Intel(R) PRO/1000 Network Connection version - 2.4.0> port 0xc020-0xc03f mem 0xdf200000-0xdf2fffff,0xdf404000-0xdf407fff irq 22 at device 0.0 on pci4
    igb0: Using MSIX interrupts with 9 vectors
    igb0: Bound queue 0 to cpu 0
    igb0: Bound queue 1 to cpu 1
    igb0: Bound queue 2 to cpu 2
    igb0: Bound queue 3 to cpu 3
    igb0: Bound queue 4 to cpu 4
    igb0: Bound queue 5 to cpu 5
    igb0: Bound queue 6 to cpu 6
    igb0: Bound queue 7 to cpu 7
    igb1: <Intel(R) PRO/1000 Network Connection version - 2.4.0> port 0xc000-0xc01f mem 0xdf100000-0xdf1fffff,0xdf400000-0xdf403fff irq 23 at device 0.1 on pci4
    igb1: Using MSIX interrupts with 9 vectors
    igb1: Bound queue 0 to cpu 0
    igb1: Bound queue 1 to cpu 1
    igb1: Bound queue 2 to cpu 2
    igb1: Bound queue 3 to cpu 3
    igb1: Bound queue 4 to cpu 4
    igb1: Bound queue 5 to cpu 5
    igb1: Bound queue 6 to cpu 6
    igb1: Bound queue 7 to cpu 7
    pci0: <processor> at device 11.0 (no driver attached)
    pci0: <base peripheral, IOMMU> at device 15.0 (no driver attached)
    igb2: <Intel(R) PRO/1000 Network Connection version - 2.4.0> port 0xe0c0-0xe0df mem 0xdf660000-0xdf67ffff,0xdf70c000-0xdf70ffff irq 20 at device 20.0 on pci0
    igb2: Using MSIX interrupts with 9 vectors
    igb2: Bound queue 0 to cpu 0
    igb2: Bound queue 1 to cpu 1
    igb2: Bound queue 2 to cpu 2
    igb2: Bound queue 3 to cpu 3
    igb2: Bound queue 4 to cpu 4
    igb2: Bound queue 5 to cpu 5
    igb2: Bound queue 6 to cpu 6
    igb2: Bound queue 7 to cpu 7
    igb3: <Intel(R) PRO/1000 Network Connection version - 2.4.0> port 0xe0a0-0xe0bf mem 0xdf640000-0xdf65ffff,0xdf708000-0xdf70bfff irq 21 at device 20.1 on pci0
    igb3: Using MSIX interrupts with 9 vectors
    igb3: Bound queue 0 to cpu 0
    igb3: Bound queue 1 to cpu 1
    igb3: Bound queue 2 to cpu 2
    igb3: Bound queue 3 to cpu 3
    igb3: Bound queue 4 to cpu 4
    igb3: Bound queue 5 to cpu 5
    igb3: Bound queue 6 to cpu 6
    igb3: Bound queue 7 to cpu 7
    igb4: <Intel(R) PRO/1000 Network Connection version - 2.4.0> port 0xe080-0xe09f mem 0xdf620000-0xdf63ffff,0xdf704000-0xdf707fff irq 22 at device 20.2 on pci0
    igb4: Using MSIX interrupts with 9 vectors
    igb4: Bound queue 0 to cpu 0
    igb4: Bound queue 1 to cpu 1
    igb4: Bound queue 2 to cpu 2
    igb4: Bound queue 3 to cpu 3
    igb4: Bound queue 4 to cpu 4
    igb4: Bound queue 5 to cpu 5
    igb4: Bound queue 6 to cpu 6
    igb4: Bound queue 7 to cpu 7
    igb5: <Intel(R) PRO/1000 Network Connection version - 2.4.0> port 0xe060-0xe07f mem 0xdf600000-0xdf61ffff,0xdf700000-0xdf703fff irq 23 at device 20.3 on pci0
    igb5: Using MSIX interrupts with 9 vectors
    igb5: Bound queue 0 to cpu 0
    igb5: Bound queue 1 to cpu 1
    igb5: Bound queue 2 to cpu 2
    igb5: Bound queue 3 to cpu 3
    igb5: Bound queue 4 to cpu 4
    igb5: Bound queue 5 to cpu 5
    igb5: Bound queue 6 to cpu 6
    igb5: Bound queue 7 to cpu 7
    ehci0: <Intel Avoton USB 2.0 controller> mem 0xdf717000-0xdf7173ff irq 23 at device 22.0 on pci0
    usbus1: EHCI version 1.0
    usbus1 on ehci0
    ahci0: <Intel Avoton AHCI SATA controller> port 0xe150-0xe157,0xe140-0xe143,0xe130-0xe137,0xe120-0xe123,0xe040-0xe05f mem 0xdf716000-0xdf7167ff irq 19 at device 23.0 on pci0
    ahci0: AHCI v1.30 with 4 3Gbps ports, Port Multiplier not supported
    ahcich0: <AHCI channel> at channel 0 on ahci0
    ahcich1: <AHCI channel> at channel 1 on ahci0
    ahcich2: <AHCI channel> at channel 2 on ahci0
    ahcich3: <AHCI channel> at channel 3 on ahci0
    ahci1: <Intel Avoton AHCI SATA controller> port 0xe110-0xe117,0xe100-0xe103,0xe0f0-0xe0f7,0xe0e0-0xe0e3,0xe020-0xe03f mem 0xdf715000-0xdf7157ff irq 19 at device 24.0 on pci0
    ahci1: AHCI v1.30 with 2 6Gbps ports, Port Multiplier not supported
    ahcich4: <AHCI channel> at channel 0 on ahci1
    ahcich5: <AHCI channel> at channel 1 on ahci1
    isab0: <PCI-ISA bridge> at device 31.0 on pci0
    isa0: <ISA bus> on isab0
    uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
    uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0
    orm0: <ISA Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xc8fff,0xc9000-0xc9fff,0xca000-0xcafff on isa0
    sc0: <System console> at flags 0x100 on isa0
    sc0: CGA <16 virtual consoles, flags=0x300>
    vga0: <Generic ISA VGA> at port 0x3d0-0x3db iomem 0xb8000-0xbffff on isa0
    ppc0: cannot reserve I/O port range
    est0: <Enhanced SpeedStep Frequency Control> on cpu0
    p4tcc0: <CPU Frequency Thermal Control> on cpu0
    est1: <Enhanced SpeedStep Frequency Control> on cpu1
    p4tcc1: <CPU Frequency Thermal Control> on cpu1
    est2: <Enhanced SpeedStep Frequency Control> on cpu2
    p4tcc2: <CPU Frequency Thermal Control> on cpu2
    est3: <Enhanced SpeedStep Frequency Control> on cpu3
    p4tcc3: <CPU Frequency Thermal Control> on cpu3
    est4: <Enhanced SpeedStep Frequency Control> on cpu4
    p4tcc4: <CPU Frequency Thermal Control> on cpu4
    est5: <Enhanced SpeedStep Frequency Control> on cpu5
    p4tcc5: <CPU Frequency Thermal Control> on cpu5
    est6: <Enhanced SpeedStep Frequency Control> on cpu6
    p4tcc6: <CPU Frequency Thermal Control> on cpu6
    est7: <Enhanced SpeedStep Frequency Control> on cpu7
    p4tcc7: <CPU Frequency Thermal Control> on cpu7
    Timecounters tick every 1.000 msec
    IPsec: Initialized Security Association Processing.
    random: unblocking device.
    usbus0: 5.0Gbps Super Speed USB v3.0
    usbus1: 480Mbps High Speed USB v2.0
    ugen1.1: <Intel> at usbus1
    uhub0: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
    ugen0.1: <0x1912> at usbus0
    uhub1: <0x1912 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
    uhub1: 8 ports with 8 removable, self powered
    uhub0: 8 ports with 8 removable, self powered
    ugen1.2: <vendor 0x8087> at usbus1
    uhub2: <vendor 0x8087 product 0x07db, class 9/0, rev 2.00/0.02, addr 2> on usbus1
    uhub2: 4 ports with 4 removable, self powered
    ugen1.3: <vendor 0x0000> at usbus1
    uhub3: <vendor 0x0000 product 0x0001, class 9/0, rev 2.00/0.00, addr 3> on usbus1
    uhub3: 4 ports with 3 removable, self powered
    ugen1.4: <vendor 0x0557> at usbus1
    ukbd0: <vendor 0x0557 product 0x2419, class 0/0, rev 1.10/1.00, addr 4> on usbus1
    kbd1 at ukbd0
    ada0 at ahcich4 bus 0 scbus4 target 0 lun 0
    ada0: <HGST HTS541010A9E680 JA0OA560> ATA-8 SATA 3.x device
    ada0: Serial Number JD1000191D6D8N
    ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
    ada0: Command Queueing enabled
    ada0: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
    ada0: Previously was known as ad12
    ada1 at ahcich5 bus 0 scbus5 target 0 lun 0
    ada1: <HGST HTS541010A9E680 JA0OA560> ATA-8 SATA 3.x device
    ada1: Serial Number JD10001V0U8MRM
    ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
    ada1: Command Queueing enabled
    ada1: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
    ada1: Previously was known as ad14
    SMP: AP CPU #4 Launched!
    SMP: AP CPU #3 Launched!
    SMP: AP CPU #5 Launched!
    SMP: AP CPU #7 Launched!
    SMP: AP CPU #2 Launched!
    SMP: AP CPU #1 Launched!
    SMP: AP CPU #6 Launched!
    Timecounter "TSC-low" frequency 1200029616 Hz quality 1000
    GEOM_MIRROR: Device mirror/pfSenseMirror launched (2/2).
    Trying to mount root from ufs:/dev/mirror/pfSenseMirrors1a [rw]...
    padlock0: No ACE support.
    aesni0: <AES-CBC,AES-XTS,AES-GCM> on motherboard
    lagg0: IPv6 addresses on igb1 have been removed before adding it as a member to prevent IPv6 address scope violation.
    lagg0: link state changed to DOWN
    lagg0: IPv6 addresses on igb4 have been removed before adding it as a member to prevent IPv6 address scope violation.
    lagg0: IPv6 addresses on igb5 have been removed before adding it as a member to prevent IPv6 address scope violation.
    vlan0: changing name to 'lagg0_vlan112'
    vlan1: changing name to 'lagg0_vlan16'
    vlan2: changing name to 'lagg0_vlan15'
    igb3: link state changed to UP
    igb3: link state changed to DOWN
    igb0: promiscuous mode enabled
    carp: demoted by 240 to 240 (interface down)
    igb3: promiscuous mode enabled
    carp: demoted by 240 to 480 (interface down)
    igb5: promiscuous mode enabled
    igb4: promiscuous mode enabled
    igb1: promiscuous mode enabled
    lagg0: promiscuous mode enabled
    carp: demoted by 240 to 720 (interface down)
    lagg0_vlan15: promiscuous mode enabled
    carp: demoted by 240 to 960 (interface down)
    lagg0_vlan16: promiscuous mode enabled
    carp: demoted by 240 to 1200 (interface down)
    lagg0_vlan112: promiscuous mode enabled
    carp: demoted by 240 to 1440 (interface down)
    carp: demoted by 240 to 1680 (pfsync bulk start)
    igb1: link state changed to UP
    carp: VHID 13@lagg0: INIT -> BACKUP
    carp: demoted by -240 to 1440 (interface up)
    lagg0: link state changed to UP
    carp: VHID 15@lagg0_vlan16: INIT -> BACKUP
    carp: demoted by -240 to 1200 (interface up)
    lagg0_vlan16: link state changed to UP
    carp: VHID 16@lagg0_vlan112: INIT -> BACKUP
    carp: demoted by -240 to 960 (interface up)
    lagg0_vlan112: link state changed to UP
    carp: VHID 14@lagg0_vlan15: INIT -> BACKUP
    carp: demoted by -240 to 720 (interface up)
    lagg0_vlan15: link state changed to UP
    igb4: link state changed to UP
    tun1: changing name to 'ovpns1'
    igb5: link state changed to UP
    tun2: changing name to 'ovpnc2'
    tun3: changing name to 'ovpnc3'
    carp: VHID 11@igb0: INIT -> BACKUP
    carp: demoted by -240 to 480 (interface up)
    igb0: link state changed to UP
    pflog0: promiscuous mode enabled
    ovpns1: link state changed to UP
    carp: VHID 12@igb3: INIT -> BACKUP
    carp: demoted by -240 to 240 (interface up)
    igb3: link state changed to UP
    igb2: link state changed to UP
    carp: demoted by -240 to 0 (pfsync bulk done)
    carp: VHID 14@lagg0_vlan15: BACKUP -> MASTER (master down)
    carp: VHID 16@lagg0_vlan112: BACKUP -> MASTER (master down)
    carp: VHID 15@lagg0_vlan16: BACKUP -> MASTER (master down)
    carp: VHID 13@lagg0: BACKUP -> MASTER (master down)
    carp: VHID 11@igb0: BACKUP -> MASTER (preempting a slower master)
    carp: VHID 12@igb3: BACKUP -> MASTER (preempting a slower master)
    ipfw2 (+ipv6) initialized, divert loadable, nat loadable, default to accept, logging disabled
    DUMMYNET 0 with IPv6 initialized (100409)
    load_dn_sched dn_sched FIFO loaded
    load_dn_sched dn_sched QFQ loaded
    load_dn_sched dn_sched RR loaded
    load_dn_sched dn_sched WF2Q+ loaded
    load_dn_sched dn_sched PRIO loaded
    ovpnc3: link state changed to UP
    ovpnc2: link state changed to UP
    ovpns1: link state changed to DOWN
    ovpns1: link state changed to UP
    ovpns1: link state changed to DOWN
    ovpns1: link state changed to UP


  • Hi Michael

    I don't see any obvious problems with what you provided though I'm not familiar with using AVM Fritzbox modems/routers to connect to DSL or CATV.

    I've not been able to test behaviour since last week but my problem appears resolved after resetting net.inet.carp.demotion to 0. The VIPs have remained stable on fw1 and the warnings have disappeared from the CARP status page.

    Can you paste the output of the following:

    sysctl net.inet.carp

    In my case, net.inet.carp.demotion was 240, which should have been 0 if everything was OK. I wonder if somebody enabled persistent maintenance mode, but the web GUI didn't reflect that.

    If net.inet.carp.demotion isn't 0 for you (in your dmesg output, it looks like the last value was 0), reset it to 0 by adding a negative value equal to however far it is from 0. In my case that was 240, so I used:

    sysctl net.inet.carp.demotion=-240

    If yours is 480 for example, use -480.
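
    The arithmetic above can be sketched in shell. This is illustrative only: the function names last_demotion and demotion_offset are made up, and the sketch relies on the behaviour described in this post, namely that the value written to net.inet.carp.demotion is added to the current counter rather than replacing it.

```sh
# Hypothetical helper: print the demotion counter reported by the last
# "carp: demoted by X to Y (...)" line in dmesg output on stdin.
last_demotion() {
    awk '
        /carp: demoted by/ {
            for (i = 1; i <= NF; i++)
                if ($i == "to") value = $(i + 1)   # the number after "to"
        }
        END { if (value != "") print value }
    '
}

# Hypothetical helper: given the current counter, print the adjustment
# that brings it back to 0 (240 gives -240, 480 gives -480).
demotion_offset() {
    echo $(( 0 - $1 ))
}

# Possible use on the firewall (assumption: sysctl -n prints the bare value):
#   sysctl net.inet.carp.demotion="$(demotion_offset "$(sysctl -n net.inet.carp.demotion)")"
```

    "dmesg | last_demotion" shows where the counter ended up after boot. If it is already 0, no adjustment is needed.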

    I intend to test failover behaviour before confirming my issues are resolved.


  • Michael: I think the root of your problem is with lagg. Similar to this:
    https://lists.freebsd.org/pipermail/freebsd-net/2015-January/040813.html

    2.2.1 will default to net.inet.carp.senderr_demotion_factor=0 for this reason. We didn't see anything where this would offer any benefits for our use cases, and it definitely fixes a potential issue there with lagg.

    You can set that in System Tunables in the meantime for the same end result.


  • Dear Christopher,

    Thank you very much!! Adding the tunable did solve the problem. I rebooted eight times and experienced no more split-brain situations. As with 2.1.5, the machine designated as CARP master was master for all networks after every reboot, as long as it was on. Before adding the tunable, I needed to reboot about eight times to end up without a split-brain situation.

    I did make two more observations which may be relevant:

    • One of my firewall pairs is connected to a stacked switch. Of the LAGG's three members, two cables are connected to one switch in the stack and one to the other switch. In that setup, CARP issues occurred more frequently without the tunable. Maybe the switch interfaces come up and go down slightly more slowly due to stack coordination. At my other pair of firewalls, all three LAGG member cables go to the same switch, as there is only one due to rack space limitations. There, split-brain situations did occur without the tunable, but less frequently.
    • After adding the tunable, starting quagga did not work on the backup switch one time, but without practical consequences. Other than that, starting and stopping quagga works again after adding the tunable.

    In general, I feel that a human readable text about CARP changes in 2.2 similar to the examples in the draft 2.1 book would be very helpful. For example, I am still banging my head to get captive portal running on a CARP / LAGG interface again after upgrading to 2.2 (https://forum.pfsense.org/index.php?topic=87991.msg495896#msg495896). Without understanding the changes, that is hard to do.

    Regards,

    Michael