CARP not working after upgrade from 2.1.5 to 2.2 II
-
Dear All,
After posting https://forum.pfsense.org/index.php?topic=87633.0 some weeks ago, I upgraded the other end of my dual-SOHO VPN setup from pfSense 2.1.5 to 2.2. Most things work, for example increasing the OpenVPN auth digest quality (thanks!).
The machines upgraded this time are again two Supermicro boxes with Intel(R) Atom(TM) CPU C2758 and LAGG interfaces. Before upgrading, everything worked. After upgrading, I am facing issues similar to those in my own post above and in ads76's post https://forum.pfsense.org/index.php?topic=89085.0
The issues are:
- While one machine reliably becomes master and the other one backup on each interface, I can no longer control which machine is master and which is backup. This used to work via the skew. This is problematic, as HA sync is directed from master to backup.
- Furthermore, split brain situations can occur in which one machine is master for the two WAN interfaces and the other machine is master for all LAN interfaces, or maybe for just one of them. This goes away after rebooting the machines, but that is not the idea of high availability.
In my case, it does not matter whether the sync interfaces are connected with a straight or a crossed cable; I get the same results either way. MBUF usage looks normal (27 % used). Syncing also seems to work in general: pinging via the sync interface works, configuration changes are synced, and the state table size looks similar on both machines (I cannot compare exactly, as the browser does not refresh instantly).
Is there any advice, please?
Regards,
Michael
-
-
Hi Michael
It does sound like we have the same issue. Yours is a 2.2 upgrade too, are you also using LAGG interfaces?
As I haven't solved my issue, I can't help you with yours, other than to suggest what I've already done to try to diagnose it. Hopefully, you'll resolve yours using some of the things I've tried. Note you'll need to be logged in using SSH to do most of this stuff. Be very careful if you're not familiar with the command line.
Take a look here to tune your NICs, note many of these go in /boot/loader.conf.local. You have to create it if it doesn't exist. You should put any edits in there rather than /boot/loader.conf.
https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards
Check that CARP preempt is enabled, it should be (1) by default. If it's 0, that would explain why your addresses are getting split across the firewalls when a problem is detected, though mine is 1 and it still happens sometimes.
sysctl net.inet.carp.preempt
Secondly, what mode are your interfaces connected in? Autoselect? What about the other end? You can see this info in the web GUI too.
ifconfig igb0
Do that for all of your NICs. Take a look at your upstream and downstream devices too. Ask your colo provider if your upstream devices aren't managed by you.
Can you see any packet errors on your interfaces?
netstat -idb -I igb0
Take a look on all of your NICs. Packet errors and collisions are a sign that your network links are not OK. It might be a loose cable, it might be a duplex mismatch between your firewalls and the switch you're connected to.
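If you have a lot of NICs, a small filter saves reading every row by eye. This is only a rough sketch: it assumes the header row names the counters Ierrs, Oerrs and Coll (as netstat -i normally does), and the function name iface_errors is made up for this example.

```sh
# Reads netstat -i style output on stdin and prints every interface
# counter (Ierrs, Oerrs, Coll) that is not zero.
# Assumption: the first line is the header row naming the columns.
iface_errors() {
    awk '
        NR == 1 {
            # Map each header name to its field position.
            for (i = 1; i <= NF; i++) col[$i] = i
            ncols = NF
            next
        }
        # Skip rows whose field count differs from the header (for
        # example a blank Address column), so that a counter is never
        # read from the wrong field.
        NF != ncols { next }
        {
            n = split("Ierrs Oerrs Coll", want, " ")
            for (k = 1; k <= n; k++) {
                c = want[k]
                if ((c in col) && $(col[c]) + 0 > 0)
                    print $1, c "=" $(col[c])
            }
        }
    '
}
```

On the firewall this would be used as netstat -idb | iface_errors. No output means none of the rows it could read had a non-zero counter.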
Take a look at dmesg to see what happens to your CARP VIPs, this might highlight any cabling issues if you see interfaces going up or down.
dmesg | less
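The full dmesg is long, so it can help to cut it down to the CARP and link state lines. A minimal sketch, assuming the kernel prints CARP messages with a "carp: " prefix and link changes as "link state changed to UP/DOWN" (which is what 2.2 appears to do); carp_events is just a name picked for this example.

```sh
# Keeps only CARP messages and interface link state changes from
# kernel log text read on stdin.
carp_events() {
    grep -E '^carp: |: link state changed to (UP|DOWN)$'
}
```

Typical use would be dmesg | carp_events | less, which shows demotions and state transitions next to the link flaps that may have caused them.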
Run tcpdump on both firewalls simultaneously, listening for CARP packets, while you disable CARP on your current master, re-enable it, then disable it on the other firewall and re-enable it. You will see whether each side sends and receives advertisements, and writing the output to a text file lets you review it afterwards:
tcpdump -i lagg0 -n proto CARP | tee -a /root/carp.txt
(CTRL-c to stop.) Output to the screen seems slow while also writing to the file. Use less /root/carp.txt to read the output file (q to quit).
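To see at a glance which firewall was advertising, the saved capture can be summarised per source address. This is a sketch under assumptions: it relies on tcpdump's usual "source > destination:" layout with -n, and on IPv4 CARP advertisements being sent to the multicast address 224.0.0.18. The function name adverts_per_source is invented for this example.

```sh
# Reads saved tcpdump -n text on stdin and counts, per source address,
# the packets sent to the CARP multicast group 224.0.0.18.
adverts_per_source() {
    awk '
        {
            # Find the "src > dst:" triple wherever it sits on the line.
            for (i = 1; i < NF - 1; i++)
                if ($(i + 1) == ">" && $(i + 2) ~ /^224\.0\.0\.18:?$/)
                    count[$i]++
        }
        END { for (src in count) print src, count[src] }
    ' | sort
}
```

For example adverts_per_source < /root/carp.txt on each firewall. Comparing the two summaries shows whether each side actually saw the other's advertisements during the test.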
Review dmesg again to see what happened.
Hope that helps you a little, though if we have the same problem I suspect it won't :-\
-
Additional thought. I've noticed a great variation in ping response times between the firewalls over the WAN link (ours communicate on the WAN side via the colo provider's upstream switch ports). The colo provider uses HSRP to provide a VIP gateway address, but their switchports have addresses of their own too, much like CARP.
I noticed that there's great variation in response times between our firewalls over the WAN and also between the firewalls and the upstream gateway address. They are much slower on average than on our LAN side, by a factor of around 100. Additionally, one of the upstream ports has the same response time range and variation as the gateway address, while the other is much faster and more consistent, not too far off our LAN response times. I have asked our colo provider to investigate. You may wish to try the same investigation yourself:
- Ping fw2 from fw1 and fw1 from fw2 over the WAN address
- Ping from fw1 and fw2 to your upstream gateway address
- If your upstream device provides individual addresses plus a gateway VIP, ping each of them from both firewalls and note any differences.
You may find that the issue lies with slow and unreliable response times from your upstream device. I could be wrong but I have yet to hear from the colo provider.
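To compare the targets above without eyeballing every reply, the average can be pulled out of ping's summary line. A small sketch, assuming the summary has the usual "min/avg/max/... = a/b/c/d ms" form; ping_avg_ms is a made-up name.

```sh
# Reads ping output on stdin and prints the average round-trip time
# in milliseconds, taken from the min/avg/max summary line.
ping_avg_ms() {
    awk -F'=' '/min\/avg\/max/ {
        # $2 looks like " 0.204/0.251/0.312/0.040 ms"; the second
        # slash-separated value is the average.
        split($2, rtt, "/")
        print rtt[2]
    }'
}
```

For example ping -c 20 ADDRESS | ping_avg_ms, run from both firewalls against each of the addresses in the list, gives one number per target to compare.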
-
Dear Ads76,
Thank you very much for your suggestions. I could not resolve my issue either. My findings are:
1. LAGG0 is used for the LAN interface, with all VLANs running tagged on that interface.
2. I have IPv6 allowed under System > Advanced in general, but it is not configured on any interface.
3. net.inet.carp.preempt=1 is active in sysctl. To be on the safe side, I also added it to /etc/sysctl.conf, but with no effect: split brains still occur from time to time.
4. /boot/loader.conf as stated here: https://forum.pfsense.org/index.php?topic=89085.msg493521#msg493521 For igb NICs, hw.pci.enable_msix=0 is not mentioned in the tuning guide. Therefore, I did not use that parameter.
5. MBUF usage constantly at 5 %
6. Interfaces autoselect, but all at 1000baseT full-duplex
7. netstat errors and collisions at zero on all interfaces
8. dmesg | less for one of the servers appended below.
9. tcpdump of CARP packets shows a number of multicast packets, for example on the LAN interface, when disabling and enabling CARP. The CARP status page frequently shows "CARP has detected a problem and this unit has been demoted to BACKUP status. Check link status on all interfaces with configured CARP VIPs." Nevertheless, the interfaces may well be shown as master.
10. Pinging upstream is no problem. This is probably because the servers have dual WAN and in front of each server there is one AVM Fritzbox modem/router to connect the pfSense server to DSL or to CATV. Thus, pinging these devices right in front of the server is quick.
Any further advice would be highly welcome.
Regards,
Michael
$ dmesg | less
Copyright (c) 1992-2014 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 10.1-RELEASE-p4 #0 36d7dec(releng/10.1)-dirty: Thu Jan 22 15:12:35 CST 2015
root@pfsense-22-amd64-builder:/usr/obj.amd64/usr/pfSensesrc/src/sys/pfSense_SMP.10 amd64
FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
CPU: Intel(R) Atom(TM) CPU C2758 @ 2.40GHz (2400.06-MHz K8-class CPU)
Origin = "GenuineIntel" Id = 0x406d8 Family = 0x6 Model = 0x4d Stepping = 8
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0x43d8e3bf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,MOVBE,POPCNT,TSCDLT,AESNI,RDRAND>
AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
AMD Features2=0x101<LAHF,Prefetch>
Structured Extended Features=0x2282<TSCADJ,SMEP,ERMS>
VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
TSC: P-state invariant, performance statistics
real memory = 17179869184 (16384 MB)
avail memory = 16567734272 (15800 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <INTEL TIANO>
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 1 package(s) x 8 core(s)
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 2
cpu2 (AP): APIC ID: 4
cpu3 (AP): APIC ID: 6
cpu4 (AP): APIC ID: 8
cpu5 (AP): APIC ID: 10
cpu6 (AP): APIC ID: 12
cpu7 (AP): APIC ID: 14
ACPI BIOS Warning (bug): Invalid length for FADT/Pm1aControlBlock: 32, using default 16 (20130823/tbfadt-682)
ioapic0 <Version 2.0> irqs 0-23 on motherboard
wlan: mac acl policy registered
ipw_bss: You need to read the LICENSE file in /usr/share/doc/legal/intel_ipw/.
ipw_bss: If you agree with the license, set legal.intel_ipw.license_ack=1 in /boot/loader.conf.
module_register_init: MOD_LOAD (ipw_bss_fw, 0xffffffff80606c30, 0) error 1
ipw_ibss: You need to read the LICENSE file in /usr/share/doc/legal/intel_ipw/.
ipw_ibss: If you agree with the license, set legal.intel_ipw.license_ack=1 in /boot/loader.conf.
module_register_init: MOD_LOAD (ipw_ibss_fw, 0xffffffff80606ce0, 0) error 1
ipw_monitor: You need to read the LICENSE file in /usr/share/doc/legal/intel_ipw/.
ipw_monitor: If you agree with the license, set legal.intel_ipw.license_ack=1 in /boot/loader.conf.
module_register_init: MOD_LOAD (ipw_monitor_fw, 0xffffffff80606d90, 0) error 1
iwi_bss: You need to read the LICENSE file in /usr/share/doc/legal/intel_iwi/.
iwi_bss: If you agree with the license, set legal.intel_iwi.license_ack=1 in /boot/loader.conf.
module_register_init: MOD_LOAD (iwi_bss_fw, 0xffffffff8062e400, 0) error 1
iwi_ibss: You need to read the LICENSE file in /usr/share/doc/legal/intel_iwi/.
iwi_ibss: If you agree with the license, set legal.intel_iwi.license_ack=1 in /boot/loader.conf.
module_register_init: MOD_LOAD (iwi_ibss_fw, 0xffffffff8062e4b0, 0) error 1
iwi_monitor: You need to read the LICENSE file in /usr/share/doc/legal/intel_iwi/.
iwi_monitor: If you agree with the license, set legal.intel_iwi.license_ack=1 in /boot/loader.conf.
module_register_init: MOD_LOAD (iwi_monitor_fw, 0xffffffff8062e560, 0) error 1
random: <Software, Yarrow> initialized
module_register_init: MOD_LOAD (vesa, 0xffffffff80fb8b00, 0) error 19
kbd0 at kbdmux0
cryptosoft0: <software crypto> on motherboard
padlock0: No ACE support.
acpi0: <ALASKA A M I> on motherboard
acpi0: Power Button (fixed)
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
cpu4: <ACPI CPU> on acpi0
cpu5: <ACPI CPU> on acpi0
cpu6: <ACPI CPU> on acpi0
cpu7: <ACPI CPU> on acpi0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 350
Event timer "HPET1" frequency 14318180 Hz quality 340
Event timer "HPET2" frequency 14318180 Hz quality 340
atrtc0: <AT realtime clock> port 0x70-0x77 irq 8 on acpi0
atrtc0: Warning: Couldn't map I/O.
Event timer "RTC" frequency 32768 Hz quality 0
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> mem 0xdf6e0000-0xdf6fffff irq 16 at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> at device 0.0 on pci1
pci2: <ACPI PCI bus> on pcib2
vgapci0: <VGA-compatible display> port 0xd000-0xd07f mem 0xde000000-0xdeffffff,0xdf000000-0xdf01ffff irq 16 at device 0.0 on pci2
vgapci0: Boot video device
pcib3: <ACPI PCI-PCI bridge> mem 0xdf6c0000-0xdf6dffff irq 16 at device 2.0 on pci0
pci3: <ACPI PCI bus> on pcib3
xhci0: <XHCI (generic) USB 3.0 controller> mem 0xdf500000-0xdf501fff irq 17 at device 0.0 on pci3
xhci0: 64 byte context size.
usbus0 on xhci0
pcib4: <ACPI PCI-PCI bridge> mem 0xdf6a0000-0xdf6bffff irq 20 at device 3.0 on pci0
pci4: <ACPI PCI bus> on pcib4
igb0: <Intel(R) PRO/1000 Network Connection version - 2.4.0> port 0xc020-0xc03f mem 0xdf200000-0xdf2fffff,0xdf404000-0xdf407fff irq 22 at device 0.0 on pci4
igb0: Using MSIX interrupts with 9 vectors
igb0: Bound queue 0 to cpu 0
igb0: Bound queue 1 to cpu 1
igb0: Bound queue 2 to cpu 2
igb0: Bound queue 3 to cpu 3
igb0: Bound queue 4 to cpu 4
igb0: Bound queue 5 to cpu 5
igb0: Bound queue 6 to cpu 6
igb0: Bound queue 7 to cpu 7
igb1: <Intel(R) PRO/1000 Network Connection version - 2.4.0> port 0xc000-0xc01f mem 0xdf100000-0xdf1fffff,0xdf400000-0xdf403fff irq 23 at device 0.1 on pci4
igb1: Using MSIX interrupts with 9 vectors
igb1: Bound queue 0 to cpu 0
igb1: Bound queue 1 to cpu 1
igb1: Bound queue 2 to cpu 2
igb1: Bound queue 3 to cpu 3
igb1: Bound queue 4 to cpu 4
igb1: Bound queue 5 to cpu 5
igb1: Bound queue 6 to cpu 6
igb1: Bound queue 7 to cpu 7
pci0: <processor> at device 11.0 (no driver attached)
pci0: <base peripheral, IOMMU> at device 15.0 (no driver attached)
igb2: <Intel(R) PRO/1000 Network Connection version - 2.4.0> port 0xe0c0-0xe0df mem 0xdf660000-0xdf67ffff,0xdf70c000-0xdf70ffff irq 20 at device 20.0 on pci0
igb2: Using MSIX interrupts with 9 vectors
igb2: Bound queue 0 to cpu 0
igb2: Bound queue 1 to cpu 1
igb2: Bound queue 2 to cpu 2
igb2: Bound queue 3 to cpu 3
igb2: Bound queue 4 to cpu 4
igb2: Bound queue 5 to cpu 5
igb2: Bound queue 6 to cpu 6
igb2: Bound queue 7 to cpu 7
igb3: <Intel(R) PRO/1000 Network Connection version - 2.4.0> port 0xe0a0-0xe0bf mem 0xdf640000-0xdf65ffff,0xdf708000-0xdf70bfff irq 21 at device 20.1 on pci0
igb3: Using MSIX interrupts with 9 vectors
igb3: Bound queue 0 to cpu 0
igb3: Bound queue 1 to cpu 1
igb3: Bound queue 2 to cpu 2
igb3: Bound queue 3 to cpu 3
igb3: Bound queue 4 to cpu 4
igb3: Bound queue 5 to cpu 5
igb3: Bound queue 6 to cpu 6
igb3: Bound queue 7 to cpu 7
igb4: <Intel(R) PRO/1000 Network Connection version - 2.4.0> port 0xe080-0xe09f mem 0xdf620000-0xdf63ffff,0xdf704000-0xdf707fff irq 22 at device 20.2 on pci0
igb4: Using MSIX interrupts with 9 vectors
igb4: Bound queue 0 to cpu 0
igb4: Bound queue 1 to cpu 1
igb4: Bound queue 2 to cpu 2
igb4: Bound queue 3 to cpu 3
igb4: Bound queue 4 to cpu 4
igb4: Bound queue 5 to cpu 5
igb4: Bound queue 6 to cpu 6
igb4: Bound queue 7 to cpu 7
igb5: <Intel(R) PRO/1000 Network Connection version - 2.4.0> port 0xe060-0xe07f mem 0xdf600000-0xdf61ffff,0xdf700000-0xdf703fff irq 23 at device 20.3 on pci0
igb5: Using MSIX interrupts with 9 vectors
igb5: Bound queue 0 to cpu 0
igb5: Bound queue 1 to cpu 1
igb5: Bound queue 2 to cpu 2
igb5: Bound queue 3 to cpu 3
igb5: Bound queue 4 to cpu 4
igb5: Bound queue 5 to cpu 5
igb5: Bound queue 6 to cpu 6
igb5: Bound queue 7 to cpu 7
ehci0: <Intel Avoton USB 2.0 controller> mem 0xdf717000-0xdf7173ff irq 23 at device 22.0 on pci0
usbus1: EHCI version 1.0
usbus1 on ehci0
ahci0: <Intel Avoton AHCI SATA controller> port 0xe150-0xe157,0xe140-0xe143,0xe130-0xe137,0xe120-0xe123,0xe040-0xe05f mem 0xdf716000-0xdf7167ff irq 19 at device 23.0 on pci0
ahci0: AHCI v1.30 with 4 3Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahci1: <Intel Avoton AHCI SATA controller> port 0xe110-0xe117,0xe100-0xe103,0xe0f0-0xe0f7,0xe0e0-0xe0e3,0xe020-0xe03f mem 0xdf715000-0xdf7157ff irq 19 at device 24.0 on pci0
ahci1: AHCI v1.30 with 2 6Gbps ports, Port Multiplier not supported
ahcich4: <AHCI channel> at channel 0 on ahci1
ahcich5: <AHCI channel> at channel 1 on ahci1
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0
orm0: <ISA Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xc8fff,0xc9000-0xc9fff,0xca000-0xcafff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: CGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3d0-0x3db iomem 0xb8000-0xbffff on isa0
ppc0: cannot reserve I/O port range
est0: <Enhanced SpeedStep Frequency Control> on cpu0
p4tcc0: <CPU Frequency Thermal Control> on cpu0
est1: <Enhanced SpeedStep Frequency Control> on cpu1
p4tcc1: <CPU Frequency Thermal Control> on cpu1
est2: <Enhanced SpeedStep Frequency Control> on cpu2
p4tcc2: <CPU Frequency Thermal Control> on cpu2
est3: <Enhanced SpeedStep Frequency Control> on cpu3
p4tcc3: <CPU Frequency Thermal Control> on cpu3
est4: <Enhanced SpeedStep Frequency Control> on cpu4
p4tcc4: <CPU Frequency Thermal Control> on cpu4
est5: <Enhanced SpeedStep Frequency Control> on cpu5
p4tcc5: <CPU Frequency Thermal Control> on cpu5
est6: <Enhanced SpeedStep Frequency Control> on cpu6
p4tcc6: <CPU Frequency Thermal Control> on cpu6
est7: <Enhanced SpeedStep Frequency Control> on cpu7
p4tcc7: <CPU Frequency Thermal Control> on cpu7
Timecounters tick every 1.000 msec
IPsec: Initialized Security Association Processing.
random: unblocking device.
usbus0: 5.0Gbps Super Speed USB v3.0
usbus1: 480Mbps High Speed USB v2.0
ugen1.1: <Intel> at usbus1
uhub0: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
ugen0.1: <0x1912> at usbus0
uhub1: <0x1912 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
uhub1: 8 ports with 8 removable, self powered
uhub0: 8 ports with 8 removable, self powered
ugen1.2: <vendor 0x8087> at usbus1
uhub2: <vendor 0x8087 product 0x07db, class 9/0, rev 2.00/0.02, addr 2> on usbus1
uhub2: 4 ports with 4 removable, self powered
ugen1.3: <vendor 0x0000> at usbus1
uhub3: <vendor 0x0000 product 0x0001, class 9/0, rev 2.00/0.00, addr 3> on usbus1
uhub3: 4 ports with 3 removable, self powered
ugen1.4: <vendor 0x0557> at usbus1
ukbd0: <vendor 0x0557 product 0x2419, class 0/0, rev 1.10/1.00, addr 4> on usbus1
kbd1 at ukbd0
ada0 at ahcich4 bus 0 scbus4 target 0 lun 0
ada0: <HGST HTS541010A9E680 JA0OA560> ATA-8 SATA 3.x device
ada0: Serial Number JD1000191D6D8N
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad12
ada1 at ahcich5 bus 0 scbus5 target 0 lun 0
ada1: <HGST HTS541010A9E680 JA0OA560> ATA-8 SATA 3.x device
ada1: Serial Number JD10001V0U8MRM
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada1: Previously was known as ad14
SMP: AP CPU #4 Launched!
SMP: AP CPU #3 Launched!
SMP: AP CPU #5 Launched!
SMP: AP CPU #7 Launched!
SMP: AP CPU #2 Launched!
SMP: AP CPU #1 Launched!
SMP: AP CPU #6 Launched!
Timecounter "TSC-low" frequency 1200029616 Hz quality 1000
GEOM_MIRROR: Device mirror/pfSenseMirror launched (2/2).
Trying to mount root from ufs:/dev/mirror/pfSenseMirrors1a [rw]…
padlock0: No ACE support.
aesni0: <AES-CBC,AES-XTS,AES-GCM> on motherboard
lagg0: IPv6 addresses on igb1 have been removed before adding it as a member to prevent IPv6 address scope violation.
lagg0: link state changed to DOWN
lagg0: IPv6 addresses on igb4 have been removed before adding it as a member to prevent IPv6 address scope violation.
lagg0: IPv6 addresses on igb5 have been removed before adding it as a member to prevent IPv6 address scope violation.
vlan0: changing name to 'lagg0_vlan112'
vlan1: changing name to 'lagg0_vlan16'
vlan2: changing name to 'lagg0_vlan15'
igb3: link state changed to UP
igb3: link state changed to DOWN
igb0: promiscuous mode enabled
carp: demoted by 240 to 240 (interface down)
igb3: promiscuous mode enabled
carp: demoted by 240 to 480 (interface down)
igb5: promiscuous mode enabled
igb4: promiscuous mode enabled
igb1: promiscuous mode enabled
lagg0: promiscuous mode enabled
carp: demoted by 240 to 720 (interface down)
lagg0_vlan15: promiscuous mode enabled
carp: demoted by 240 to 960 (interface down)
lagg0_vlan16: promiscuous mode enabled
carp: demoted by 240 to 1200 (interface down)
lagg0_vlan112: promiscuous mode enabled
carp: demoted by 240 to 1440 (interface down)
carp: demoted by 240 to 1680 (pfsync bulk start)
igb1: link state changed to UP
carp: VHID 13@lagg0: INIT -> BACKUP
carp: demoted by -240 to 1440 (interface up)
lagg0: link state changed to UP
carp: VHID 15@lagg0_vlan16: INIT -> BACKUP
carp: demoted by -240 to 1200 (interface up)
lagg0_vlan16: link state changed to UP
carp: VHID 16@lagg0_vlan112: INIT -> BACKUP
carp: demoted by -240 to 960 (interface up)
lagg0_vlan112: link state changed to UP
carp: VHID 14@lagg0_vlan15: INIT -> BACKUP
carp: demoted by -240 to 720 (interface up)
lagg0_vlan15: link state changed to UP
igb4: link state changed to UP
tun1: changing name to 'ovpns1'
igb5: link state changed to UP
tun2: changing name to 'ovpnc2'
tun3: changing name to 'ovpnc3'
carp: VHID 11@igb0: INIT -> BACKUP
carp: demoted by -240 to 480 (interface up)
igb0: link state changed to UP
pflog0: promiscuous mode enabled
ovpns1: link state changed to UP
carp: VHID 12@igb3: INIT -> BACKUP
carp: demoted by -240 to 240 (interface up)
igb3: link state changed to UP
igb2: link state changed to UP
carp: demoted by -240 to 0 (pfsync bulk done)
carp: VHID 14@lagg0_vlan15: BACKUP -> MASTER (master down)
carp: VHID 16@lagg0_vlan112: BACKUP -> MASTER (master down)
carp: VHID 15@lagg0_vlan16: BACKUP -> MASTER (master down)
carp: VHID 13@lagg0: BACKUP -> MASTER (master down)
carp: VHID 11@igb0: BACKUP -> MASTER (preempting a slower master)
carp: VHID 12@igb3: BACKUP -> MASTER (preempting a slower master)
ipfw2 (+ipv6) initialized, divert loadable, nat loadable, default to accept, logging disabled
DUMMYNET 0 with IPv6 initialized (100409)
load_dn_sched dn_sched FIFO loaded
load_dn_sched dn_sched QFQ loaded
load_dn_sched dn_sched RR loaded
load_dn_sched dn_sched WF2Q+ loaded
load_dn_sched dn_sched PRIO loaded
ovpnc3: link state changed to UP
ovpnc2: link state changed to UP
ovpns1: link state changed to DOWN
ovpns1: link state changed to UP
ovpns1: link state changed to DOWN
ovpns1: link state changed to UP
-
Hi Michael
I don't see any obvious problems with what you provided though I'm not familiar with using AVM Fritzbox modems/routers to connect to DSL or CATV.
I've not been able to test behaviour since last week but my problem appears resolved after resetting net.inet.carp.demotion to 0. The VIPs have remained stable on fw1 and the warnings have disappeared from the CARP status page.
Can you paste the output of the following:
sysctl net.inet.carp
In my case, net.inet.carp.demotion was 240, which should have been 0 if everything was OK. I wonder if somebody enabled persistent maintenance mode, but the web GUI didn't reflect that.
If net.inet.carp.demotion isn't 0 for you (in your dmesg output, it looks like the last value was 0), reset it to 0 by setting a negative value equal to however much it is off from 0. In my case that was 240, so I used:
sysctl net.inet.carp.demotion=-240
If yours is 480 for example, use -480.
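To avoid getting the sign or the number wrong, the offset can be worked out from the current value. This is only a sketch: demotion_reset_cmd is a made-up helper, and it prints the command for you to review rather than running it.

```sh
# Given the current net.inet.carp.demotion value, prints the sysctl
# command that would bring the counter back to 0. It only prints the
# command; nothing is changed on the system.
demotion_reset_cmd() {
    current=$1
    case $current in
        # Reject empty input and anything that is not a plain
        # non-negative integer.
        ''|*[!0-9]*) echo "not a number: $current" >&2; return 1 ;;
    esac
    if [ "$current" -eq 0 ]; then
        echo "# demotion already 0, nothing to do"
    else
        echo "sysctl net.inet.carp.demotion=$(( 0 - current ))"
    fi
}
```

On the firewall, demotion_reset_cmd "$(sysctl -n net.inet.carp.demotion)" would print the matching command, for example sysctl net.inet.carp.demotion=-240 when the current value is 240.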
I intend to test failover behaviour before confirming my issues are resolved.
-
Michael: I think the root of your problem is with lagg. Similar to this:
https://lists.freebsd.org/pipermail/freebsd-net/2015-January/040813.html
2.2.1 will default to net.inet.carp.senderr_demotion_factor=0 for this reason. We didn't see anything where this would offer any benefits for our use cases, and it definitely fixes a potential issue there with lagg.
You can set that in system tunables in the meantime for the same end result.
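After adding the tunable (System > Advanced > System Tunables in the GUI) and rebooting, it may be worth confirming that the value is really in effect. A minimal sketch, assuming sysctl prints lines of the form "name: value"; senderr_factor_is_zero is a name invented for this example.

```sh
# Reads `sysctl net.inet.carp` output on stdin and succeeds (exit 0)
# only if net.inet.carp.senderr_demotion_factor is present and 0.
senderr_factor_is_zero() {
    awk -F': *' '
        $1 == "net.inet.carp.senderr_demotion_factor" {
            found = 1
            value = $2 + 0
        }
        END { exit !(found && value == 0) }
    '
}
```

For example: sysctl net.inet.carp | senderr_factor_is_zero && echo "tunable active". If nothing is printed, the tunable is missing or still has a non-zero value.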
-
Dear Christopher,
Thank you very much!! Adding the tunable did solve the problem. I rebooted eight times and experienced no more split brain situations. As with 2.1.5, the machine designated as CARP master was master for all networks after all reboots, as long as it was on. Before adding the tunable, I needed to reboot about eight times to end up without a split brain situation.
I did make two more observations which may be relevant:
- One of my firewall pairs is connected to a stacked switch. Of the LAGG with three members, two cables are connected to one switch in the stack and one to the other switch. In that setting, CARP issues did occur more frequently without the tunable. Maybe the switch interfaces come up and go down slightly slower due to stack coordination. At my other pair of firewalls, all three LAGG member cables go to the same switch, as there is only one due to rack space limitations. There, split brain situations did occur without the tunable, but less frequently.
- After adding the tunable, starting quagga did not work on the backup switch one time, but without practical consequences. Other than that, also starting and stopping quagga does work again after adding the tunable.
In general, I feel that a human readable text about CARP changes in 2.2 similar to the examples in the draft 2.1 book would be very helpful. For example, I am still banging my head to get captive portal running on a CARP / LAGG interface again after upgrading to 2.2 (https://forum.pfsense.org/index.php?topic=87991.msg495896#msg495896). Without understanding the changes, that is hard to do.
Regards,
Michael