Long Delays in CARP WAN Failover



  • What is the delay I should expect on a WAN Port Fail?

    When I unplug the LAN from the Primary Node it takes about 10 seconds for the backup Node to resume answering pings on the LAN VIP addresses.  It takes another 3.5mins for the Backup Node to start responding to the WAN VIP.  This was not my expectation can anyone confirm if this is normal?

    Also if I unplug the WAN port only from the Primary Node it takes 70 seconds for the Backup Node to start responding to the WAN VIP.

    Basic config

    em1
    WAN VIP .180
    WAN PF1 .181
    WAN PF2 .182
    Skew PF1    0
    Skew PF2 100

    em0
    LAN VIP  10.41.43.1
    LAN PF1 10.41.43.11
    LAN PF2 10.41.43.12
    Skew PF1    0
    Skew PF2 100

    em0_vlan31
    LAN VIP  10.41.31.1
    LAN PF1 10.41.31.2
    LAN PF2 10.41.31.3
    Skew PF1    0
    Skew PF2 100

    em0_vlan41
    LAN VIP  10.41.41.1
    LAN PF1 10.41.41.2
    LAN PF2 10.41.41.3
    Skew PF1    0
    Skew PF2 100

    em3 - Plugged Directly from one box to another
    HA PF1 10.10.10.1
    HA PF2 10.10.10.2

    2.3.2-RELEASE-p1 (amd64)
    built on Tue Sep 27 12:13:07 CDT 2016
    FreeBSD 10.3-RELEASE-p9


  • Netgate

    Definitely not normal.

    Are those typos on the IP addresses on em0_vlan31 and em0_vlan41?

    What happens when you just enter CARP maintenance mode on the primary in Status > CARP instead of yanking cables?

    I would packet capture on WAN on the secondary during a failover test and recovery and see what's really going on.

    This is a failover and a fail back on a bench WAN setup:

    $ ping 172.25.228.17
    PING 172.25.228.17 (172.25.228.17): 56 data bytes
    64 bytes from 172.25.228.17: icmp_seq=0 ttl=63 time=0.687 ms
    64 bytes from 172.25.228.17: icmp_seq=1 ttl=63 time=0.529 ms
    64 bytes from 172.25.228.17: icmp_seq=2 ttl=63 time=0.407 ms
    64 bytes from 172.25.228.17: icmp_seq=3 ttl=63 time=0.376 ms
    Request timeout for icmp_seq 4
    Request timeout for icmp_seq 5
    64 bytes from 172.25.228.17: icmp_seq=6 ttl=63 time=0.723 ms
    64 bytes from 172.25.228.17: icmp_seq=7 ttl=63 time=0.566 ms
    64 bytes from 172.25.228.17: icmp_seq=8 ttl=63 time=0.562 ms
    64 bytes from 172.25.228.17: icmp_seq=9 ttl=63 time=0.570 ms
    64 bytes from 172.25.228.17: icmp_seq=10 ttl=63 time=0.742 ms
    64 bytes from 172.25.228.17: icmp_seq=11 ttl=63 time=0.459 ms
    64 bytes from 172.25.228.17: icmp_seq=12 ttl=63 time=0.421 ms
    64 bytes from 172.25.228.17: icmp_seq=13 ttl=63 time=0.496 ms
    64 bytes from 172.25.228.17: icmp_seq=14 ttl=63 time=0.465 ms
    64 bytes from 172.25.228.17: icmp_seq=15 ttl=63 time=0.433 ms
    Request timeout for icmp_seq 16
    64 bytes from 172.25.228.17: icmp_seq=17 ttl=63 time=0.590 ms
    64 bytes from 172.25.228.17: icmp_seq=18 ttl=63 time=0.449 ms
    64 bytes from 172.25.228.17: icmp_seq=19 ttl=63 time=0.427 ms
    64 bytes from 172.25.228.17: icmp_seq=20 ttl=63 time=0.440 ms
    64 bytes from 172.25.228.17: icmp_seq=21 ttl=63 time=0.473 ms
    64 bytes from 172.25.228.17: icmp_seq=22 ttl=63 time=0.403 ms
    ^C
    –- 172.25.228.17 ping statistics ---
    23 packets transmitted, 20 packets received, 13.0% packet loss
    round-trip min/avg/max/stddev = 0.376/0.511/0.742/0.105 ms



  • @Derelict:

    Are those typos on the IP addresses on em0_vlan31 and em0_vlan41?

    Yes, they were, I corrected them in my original post now.

    @Derelict:

    What happens when you just enter CARP maintenance mode on the primary in Status > CARP instead of yanking cables?

    It never misses a beat, same thing if I restart the Primary

    @Derelict:

    I would packet capture on WAN on the secondary during a failover test and recovery and see what's really going on.

    I will have to wait until this evening to do this on our live network.

    Any other thoughts or suggestions to look at while I testing further?



  • Sent PM with CAP


  • Netgate

    What version are you running that it isn't decoding CARP as CARP  but VRRP in the packet capture?

    Please do this:

    On the SECONDARY node in BACKUP status, start a packet capture. Using Protocol, select CARP. Set the Count to something like 10000 so it doesn't stop.

    Note the time to the second if possible and unplug WAN on the primary.

    Let it fail over completely.

    Note the time to the second and reconnect WAN on Primary.

    You should see:

    CARP from the primary

    CARP from the primary should stop immediately after you unplug.

    CARP from the secondary should start nearly immediately after that.

    CARP from the primary should start nearly immediately after you plug it back in.

    CARP from the secondary should stop immediately after that.

    Problems like this are almost always something with the switch.

    Also include the system log on both nodes for the same time period.

    Happy to look at the details in a PM. The one you provided didn't really show anything other than CARP from the secondary for 10 seconds.

    It should look something like this:

    07:06:58.037963 IP 172.25.228.18 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=0 authlen=7 counter=17557958983656044908
    07:06:59.039006 IP 172.25.228.18 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=0 authlen=7 counter=17557958983656044908
    07:07:00.040021 IP 172.25.228.18 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=0 authlen=7 counter=17557958983656044908
    07:07:01.040995 IP 172.25.228.18 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=0 authlen=7 counter=17557958983656044908
    07:07:02.041957 IP 172.25.228.18 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=0 authlen=7 counter=17557958983656044908
    07:07:03.043097 IP 172.25.228.18 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=0 authlen=7 counter=17557958983656044908
    07:07:06.434500 IP 172.25.228.19 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=100 authlen=7 counter=15385611900255070985
    07:07:07.826450 IP 172.25.228.19 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=100 authlen=7 counter=15385611900255070985
    07:07:09.218469 IP 172.25.228.19 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=100 authlen=7 counter=15385611900255070985
    07:07:10.610462 IP 172.25.228.19 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=100 authlen=7 counter=15385611900255070985
    07:07:12.002465 IP 172.25.228.19 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=100 authlen=7 counter=15385611900255070985
    07:07:13.394449 IP 172.25.228.19 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=100 authlen=7 counter=15385611900255070985
    07:07:14.786470 IP 172.25.228.19 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=100 authlen=7 counter=15385611900255070985
    07:07:16.178465 IP 172.25.228.19 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=100 authlen=7 counter=15385611900255070985
    07:07:17.570444 IP 172.25.228.19 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=100 authlen=7 counter=15385611900255070985
    07:07:17.571449 IP 172.25.228.18 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=0 authlen=7 counter=7980853803627081614
    07:07:18.571896 IP 172.25.228.18 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=0 authlen=7 counter=7980853803627081614
    07:07:19.572983 IP 172.25.228.18 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=0 authlen=7 counter=7980853803627081614
    07:07:20.573897 IP 172.25.228.18 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=0 authlen=7 counter=7980853803627081614
    07:07:21.575048 IP 172.25.228.18 > 224.0.0.18: CARPv2-advertise 36: vhid=1 advbase=1 advskew=0 authlen=7 counter=7980853803627081614



  • @Derelict:

    What version are you running that it isn't decoding CARP as CARP  but VRRP in the packet capture?

    2.3.2-RELEASE-p1

    I will do the test steps today during lunch and let you know.  What is the best way to get you the log files?



  • Pulled the WAN from Node1 for 30 seconds and interestingly enough it failed over in about 20 seconds which is WAY less time than usual.

    Things I noted from the test Node 1 System log nor the CAP seems to even register the CARP event.

    Attached is the log files.

    CAPS_LOGS.txt


  • Netgate

    The logs from the two nodes aren't from the same time period.



  • Sorry didn't catch that Node1 was wrong it is listed below now for 11:14:59-11:22:09

    The others were right, right?
    Node2 Log 11:20:39-11:21:28
    Node1 CAP 11:20:28-11:21:54
    Node2 CAP 11:20:26-11:21:52

    Nov 10 11:22:09 check_reload_status updating dyndns WANGW
    Nov 10 11:21:41 php-fpm 26622 /rc.carpmaster: HA cluster member "(10.41.41.1@em0_vlan41): (CU_STUDENT_VLAN41)" has resumed CARP state "MASTER" for vhid 13
    Nov 10 11:21:41 php-fpm 26622 /rc.carpmaster: Message sent to jwilhite@raschools.com OK
    Nov 10 11:21:39 php-fpm 18153 /rc.carpmaster: HA cluster member "(67.239.73.180@em1): (WAN)" has resumed CARP state "MASTER" for vhid 10
    Nov 10 11:21:39 php-fpm 18153 /rc.carpmaster: Message sent to jwilhite@raschools.com OK
    Nov 10 11:21:39 php-fpm 26987 /rc.carpmaster: HA cluster member "(10.41.31.1@em0_vlan31): (CU_STAFF_VLAN31)" has resumed CARP state "MASTER" for vhid 12
    Nov 10 11:21:39 php-fpm 26987 /rc.carpmaster: Message sent to jwilhite@raschools.com OK
    Nov 10 11:21:37 php-fpm 24092 /rc.carpmaster: HA cluster member "(10.41.43.1@em0): (LAN)" has resumed CARP state "MASTER" for vhid 11
    Nov 10 11:21:37 php-fpm 24092 /rc.carpmaster: Message sent to jwilhite@raschools.com OK
    Nov 10 11:21:35 xinetd 17858 Reconfigured: new=0 old=23 dropped=0 (services)
    Nov 10 11:21:35 xinetd 17858 readjusting service 19010-udp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19010-tcp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19009-udp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19009-tcp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19008-udp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19008-tcp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19007-udp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19007-tcp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19006-udp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19006-tcp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19005-udp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19005-tcp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19004-udp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19004-tcp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19003-udp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19003-tcp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19002-udp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19002-tcp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19001-udp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19001-tcp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19000-udp
    Nov 10 11:21:35 xinetd 17858 readjusting service 19000-tcp
    Nov 10 11:21:35 xinetd 17858 readjusting service 6969-udp
    Nov 10 11:21:35 xinetd 17858 Swapping defaults
    Nov 10 11:21:35 xinetd 17858 Starting reconfiguration
    Nov 10 11:21:35 check_reload_status Reloading filter
    Nov 10 11:21:35 php-fpm 18153 /rc.newwanip: Configuring CARP settings finalize…
    Nov 10 11:21:35 php-fpm 18153 /rc.newwanip: pfsync done in 0 seconds.
    Nov 10 11:21:35 php-fpm 18153 /rc.newwanip: waiting for pfsync...
    Nov 10 11:21:34 php-fpm 18153 /rc.newwanip: rc.newwanip: on (IP address: 67.239.73.182) (interface: WAN[wan]) (real interface: em1).
    Nov 10 11:21:34 php-fpm 18153 /rc.newwanip: rc.newwanip: Info: starting on em1.
    Nov 10 11:21:34 php-fpm 18153 /rc.carpbackup: HA cluster member "(67.239.73.180@em1): (WAN)" has resumed CARP state "BACKUP" for vhid 10
    Nov 10 11:21:34 php-fpm 18153 /rc.carpbackup: Message sent to jwilhite@raschools.com OK
    Nov 10 11:21:29 kernel ifa_add_loopback_route: insertion failed: 17
    Nov 10 11:21:29 kernel carp: VHID 10@em1: BACKUP -> MASTER (master down)
    Nov 10 11:21:29 check_reload_status Carp master event
    Nov 10 11:21:28 xinetd 17858 Reconfigured: new=0 old=23 dropped=0 (services)
    Nov 10 11:21:27 xinetd 17858 readjusting service 19010-udp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19010-tcp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19009-udp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19009-tcp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19008-udp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19008-tcp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19007-udp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19007-tcp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19006-udp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19006-tcp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19005-udp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19005-tcp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19004-udp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19004-tcp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19003-udp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19003-tcp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19002-udp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19002-tcp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19001-udp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19001-tcp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19000-udp
    Nov 10 11:21:27 xinetd 17858 readjusting service 19000-tcp
    Nov 10 11:21:27 xinetd 17858 readjusting service 6969-udp
    Nov 10 11:21:27 xinetd 17858 Swapping defaults
    Nov 10 11:21:27 xinetd 17858 Starting reconfiguration
    Nov 10 11:21:27 check_reload_status Reloading filter
    Nov 10 11:21:27 check_reload_status rc.newwanip starting em1
    Nov 10 11:21:27 php-fpm 24092 /rc.linkup: Hotplug event detected for WAN(wan) static IP (67.239.73.182 )
    Nov 10 11:21:27 check_reload_status Carp master event
    Nov 10 11:21:27 check_reload_status Carp master event
    Nov 10 11:21:27 check_reload_status Carp master event
    Nov 10 11:21:27 kernel carp: VHID 12@em0_vlan31: BACKUP -> MASTER (preempting a slower master)
    Nov 10 11:21:27 kernel carp: VHID 13@em0_vlan41: BACKUP -> MASTER (preempting a slower master)
    Nov 10 11:21:27 kernel carp: VHID 11@em0: BACKUP -> MASTER (preempting a slower master)
    Nov 10 11:21:26 check_reload_status Reloading filter
    Nov 10 11:21:26 check_reload_status Restarting OpenVPN tunnels/interfaces
    Nov 10 11:21:26 check_reload_status Restarting ipsec tunnels
    Nov 10 11:21:26 check_reload_status updating dyndns WANGW
    Nov 10 11:21:26 check_reload_status Linkup starting em1
    Nov 10 11:21:26 kernel em1: link state changed to UP
    Nov 10 11:21:26 kernel carp: demoted by -240 to 0 (interface up)
    Nov 10 11:21:26 kernel carp: VHID 10@em1: INIT -> BACKUP
    Nov 10 11:21:26 check_reload_status Carp backup event
    Nov 10 11:21:10 php-fpm 17632 /rc.carpbackup: HA cluster member "(10.41.43.1@em0): (LAN)" has resumed CARP state "BACKUP" for vhid 11
    Nov 10 11:21:10 php-fpm 18153 /rc.carpbackup: HA cluster member "(10.41.31.1@em0_vlan31): (CU_STAFF_VLAN31)" has resumed CARP state "BACKUP" for vhid 12
    Nov 10 11:21:10 php-fpm 17941 /rc.carpbackup: HA cluster member "(10.41.41.1@em0_vlan41): (CU_STUDENT_VLAN41)" has resumed CARP state "BACKUP" for vhid 13
    Nov 10 11:21:10 php-fpm 17632 /rc.carpbackup: Message sent to jwilhite@raschools.com OK
    Nov 10 11:21:10 php-fpm 18153 /rc.carpbackup: Message sent to jwilhite@raschools.com OK
    Nov 10 11:21:10 php-fpm 17941 /rc.carpbackup: Message sent to jwilhite@raschools.com OK
    Nov 10 11:21:07 xinetd 17858 Reconfigured: new=0 old=23 dropped=0 (services)
    Nov 10 11:21:07 xinetd 17858 readjusting service 19010-udp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19010-tcp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19009-udp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19009-tcp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19008-udp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19008-tcp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19007-udp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19007-tcp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19006-udp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19006-tcp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19005-udp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19005-tcp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19004-udp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19004-tcp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19003-udp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19003-tcp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19002-udp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19002-tcp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19001-udp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19001-tcp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19000-udp
    Nov 10 11:21:07 xinetd 17858 readjusting service 19000-tcp
    Nov 10 11:21:07 xinetd 17858 readjusting service 6969-udp
    Nov 10 11:21:07 xinetd 17858 Swapping defaults
    Nov 10 11:21:07 xinetd 17858 Starting reconfiguration
    Nov 10 11:21:07 php-fpm 15607 /rc.carpbackup: HA cluster member "(67.239.73.180@em1): (WAN)" has resumed CARP state "BACKUP" for vhid 10
    Nov 10 11:21:07 php-fpm 15607 /rc.carpbackup: Message sent to jwilhite@raschools.com OK
    Nov 10 11:21:00 check_reload_status Reloading filter
    Nov 10 11:21:00 php-fpm 17632 /rc.linkup: Hotplug event detected for WAN(wan) static IP (67.239.73.182 )
    Nov 10 11:20:59 check_reload_status Carp backup event
    Nov 10 11:20:59 check_reload_status Carp backup event
    Nov 10 11:20:59 check_reload_status Carp backup event
    Nov 10 11:20:59 check_reload_status Linkup starting em1
    Nov 10 11:20:59 check_reload_status Carp backup event
    Nov 10 11:20:59 kernel carp: VHID 12@em0_vlan31: MASTER -> BACKUP (more frequent advertisement received)
    Nov 10 11:20:59 kernel carp: VHID 13@em0_vlan41: MASTER -> BACKUP (more frequent advertisement received)
    Nov 10 11:20:59 kernel carp: VHID 11@em0: MASTER -> BACKUP (more frequent advertisement received)
    Nov 10 11:20:59 kernel em1: link state changed to DOWN
    Nov 10 11:20:59 kernel carp: demoted by 240 to 240 (interface down)
    Nov 10 11:20:59 kernel em1: RX Next to Refresh = 348
    Nov 10 11:20:59 kernel em1: RX Next to Check = 349
    Nov 10 11:20:59 kernel em1: RX discarded packets = 0
    Nov 10 11:20:59 kernel em1: hw rdh = 349, hw rdt = 348
    Nov 10 11:20:59 kernel em1: RX Queue 0 –----
    Nov 10 11:20:59 kernel em1: Tx Descriptors avail failure = 2320
    Nov 10 11:20:59 kernel em1: TX descriptors avail = 40
    Nov 10 11:20:59 kernel em1: Tx Queue Status = -2147483648
    Nov 10 11:20:59 kernel em1: hw tdh = 30, hw tdt = 1014
    Nov 10 11:20:59 kernel em1: TX Queue 0 ------
    Nov 10 11:20:59 kernel Interface is RUNNING and ACTIVE
    Nov 10 11:20:59 kernel em1: Watchdog timeout Queue[0]– resetting
    Nov 10 11:20:52 xinetd 17858 Reconfigured: new=0 old=23 dropped=0 (services)
    Nov 10 11:20:52 xinetd 17858 readjusting service 19010-udp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19010-tcp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19009-udp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19009-tcp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19008-udp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19008-tcp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19007-udp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19007-tcp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19006-udp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19006-tcp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19005-udp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19005-tcp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19004-udp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19004-tcp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19003-udp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19003-tcp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19002-udp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19002-tcp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19001-udp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19001-tcp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19000-udp
    Nov 10 11:20:52 xinetd 17858 readjusting service 19000-tcp
    Nov 10 11:20:52 xinetd 17858 readjusting service 6969-udp
    Nov 10 11:20:52 xinetd 17858 Swapping defaults
    Nov 10 11:20:52 xinetd 17858 Starting reconfiguration
    Nov 10 11:20:51 check_reload_status Reloading filter
    Nov 10 11:20:51 check_reload_status Restarting OpenVPN tunnels/interfaces
    Nov 10 11:20:51 check_reload_status Restarting ipsec tunnels
    Nov 10 11:20:51 check_reload_status updating dyndns WANGW
    Nov 10 11:14:59 php-fpm 63346 /widgets/widgets/ipsec.widget.php: Unable to find IPsec daemon leases file. Could not display mobile user stats!


  • Netgate

    You sure you don't need something like spanning-tree portfast on your switch ports?

    This stuff is - seriously - almost always something with the switch.


  • Netgate

    Looks like it's taking quite some time for the system to detect the cable unplug event.

    The secondary loses CARP from the primary an 11:20:39

    Then 20 seconds later you get this on the primary:

    Nov 10 11:20:59  kernel      em1: Watchdog timeout Queue[0]– resetting

    At 11:21:26 the link comes back up. Failing back to master is immediate.

    This is also strange:

    Nov 10 11:21:34  php-fpm  18153  /rc.carpbackup: HA cluster member "(67.239.73.180@em1): (WAN)" has resumed CARP state "BACKUP" for vhid 10
    Nov 10 11:21:34  php-fpm  18153  /rc.carpbackup: Message sent to jwilhite@raschools.com OK
    Nov 10 11:21:29  kernel      ifa_add_loopback_route: insertion failed: 17
    Nov 10 11:21:29  kernel      carp: VHID 10@em1: BACKUP -> MASTER (master down)

    Probably want to capture CARP on primary to see what you see there.



  • Do you have a lot of people unplugging your WAN cables? I've never had a cable failure on a CARP node.
    I've seen- power supply failure, hard disk failure, OS crash… I know you want to test things, but I don't think your test is very likely to mirror an actual failure. Try putting the master in maintenance mode, or powering it off. If that works it should work in a real failure. If you really want to dig into the cable thing, watch the CARP status on each node when you unplug the WAN, preemption should trigger the rest of the CARP VIPs to go into backup after the WAN goes.



  • @Derelict:

    Looks like it's taking quite some time for the system to detect the cable unplug event.

    The secondary loses CARP from the primary an 11:20:39

    Then 20 seconds later you get this on the primary:

    Nov 10 11:20:59  kernel      em1: Watchdog timeout Queue[0]– resetting

    At 11:21:26 the link comes back up. Failing back to master is immediate.

    This is also strange:

    Nov 10 11:21:34  php-fpm  18153  /rc.carpbackup: HA cluster member "(67.239.73.180@em1): (WAN)" has resumed CARP state "BACKUP" for vhid 10
    Nov 10 11:21:34  php-fpm  18153  /rc.carpbackup: Message sent to jwilhite@raschools.com OK
    Nov 10 11:21:29  kernel      ifa_add_loopback_route: insertion failed: 17
    Nov 10 11:21:29  kernel      carp: VHID 10@em1: BACKUP -> MASTER (master down)

    Probably want to capture CARP on primary to see what you see there.

    I uploaded the CAP for CARP from the Primary and Secondary WAN, are you needing it from another PORT?

    Thanks for noting the strange delay, is there a way to reduce the watchdog time?  I am pretty sure my config is pretty standard, I haven't done any tweaking to this install.

    @dotdash:

    Do you have a lot of people unplugging your WAN cables? I've never had a cable failure on a CARP node.
    I've seen- power supply failure, hard disk failure, OS crash… I know you want to test things, but I don't think your test is very likely to mirror an actual failure. Try putting the master in maintenance mode, or powering it off. If that works it should work in a real failure. If you really want to dig into the cable thing, watch the CARP status on each node when you unplug the WAN, preemption should trigger the rest of the CARP VIPs to go into backup after the WAN goes.

    Although I have had a unplug problem ONCE by accident that is not the norm.  My assumption was that from the perspective of the nodes I was testing a single Switch/Port Failure.

    I will try a power failure of Node1 after hours and then the switch power failure Node1 is on.


  • Netgate

    He has already stated that everything works fine going into and out of maintenance mode.

    I was going to also say that this is a very odd failure test. With that hardware that might be the best that it will do. Unplugging a cable should not result in a watchdog timeout. It should result in a link down. Not sure why that is happening.



  • Couldn't do a switch fail too many people left on the network.  However the power fail for the Primary Node1 worked flawlessly, never missed a beat on pings.  I will have to test this weekend or next week for the switch fail test.

    Probably a dumb question, why does pulling the cord not simulate the same failure?

    Below is the CAP and Log from Node2 during the Node1 fail test.

    Node2
    15:50:02.942083 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:02.942125 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:04.333891 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:04.333904 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:05.725931 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:05.725943 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:07.117895 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:07.117909 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:08.509933 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:08.509945 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:09.901886 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:09.901898 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:11.293933 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:11.293945 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:12.696925 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:12.696937 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:14.095942 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:14.095955 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:15.487932 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:15.487944 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:16.894938 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:16.894950 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:18.286944 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:18.286956 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:19.678938 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:19.678950 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:21.070930 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:21.070941 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:22.481941 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:22.481953 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:23.886938 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:23.886955 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:25.287937 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:25.287948 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:26.690950 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:26.690969 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:28.095943 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:28.095960 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:29.493929 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:29.493941 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:30.898938 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:30.898951 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:32.304932 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:32.304945 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:33.703947 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:33.703961 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:35.101930 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:35.101942 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:36.506943 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:36.506955 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:37.913927 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:37.913939 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:39.318943 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:39.318955 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:40.717936 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:40.717954 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:42.109945 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:42.109957 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:43.528942 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:43.528960 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:44.926941 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:44.926953 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:46.325926 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:46.325938 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:47.717949 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:47.717966 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:49.109942 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:49.109959 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:50.507937 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:50.507949 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:51.913932 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:51.913944 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:53.317941 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866
    15:50:53.317953 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:54.717942 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=9 advbase=1 advskew=100 authlen=7 counter=10048284322299381040
    15:50:54.717959 IP 67.239.73.181 > 224.0.0.18: CARPv2-advertise 36: vhid=10 advbase=1 advskew=100 authlen=7 counter=9060617201846148866

    Nov 10 15:51:30 check_reload_status Carp backup event
    Nov 10 15:51:30 kernel carp: VHID 9@em1: MASTER -> BACKUP (more frequent advertisement received)
    Nov 10 15:51:30 kernel carp: VHID 10@em1: MASTER -> BACKUP (more frequent advertisement received)
    Nov 10 15:51:30 check_reload_status Carp backup event
    Nov 10 15:51:26 check_reload_status Carp backup event
    Nov 10 15:51:26 check_reload_status Carp backup event
    Nov 10 15:51:26 kernel carp: VHID 12@em0_vlan31: MASTER -> BACKUP (more frequent advertisement received)
    Nov 10 15:51:26 kernel carp: VHID 13@em0_vlan41: MASTER -> BACKUP (more frequent advertisement received)
    Nov 10 15:51:26 kernel carp: VHID 11@em0: MASTER -> BACKUP (more frequent advertisement received)
    Nov 10 15:51:26 check_reload_status Carp backup event
    Nov 10 15:50:10 php-fpm 30196 /rc.carpmaster: HA cluster member "(67.239.73.180@em1): (WAN)" has resumed CARP state "MASTER" for vhid 10
    Nov 10 15:50:10 php-fpm 30196 /rc.carpmaster: Message sent to admin@raschools.com OK
    Nov 10 15:50:10 php-fpm 91222 /rc.carpmaster: HA cluster member "(67.239.73.178@em1): (WAN)" has resumed CARP state "MASTER" for vhid 9
    Nov 10 15:50:10 php-fpm 91222 /rc.carpmaster: Message sent to admin@raschools.com OK
    Nov 10 15:50:10 php-fpm 29856 /rc.carpmaster: HA cluster member "(10.41.31.1@em0_vlan31): (VLAN31_STAFF)" has resumed CARP state "MASTER" for vhid 12
    Nov 10 15:50:10 php-fpm 29546 /rc.carpmaster: HA cluster member "(10.41.41.1@em0_vlan41): (VLAN41_CHROMEBOOKS)" has resumed CARP state "MASTER" for vhid 13
    Nov 10 15:50:10 php-fpm 29856 /rc.carpmaster: Message sent to admin@raschools.com OK
    Nov 10 15:50:10 php-fpm 29546 /rc.carpmaster: Message sent to admin@raschools.com OK
    Nov 10 15:50:07 php-fpm 91222 /rc.carpmaster: HA cluster member "(10.41.43.1@em0): (LAN)" has resumed CARP state "MASTER" for vhid 11
    Nov 10 15:50:07 php-fpm 91222 /rc.carpmaster: Message sent to admin@raschools.com OK
    Nov 10 15:50:02 check_reload_status Carp master event
    Nov 10 15:50:02 check_reload_status Carp master event
    Nov 10 15:50:02 check_reload_status Carp master event
    Nov 10 15:50:02 check_reload_status Carp master event
    Nov 10 15:50:02 kernel carp: VHID 9@em1: BACKUP -> MASTER (master down)
    Nov 10 15:50:02 kernel carp: VHID 10@em1: BACKUP -> MASTER (master down)
    Nov 10 15:50:02 kernel carp: VHID 12@em0_vlan31: BACKUP -> MASTER (master down)
    Nov 10 15:50:02 kernel carp: VHID 13@em0_vlan41: BACKUP -> MASTER (master down)
    Nov 10 15:50:02 kernel carp: VHID 11@em0: BACKUP -> MASTER (master down)
    Nov 10 15:50:02 check_reload_status Carp master event



  • Hello, I have the same problem with my Pfsense CARP.
    When I pull out the WAN cable, it takes very much time, before PfSense detects the cable failure.
    Are there any ways to solve the problem?
    Thanks



  • If it was the exact same issue I had then the real problem ended up being my testing method.  Although I never understood why yanking a cable isn't a valid test.  When I simply unplug POWER from the primary router the failover is nearly instantaneous.  I never posted a follow up on simulating a switch failure but that also failed over as expected.

    On another note, I have had to enter maintenance mode 5 times recently so that I could change out some other equipment inline with these.  Three out of five the switchover was nearly 30seconds and the WebInterface on both routers locks up completely which switching over.  Looking at top via ssh during the switchover doesn't show anything locked up nor maxing out RAM/CPU so not sure what the problem was there if it is related at all.