Slow failover recovery



  • Hi,

    I am building a high-availability setup with two systems.  To facilitate this I have three CARP VIPs.  The basic network is:

    Internal LAN:  10.192.0.0/10
    External WAN: 192.168.230.0/24

    This is a development setup; eventually the WAN will be replaced by the real internet addresses.

    There are two firewalls, each with 4 NICs, of which only 3 are in use right now.  Their details are:

    RA-1

    Internal address:  10.1.2.1/10
    External address: 192.168.230.52/24
    SYNC NIC address:  10.254.1.1/10

    RA-2

    Internal address: 10.1.2.10/10
    External address: 192.168.230.53/24
    SYNC NIC address:  10.254.1.2/10

    The three VIPs are:

    192.168.230.50/24 (vhid 1)      (Nat'd to RA-1)
    10.1.1.1/10 (vhid 2)                (this is the gateway IP)
    192.168.230.51/24 (vhid 10)  (Nat'd to RA-2)
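
    For reference, an addressing plan like this can be sanity-checked with Python's standard `ipaddress` module. The sketch below is only an illustration: it assumes the first VIP is meant to be 192.168.230.50, and the labels in `INTERFACES` are made up for this example. Taken literally, the internal addresses and the gateway VIP fall in 10.0.0.0/10, while the stated LAN of 10.192.0.0/10 is the network the SYNC addresses fall in, so one of the posted figures may be a typo. The sketch does not try to guess which.

```python
import ipaddress

# Networks as stated in the post.
STATED_LAN = ipaddress.ip_network("10.192.0.0/10")
STATED_WAN = ipaddress.ip_network("192.168.230.0/24")

# Addresses as posted. The labels are hypothetical, and the first VIP is
# assumed to be 192.168.230.50.
INTERFACES = {
    "RA-1 internal": "10.1.2.1/10",
    "RA-1 external": "192.168.230.52/24",
    "RA-1 sync": "10.254.1.1/10",
    "RA-2 internal": "10.1.2.10/10",
    "RA-2 external": "192.168.230.53/24",
    "RA-2 sync": "10.254.1.2/10",
    "VIP vhid 1": "192.168.230.50/24",
    "VIP vhid 2": "10.1.1.1/10",
    "VIP vhid 10": "192.168.230.51/24",
}


def network_of(cidr: str) -> ipaddress.IPv4Network:
    """Return the network that an address/prefix pair belongs to."""
    return ipaddress.IPv4Interface(cidr).network


def report() -> list:
    """List (label, address, network) for every configured address."""
    return [(name, cidr, str(network_of(cidr))) for name, cidr in INTERFACES.items()]


if __name__ == "__main__":
    for name, cidr, net in report():
        print(f"{name:15} {cidr:20} -> {net}")
```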

    The passwords on both systems for each VIP are identical.  The advertising frequency on all three VIPs is set to 0 on the RA-1 system and to 1 on the RA-2 system.
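
    As a rough guide to what these values mean for timing, here is a minimal sketch of the CARP timer arithmetic as I understand the FreeBSD implementation. It assumes that the "advertising frequency" field maps to CARP's `advskew` (counted in 1/256ths of a second) and that `advbase` is left at its default of 1 second. Both are assumptions and not something stated in the post, and the function names are made up for this example.

```python
def advert_interval(advbase: int, advskew: int) -> float:
    """Seconds between advertisements sent by the current master."""
    return advbase + advskew / 256


def master_down_timeout(advbase: int, advskew: int) -> float:
    """Seconds a backup waits without hearing the master before taking over.

    Assumption: three base intervals plus this node's own skew.
    """
    return 3 * advbase + advskew / 256


if __name__ == "__main__":
    ADVBASE = 1  # assumed default, not taken from the post
    print("RA-1 (skew 0) advertises every", advert_interval(ADVBASE, 0), "s")
    print("RA-2 (skew 1) takes over after", master_down_timeout(ADVBASE, 1), "s")
```

    Under those assumptions the backup would declare the master down after roughly three seconds. The node with the lower skew advertises more often and is preferred as master.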

    The time on both systems is as close to identical as I can tell; they both use the same time server.

    Now the problem:

    When testing, I disconnect both LAN and WAN from RA-1, and the failover to RA-2 takes about 3 seconds.  However, after plugging the LAN and WAN back into RA-1, the recovery is much different:

    First, the CARP VIPs on RA-2 drop back to backup mode within a few seconds, but RA-1 is non-responsive for 34 seconds, and during those 34 seconds the internal LAN loses connectivity to the outside WAN.  Also, during the recovery, outside connections seem to get broken.

    Also, while this may not be relevant, I am seeing the following messages in the system.log on RA-1:

    Jun 9 11:03:23 kernel: arp: 10.10.1.1 moved from 00:15:17:d8:41:20 to 00:15:17:d8:41:22 on igb1
    Jun 9 11:04:00 kernel: arp: 10.10.1.1 moved from 00:15:17:d8:41:22 to 00:15:17:d8:41:20 on igb1
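
    Messages like these can be scanned for automatically. The sketch below is a hypothetical helper, not a pfSense tool: it parses `arp: ... moved from ... to ...` lines and reports every IP address that has been claimed by more than one MAC address on the same interface. The sample text is the two lines quoted above.

```python
import re
from collections import defaultdict

# Matches the kernel's "arp: <ip> moved from <mac> to <mac> on <iface>" line.
ARP_MOVED = re.compile(
    r"arp: (?P<ip>\S+) moved from (?P<old>[0-9a-f:]{17}) "
    r"to (?P<new>[0-9a-f:]{17}) on (?P<iface>\S+)"
)

SAMPLE_LOG = """\
Jun 9 11:03:23 kernel: arp: 10.10.1.1 moved from 00:15:17:d8:41:20 to 00:15:17:d8:41:22 on igb1
Jun 9 11:04:00 kernel: arp: 10.10.1.1 moved from 00:15:17:d8:41:22 to 00:15:17:d8:41:20 on igb1
"""


def flapping_addresses(log_text: str) -> dict:
    """Map (ip, interface) to the sorted MACs seen claiming that IP."""
    claims = defaultdict(set)
    for line in log_text.splitlines():
        match = ARP_MOVED.search(line)
        if match:
            key = (match["ip"], match["iface"])
            claims[key].update((match["old"], match["new"]))
    # Keep only addresses claimed by more than one MAC.
    return {key: sorted(macs) for key, macs in claims.items() if len(macs) > 1}


if __name__ == "__main__":
    for (ip, iface), macs in flapping_addresses(SAMPLE_LOG).items():
        print(f"{ip} on {iface} is claimed by: {', '.join(macs)}")
```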

    At one point during my testing I had tried spoofing the MAC address on igb1, but removed that when that part of my testing was done.  Is it possible that it is still trying to spoof the MAC address and that is what is slowing it down?

    Any ideas?

    Thanks in advance.

    JBB


  • Rebel Alliance Developer Netgate

    The ARP is a little odd, you shouldn't see messages like that if it is working normally.

    You might try (as a test) to see if making the backup box's skew higher, perhaps 50, makes any difference.

    During that 30+ seconds of limbo, what do the CARP IPs on the master box show?

    This may be a wild suggestion, but it's worth considering: the backup/master change also relies on the switch, and that 30-second delay sounds like it could be STP-related on a managed switch. That is about how long it takes a port to move from blocking to forwarding. If you have a managed switch, try disabling STP on the ports involved (on Cisco, enable PortFast).
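
    To put a number on the STP idea: with the default timers of classic 802.1D spanning tree (hello 2 s, max age 20 s, forward delay 15 s), a port that has just come up spends one forward-delay period listening and another learning before it forwards traffic. The sketch below only does that arithmetic. It assumes the switch runs classic STP with default timers, which is not known from this thread, and the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StpTimers:
    """Classic 802.1D timers in seconds (defaults per the standard)."""
    hello: int = 2
    max_age: int = 20
    forward_delay: int = 15


def link_up_delay(timers: StpTimers) -> int:
    """Seconds before a newly connected port forwards: listening + learning."""
    return 2 * timers.forward_delay


def topology_change_delay(timers: StpTimers) -> int:
    """Upper bound when stale root information must first age out."""
    return timers.max_age + 2 * timers.forward_delay


if __name__ == "__main__":
    defaults = StpTimers()
    print("Port forwards after link-up in about", link_up_delay(defaults), "s")
    print("Worst case after an indirect failure:", topology_change_delay(defaults), "s")
```

    A port set to PortFast (or an "edge port" in RSTP terms) skips the listening and learning states, which is why enabling it is a quick way to test this theory.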



  • @jimp:

    The ARP looks like it may be related to the second system.  For now I've turned it off for testing.

    The skew on all three CARPs on the backup box was already set to 100.

    Even with the second system off, I am still seeing the drops in traffic.

    JBB



  • Apparently the ARPs were left over from when I had temporarily changed the MAC address.  I did a clean install, set up a basic config and the ARPs were gone.

    However, I still got the drops in traffic.

    I then upgraded to the 2.0 beta.  Things are very different now:

    I am seeing a drop in traffic once every 5 minutes, 30 seconds (325-330 seconds).  This is with about 90+ Mbit of incoming traffic, and about 2-3 Kbits outgoing.  This drop is as regular as a clock.

    This system is running a dual-core Atom D510 processor, with the i386 version of pfSense installed.  It has 1 GB of memory.  The motherboard has 2 onboard NICs, using the Intel 82574L.  I also have a dual-port NIC card, an Intel E1G42ET, which uses the Intel® 82576 Gigabit Ethernet Controller.  I'm seeing the same dropouts whether I use the two onboard NICs or the add-in NIC card.

    JBB


  • Rebel Alliance Developer Netgate

    Can you try a different set of switches?



  • @jimp:

    No.  What I've done as a test is install CentOS on the same firewall/router and configure it as a simple firewall/gateway, doing the same job I had pfSense doing.  Where pfSense was having dropouts in traffic once every few minutes, I just finished a 2-hour test running 600 Mbit of traffic through the switches and didn't have any problem.

    I need to check one thing on Monday before I decide to forget about pfSense.  While setting up CentOS, I had a funny thing happen which I still can't identify.  The thought occurred to me that it may have been causing the pfSense problem as well.  I'll find out on Monday.

    JBB



  • I did my test, no change.  The thought was that there was another NIC on the network which might have been trying to take the IP address.  I removed it and saw no change in my problems.

    I really want to get this working, but can't afford to spend much more time on this.

    JBB


  • Rebel Alliance Developer Netgate

    It may be something in the NIC drivers then.

    Did you try disabling Hardware Checksums under advanced options?



  • Actually, this problem may be related to another problem I posted at the same time:

    http://forum.pfsense.org/index.php/topic,25874.msg135322.html#msg135322

    I've been concentrating on the other problem because that one made pfSense unusable for my application.

    For now I'm forced to go with another solution.

    JBB

