Problem with state sync in CARP



  • Hi,

    I have the following setup:

    2 physical servers with ESX 5.5 on each
    connected to Juniper EX4300 (stacked)
    NICs on VMware are teamed
    pfsense 2.2.5 VM on each ESX
    it's a new setup from scratch (I installed 2.2.4 and upgraded within 3 days).

    I'm setting up CARP. Everything works great until I enable state sync on the master.
    Then the master stays fine while the backup starts misbehaving.

    It's not in production, so I don't know if it has other side effects,
    but just from browsing through the web GUI I see it resetting the connection all the time,
    i.e. it starts loading the page, then hangs, then the browser shows the SSL error page.

    The log has errors like:
    lighttpd[38727]: (connections.c.263) SSL: -1 5 1 Operation not permitted
    lighttpd[38727]: (connections.c.620) connection closed: write failed on fd 16
    lighttpd[38727]: (network_openssl.c.143) SSL: 5 -1 1 Operation not permitted

    I also noticed after previous reboots it would generate a new SSL certificate.
    It seems it stopped doing that though.

    If I uncheck the state sync checkbox on the master - it immediately starts to behave well.
    The checkbox on the backup is always on.

    I have about 20 interfaces, they are exactly the same on both machines and the order is right.
    The sync is on the separate interface.

    Net.ReversePathFwdCheckPromisc is changed to "1" on every vswitch.
    all of these are "accepted":
        Enable promiscuous mode on the vSwitch
        Enable "MAC Address changes"
        Enable "Forged transmits"

    What can be the problem?

    Thanks



  • I personally don't like running CARP in VMware. Promiscuous mode on the vSwitch causes extra CPU load compared with running as a regular router/firewall.

    If you already have it, let VMware clustering handle moving the VM to the second host if there is a hardware failure.
    You can also create an alert rule so that if any NIC associated with the firewall goes down, the VM is vMotioned to the other host.

    If you don't have vMotion through clustering, are you setting up the state sync using IP or multicast? I would use a direct IP.
    What do you have in the logs on each of the systems that might indicate a problem?



  • Well, I already have 2 setups in VMware and they are working great.
    This is the 3rd one, and right after I had these problems I set up another one on the same machines - different IPs/VLANs/WANs though - it's working fine.
    So it's only this cluster that I had trouble with.

    "are you setting up the state sync using IP or multicast? I would use a direct IP."

    What do you mean? Whether I inserted an IP under the state sync? Yes, I did.

    The logs are only from the web gui - the ones I already mentioned.
    These are the latest ones:

    
    lighttpd[37238]: (connections.c.1692) SSL (error): 5 -1 1 Operation not permitted
    Nov 12 14:09:57 	lighttpd[37238]: (connections.c.619) connection closed: write failed on fd 17
    Nov 12 14:09:57 	lighttpd[37238]: (network_openssl.c.118) SSL: 5 -1 1 Operation not permitted
    Nov 12 13:26:36 	lighttpd[37238]: (connections.c.1692) SSL (error): 5 -1 1 Operation not permitted
    Nov 12 13:26:36 	lighttpd[37238]: (connections.c.619) connection closed: write failed on fd 16
    Nov 12 13:26:36 	lighttpd[37238]: (network_openssl.c.118) SSL: 5 -1 1 Operation not permitted
    Nov 12 13:19:37 	lighttpd[37238]: (connections.c.1692) SSL (error): 5 -1 1 Operation not permitted
    Nov 12 13:19:37 	lighttpd[37238]: (connections.c.619) connection closed: write failed on fd 16
    Nov 12 13:19:37 	lighttpd[37238]: (network_openssl.c.118) SSL: 5 -1 1 Operation not permitted
    Nov 12 13:19:31 	lighttpd[37238]: (connections.c.1692) SSL (error): 5 -1 1 Operation not permitted
    Nov 12 13:19:31 	lighttpd[37238]: (connections.c.619) connection closed: write failed on fd 18
    
    

    Should I look in the WAN-VLAN direction? All the IPs are private. Outbound NAT is set to AON. All the rules on it are disabled.
    I removed all the VIPs and it was fine. I added 2 main WANs and the problem came back.



  • By the way, I reinstalled the machines, this time with version 2.2.4 and uploaded the existing config - no effect.



  • I have a very similar setup to yours, not as many interfaces, but dual ESXi machines with several pfSense clusters running on them - 2.2.4 and 2.2.5, no issues.

    Are you certain there are no VHID conflicts occurring?
    CARP and VRRP use the same protocol ID and VHID will conflict despite the fact that they are incompatible.  Make sure there are no VRRP instances (usually on switches) sharing the same VHID.
    Each interface requires a separate VHID, and if you are running IPv6, it requires a different VHID than IPv4 on the same interface.

    Use tcpdump -T carp -e -s 0 -n -i interface proto 112 to check for conflicts.
    Make sure that on any one interface you aren't getting 2 advertisements with the same VHID and different information within the same 1-second interval.

    For example:
    In this case IPv4 is using VHID 250 as seen by source MAC address: 00:00:5e:00:01:fa (fa=250)
    IPv6 is using VHID 251 as seen by source MAC address: 00:00:5e:00:01:fb (fb=251)

    09:13:16.541596 00:00:5e:00:01:fb > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 90: (hlim 255, next-header VRRP (112) payload length: 36) xxxx:xxxx:xxxx:xxxx::7:248 > ff02::12: ip-proto-112 36
    09:13:16.961394 00:00:5e:00:01:fa > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 70: (tos 0x10, ttl 255, id 40681, offset 0, flags [DF], proto VRRP (112), length 56)
        xxx.xxx.207.248 > 224.0.0.18: carp xxx.xxx.207.248 > 224.0.0.18: CARPv2-advertise 36: vhid=250 advbase=1 advskew=0 authlen=7 counter=6696271131254409671
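
    The check described above can be sketched in a few lines of Python. This is only a rough illustration, not part of pfSense: the helper names (vhid_from_mac, senders_by_vhid, conflicts) are made up for this post, and the regular expression assumes IPv4 advertisement lines shaped like the tcpdump output above. Comparing only the source address is a simplification of "different information", and after a failover two senders for one VHID can be legitimate, so treat a hit as a hint to look closer.

```python
import re

# CARP/VRRP virtual MAC: 00:00:5e:00:01:XX, where XX is the VHID in hex.
CARP_MAC = re.compile(r"^00:00:5e:00:01:([0-9a-f]{2})$", re.IGNORECASE)

# Assumed line shape, taken from the IPv4 sample output in this thread.
ADVERT = re.compile(r"(\S+) > 224\.0\.0\.18: CARPv2-advertise \d+: vhid=(\d+)")


def vhid_from_mac(mac):
    """Return the VHID encoded in a CARP virtual MAC, or None if it is not one."""
    m = CARP_MAC.match(mac.strip())
    return int(m.group(1), 16) if m else None


def senders_by_vhid(lines):
    """Map each VHID to the set of source addresses seen advertising it."""
    seen = {}
    for line in lines:
        m = ADVERT.search(line)
        if m:
            seen.setdefault(int(m.group(2)), set()).add(m.group(1))
    return seen


def conflicts(lines):
    """Return only the VHIDs that were advertised by more than one source."""
    return {v: s for v, s in senders_by_vhid(lines).items() if len(s) > 1}
```

    To try it, save a short capture from one interface to a text file and pass its lines to conflicts(); an empty result means every VHID seen in that capture had a single sender.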



  • Well, I actually removed all the VIPs except for 1 WAN and yet the problem exists.
    I also tried changing the VHID to 3 different ones, like 201, 40, 31, i.e. something random so that it will definitely not conflict with anything.
    I don't use IPv6.

    This is so weird.
    I read some other posts about the same web server errors, and cmb said it looks like something is deleting the state table, which seems to match my case.
    But I can't figure out what it could be.
    The PFSYNC interface is dedicated and the subnet is /30 (I tried /24 also) and nothing else can interfere.



  • So I reinstalled again from scratch and with a config from scratch.
    The issue is still there.

    Made a packet capture on the pfsense side, this is what I got from the last packet:
    16 2015-11-15 14:57:28.190929 10.X.X.X 172.Y.Y.Y TCP 60 51458 > https [RST] Seq=424 Win=0 Len=0
    Acknowledgment number: Broken TCP. The acknowledge field is nonzero while the ACK flag is not set

    Any idea?



  • OK, it's a reset packet, but what's the context? So far you haven't given any information that can help troubleshoot your problem.
    Screenshots and/or a diagram are required.

