Problem with standby node



  • We have an HA setup here, in which the standby node is constantly having issues. It's mostly unreachable over LAN, and if you do manage to reach it, it kicks you out of SSH after a while and the web UI becomes unresponsive. We have rebooted it several times to no avail. Any ideas how this could be debugged further?

    The HA pair is Netgate SG-8860 version 2.4.2-RELEASE-p1 (amd64).



  • After much head-scratching, I found out that:

    • someone had configured a /32 instead of a /29 for the WAN IP, which caused some of the issues.

    • for some reason, device management does not work from outside the L2 network when the node is in standby mode. This has now been confirmed with both HA members after CARP failover:

      • HTTP(S) does not even open most of the time, and when it does, it gets stuck

      • SSH is cut after a few seconds
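
For anyone following along, here is a minimal sketch of why a /32 in place of a /29 matters. The thread does not give the real WAN IP, so `203.0.113.10` is a made-up address from the documentation range, and `on_link_hosts` is just an illustrative helper. It lists which other addresses the interface would consider directly connected under each prefix length.

```python
import ipaddress

# Hypothetical WAN address from the documentation range (RFC 5737);
# the real WAN IP is not given in the thread.
WAN_IP = "203.0.113.10"


def on_link_hosts(address, prefix_len):
    """Return the other host addresses that share the interface's connected network."""
    iface = ipaddress.ip_interface(f"{address}/{prefix_len}")
    # Exclude the interface's own address so only potential neighbours remain.
    return {str(host) for host in iface.network.hosts() if host != iface.ip}


wrong = on_link_hosts(WAN_IP, 32)   # the misconfigured mask
right = on_link_hosts(WAN_IP, 29)   # the intended mask

print("/32 neighbours:", sorted(wrong))
print("/29 neighbours:", sorted(right))
```

With the /32 the connected network contains only the interface's own address, so anything else on the WAN segment (presumably including the peer node and any shared address) falls outside it. With the /29 there are five other usable addresses in the same network.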



  • Someone? Sounds like somebody just quit and your boss just told you, Sam, this is now yours!  Have fun. :D


  • LAYER 8 Netgate

    Then you still have something misconfigured.

    A properly-configured HA pair is always accessible, both primary and secondary, using the interface IP addresses.

    You need to make sure ALL of your interfaces exactly match exactly match exactly match on both nodes in the same order in the same order in the same order. This means the physical interface (igb1, em2, ix1.102, etc.) and the internal, logical interface name (wan, lan, opt1, opt2, etc.). Making the descriptive name match exactly is also recommended for sanity's sake (LAN, WAN, DMZ, GUESTWIFI, SYNC, etc.).

    I use the Status > Interfaces screen to check this since it displays all of the pertinent information in the proper order.

    That is where I would start based on your trouble description.
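
The check described above can be sketched as a position-by-position comparison. This is only an illustration: the `Assignment` type, the `find_mismatches` helper and the sample data are all hypothetical, and the values are assumed to be copied by hand from Status > Interfaces on each node rather than read from any pfSense API.

```python
from typing import NamedTuple


class Assignment(NamedTuple):
    logical: str      # internal name, e.g. "wan", "lan", "opt1"
    physical: str     # physical or VLAN interface, e.g. "igb0.217"
    description: str  # descriptive name, e.g. "LAN"


def find_mismatches(primary, secondary):
    """Compare two ordered assignment lists position by position."""
    problems = []
    if len(primary) != len(secondary):
        problems.append(
            f"interface count differs: {len(primary)} vs {len(secondary)}"
        )
    for index, (a, b) in enumerate(zip(primary, secondary)):
        for field in Assignment._fields:
            if getattr(a, field) != getattr(b, field):
                problems.append(
                    f"position {index}: {field} "
                    f"{getattr(a, field)!r} != {getattr(b, field)!r}"
                )
    return problems


# Hypothetical sample data: two VLANs swapped and one description in lower case.
gw1 = [
    Assignment("wan", "igb1", "WAN"),
    Assignment("lan", "igb0.217", "LAN"),
    Assignment("opt1", "igb0.100", "OFFICE"),
]
gw2 = [
    Assignment("wan", "igb1", "WAN"),
    Assignment("lan", "igb0.100", "LAN"),
    Assignment("opt1", "igb0.217", "office"),
]

for problem in find_mismatches(gw1, gw2):
    print(problem)
```

An empty result means the two lists agree on logical name, physical interface and description in every position, including case.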



  • I know who did it, just wasn't me.

    The primary and secondary are both accessible, but the secondary refuses to be accessible from a different subnet. If I use its interface IP in the same subnet as the client, it works as expected; if I use its IP in a different subnet, it misbehaves.

    Interface names match on both nodes: physical, logical and descriptive names, including case.


  • LAYER 8 Netgate

    Then either your routing or your rules are wrong.

    You will probably have to be more specific and post screenshots of the addresses/interfaces/rules in question to receive more assistance.



  • This is what happens when I try ssh from another L2

    ~$ ip -4 addr show dev eno1
    2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
        inet 10.217.110.125/24 brd 10.217.110.255 scope global dynamic eno1
          valid_lft 56325sec preferred_lft 56325sec
    ~$ ssh 10.217.1.3 -l admin
    Connection to 10.217.1.3 closed by remote host.
    Connection to 10.217.1.3 closed.

    And from same L2

    ~$ ssh 10.217.110.3 -l admin
    Password for admin@gw13.dovecot.fi:
    Netgate SG-8860 …

    Here is the relevant ruleset

    Could not find anything related/useful in logs.


  • LAYER 8 Netgate

    Looks like the firewall is either blocking that connection if it has to time out or rejecting that connection if you are getting that connection closed immediately.

    That image is too small to read clearly - even with my reader specs.



  • The connection closed comes after a longish delay.

    Slightly larger image, hopefully this is clearer.


  • LAYER 8 Netgate

    What interface is that on? What is the interface subnet? What is the source address? What is the target address?

    See my sig for the type of information required for us to help you.



  • The setup is like this:

    igb0.217 = 10.217.1.1/24 (vip), 10.217.1.2/24 (gw1), 10.217.1.3/24 (gw2)
    igb0.100 = 10.217.110.1/24 (vip), 10.217.110.2/24 (gw1), 10.217.110.3/24 (gw2)

    When gw1 is master and gw2 is standby, I can access 10.217.1.2 from 10.217.110.125/24 just fine. I can't access 10.217.1.3 from that station, but I can access 10.217.110.3 just fine.

    If I switch gw1 to standby and gw2 to master, I can't access 10.217.1.2 from 10.217.110.125 anymore, but I can access 10.217.110.2.

    In the spirit of debugging, I have now tested this about 10 times by using 'persistent CARP maintenance mode' on gw1.

    The pfSense nodes are in an HA cluster, serving those subnets.

    The symptoms are:

    • The login page opens, but no matter how long I wait, the web UI login never completes. (The TCP connection is established, but the login does not complete.)

    • SSH behaves the same: TCP establishes, but the actual login won't complete. The few rare times it does, it kicks you out with 'Write failed: Broken pipe' after some seconds.
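
To make the layout above easier to reason about, here is a rough sketch that works out, for each test in the post, how the client's packets would reach the node and which interface the node would reply from. It assumes ordinary connected-route behaviour (a host replies out of whichever interface sits in the destination's subnet) and takes gw2's address on igb0.100 to be 10.217.110.3, matching the SSH example. The `NODES` table and both helper functions are illustrative only.

```python
import ipaddress

# Addresses as posted in the thread (gw2 on igb0.100 taken as 10.217.110.3).
NODES = {
    "gw1": {"igb0.217": "10.217.1.2/24", "igb0.100": "10.217.110.2/24"},
    "gw2": {"igb0.217": "10.217.1.3/24", "igb0.100": "10.217.110.3/24"},
}
CLIENT = ipaddress.ip_interface("10.217.110.125/24")


def connected_interface(node, address):
    """Return the node's interface whose subnet contains address, or None."""
    target = ipaddress.ip_address(address)
    for name, cidr in NODES[node].items():
        if target in ipaddress.ip_interface(cidr).network:
            return name
    return None


def describe(node, target):
    """Return (forward hop, reply interface) for client -> target on node."""
    on_link = ipaddress.ip_address(target) in CLIENT.network
    forward = "direct (same L2)" if on_link else "via the gateway VIP (current master)"
    # Assumption: the reply follows the node's connected route to the client.
    reply_interface = connected_interface(node, str(CLIENT.ip))
    return forward, reply_interface


for node, target in [("gw2", "10.217.1.3"), ("gw2", "10.217.110.3")]:
    forward, reply_interface = describe(node, target)
    arrives_on = connected_interface(node, target)
    print(f"{target}: forward {forward}, arrives on {arrives_on}, "
          f"reply leaves {reply_interface}")
```

Under those assumptions, a request to 10.217.1.3 is routed by the current master and arrives on gw2's igb0.217, while the reply would leave from igb0.100 straight to the client. A request to 10.217.110.3 stays on one interface in both directions. Whether that difference is what breaks the sessions is not confirmed anywhere in this thread; the sketch only shows that the failing and working cases differ in this respect.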


  • LAYER 8 Netgate

    Sounds like you might not be setting the clients to use the CARP VIP as the gateway.



  • Unfortunately the CARP VIP is used. I think I'll just accept that it refuses to work over L3.


  • LAYER 8 Netgate

    It works fine. Maybe your switches aren't moving the CARP MAC address like they should.



  • That would imply that nothing would work, but the problem is limited to the standby node only. Internet works, and other resources on the other L2 work, so the gateway MAC cannot be blamed.


  • LAYER 8 Netgate

    Telling you, bro, it all works. You have something hosed up or are misunderstanding something.



  • No doubt. Just would be nice to know what.

