Problem with standby node

cmouse

We have an HA setup here in which the standby node is constantly having issues. It's mostly unreachable over the LAN, and if you do manage to reach it, it kicks you out of SSH after a while and the web UI becomes unresponsive. We have rebooted it several times to no avail. Any ideas how this could be debugged further?

The HA pair is Netgate SG-8860 version 2.4.2-RELEASE-p1 (amd64).

cmouse

After much head-scratching, I found out that:

- someone had configured a /32 instead of a /29 for the WAN IP, which caused some of the issues.
- for some reason, device management does not work from outside the L2 network when the node is in standby mode. This has now been confirmed with both HA members after CARP failover:
  - HTTP(S) does not even open most of the time, and if it does, it gets stuck
  - SSH is cut after a few seconds
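
To illustrate why a /32 in place of a /29 matters, here is a minimal Python sketch using the standard `ipaddress` module. The addresses are hypothetical documentation-range values, not the ones from this setup, and `gateway_on_link` is just an illustrative helper. The point is only that with a /32 the upstream gateway no longer falls inside the interface's own subnet.

```python
import ipaddress

def gateway_on_link(interface_cidr: str, gateway: str) -> bool:
    """Return True if the gateway lies inside the interface's own subnet."""
    iface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(gateway) in iface.network

# Hypothetical addresses, for illustration only.
GATEWAY = "203.0.113.9"

for cidr in ("203.0.113.10/29", "203.0.113.10/32"):
    network = ipaddress.ip_interface(cidr).network
    print(cidr, "network:", network, "gateway on-link:", gateway_on_link(cidr, GATEWAY))
```

With the /29 the gateway is inside the interface's network; with the /32 the network contains only the interface address itself.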

SammyWoo

Someone? Sounds like somebody just quit and your boss just told you, Sam, this is now yours! Have fun. :D

Derelict

Then you still have something misconfigured.

A properly-configured HA pair is always accessible, both primary and secondary, using the interface IP addresses.

You need to make sure ALL of your interfaces exactly match on both nodes, in the same order. This means the physical interface (igb1, em2, ix1.102, etc.) and the internal, logical interface name (wan, lan, opt1, opt2, etc.). Making the descriptive name match exactly is also recommended for sanity's sake (LAN, WAN, DMZ, GUESTWIFI, SYNC, etc.).

I use the Status > Interfaces screen to check this since it displays all of the pertinent information in the proper order.

That is where I would start based on your trouble description.
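
As a rough sketch of that comparison, the Python below checks two interface lists position by position. The interface data is entirely hypothetical (in practice you would copy it by hand from Status > Interfaces on each node), and `interface_mismatches` is an illustrative helper, not a pfSense tool.

```python
from typing import List, Tuple

# (logical name, physical port, descriptive name), in assignment order.
Assignment = Tuple[str, str, str]

def interface_mismatches(primary: List[Assignment],
                         secondary: List[Assignment]) -> List[str]:
    """List every position where the two nodes' assignments differ."""
    problems = []
    if len(primary) != len(secondary):
        problems.append(
            f"interface count differs: {len(primary)} vs {len(secondary)}")
    for index, (a, b) in enumerate(zip(primary, secondary)):
        if a != b:
            problems.append(f"position {index}: primary={a} secondary={b}")
    return problems

# Hypothetical example: the last two assignments are swapped on the secondary.
primary = [("wan", "igb1", "WAN"),
           ("lan", "igb0.100", "LAN"),
           ("opt1", "igb0.217", "MGMT")]
secondary = [("wan", "igb1", "WAN"),
             ("lan", "igb0.217", "MGMT"),
             ("opt1", "igb0.100", "LAN")]

for problem in interface_mismatches(primary, secondary):
    print(problem)
```

An empty result means the physical, logical, and descriptive names line up in the same order on both nodes.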

cmouse

I know who did it, just wasn't me.

The primary and secondary are both accessible, but the secondary refuses to be accessible from a different subnet. If I use its interface IP in the same subnet, it works as expected; if I try to reach it from a different subnet, it misbehaves.

Interface names match on both, for physical, logical and descriptive name, ensuring case is same too.

Derelict

Then either your routing or your rules are wrong.

You will probably have to be more specific and post screenshots of the addresses/interfaces/rules in question to receive more assistance.

cmouse

This is what happens when I try ssh from another L2

~$ ip -4 addr show dev eno1
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
inet 10.217.110.125/24 brd 10.217.110.255 scope global dynamic eno1
valid_lft 56325sec preferred_lft 56325sec
~$ ssh 10.217.1.3 -l admin
Connection to 10.217.1.3 closed by remote host.
Connection to 10.217.1.3 closed.

And from same L2

~$ ssh 10.217.110.3 -l admin
Password for admin@gw13.dovecot.fi:
Netgate SG-8860 …

Here is the relevant ruleset

Could not find anything related/useful in logs.

Derelict

Looks like the firewall is either blocking that connection if it has to time out or rejecting that connection if you are getting that connection closed immediately.

That image is too small to read clearly - even with my reader specs.

cmouse

The connection closed comes after a longish delay.

Slightly larger image, hopefully this is clearer.

Derelict

What interface is that on? What is the interface subnet? What is the source address? What is the target address?

See my sig for the type of information required for us to help you.

cmouse

The setup is like this:

igb0.217 = 10.217.1.1/24 (vip), 10.217.1.2/24 (gw1), 10.217.1.3/24 (gw2)
igb0.100 = 10.217.110.1/24 (vip), 10.217.110.2/24 (gw1), 10.217.110.3/24 (gw2)

When gw1 is master and gw2 is standby, I can access 10.217.1.2 from 10.217.110.125/24 just fine. I can't access 10.217.1.3/24 from that station, but I can access 10.217.110.3 just fine.

If I switch gw1 to standby and gw2 to master, I can't access 10.217.1.2 from 10.217.110.125 anymore, but I can access 10.217.110.2.

In the spirit of debugging, I have now tested this about 10 times using the 'persistent CARP maintenance mode' on gw1.

The pfSense nodes are in HA cluster mode, serving those subnets.

The symptoms are:

- The login page opens, but no matter how long I wait, it won't log in over the web UI. (The TCP connection is established, but the login does not complete.)
- The SSH connection behaves the same: TCP establishes, but the actual login won't complete. The few rare times it does, it kicks you out with 'Write failed: Broken pipe' after some seconds.
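
One hypothesis that would fit these symptoms, though it is not confirmed anywhere in this thread, is an asymmetric return path: the request to the standby's address in the other subnet is routed through the master, but the standby also has an interface in the client's subnet and may answer directly from there, so the master never sees the replies. The Python sketch below only classifies the paths under that assumption. It uses the addresses posted above (taking gw2's igb0.100 address as 10.217.110.3, the one used in the earlier ssh test) and assumes the client's default gateway is the CARP VIP held by the master. `classify` and `owner_of` are illustrative helpers.

```python
import ipaddress

# Interface addresses as posted; gw2's igb0.100 address is read as 10.217.110.3.
NODES = {
    "gw1": ["10.217.1.2/24", "10.217.110.2/24"],
    "gw2": ["10.217.1.3/24", "10.217.110.3/24"],
}

def owner_of(target: str) -> str:
    """Return the node that owns the given interface address."""
    for node, addrs in NODES.items():
        if any(str(ipaddress.ip_interface(a).ip) == target for a in addrs):
            return node
    raise ValueError(f"{target} is not a node interface address")

def classify(client_cidr: str, target: str, master: str) -> str:
    """Classify the path from client to target under the stated assumptions."""
    client = ipaddress.ip_interface(client_cidr)
    if ipaddress.ip_address(target) in client.network:
        return "direct"  # same L2, no routing involved
    node = owner_of(target)
    if node == master:
        return "routed-symmetric"  # the master routes and answers itself
    # The target belongs to the standby: does it also sit in the client's subnet?
    replies_directly = any(
        client.ip in ipaddress.ip_interface(a).network for a in NODES[node])
    return "routed-asymmetric" if replies_directly else "routed-symmetric"

CLIENT = "10.217.110.125/24"
for master in ("gw1", "gw2"):
    for target in ("10.217.1.2", "10.217.1.3", "10.217.110.2", "10.217.110.3"):
        print(f"master={master} target={target}: {classify(CLIENT, target, master)}")
```

Under these assumptions, the combinations classified as routed-asymmetric are exactly the ones reported as failing above. That makes the idea worth checking with a packet capture on both nodes, but it does not prove it.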

Derelict

Sounds like you might not be setting the clients to use the CARP VIP as the gateway.

cmouse

Unfortunately the CARP VIP is used. I think I'll just accept that it refuses to work over L3.

Derelict

It works fine. Maybe your switches aren't moving the CARP MAC address like they should.

cmouse

That would imply that nothing would work, but the problem is limited to the standby node only. Internet works, other resources on the other L2 work, so the gateway MAC cannot be blamed.

Derelict

Telling you, bro, it all works. You have something hosed up or are misunderstanding something.

cmouse

No doubt. Just would be nice to know what.

bpina

Hello,

I have the same issue here. I'm using pfSense 2.4.4.
From inside the pfSense network I can access the standby node without any problem.
When I try to access the standby node from a different network, HTTPS access becomes unresponsive.

cmouse, have you found a way to overcome this issue?