HA failover / failback problem with primary router losing default route
-
We are having an issue with a failover cluster of two routers. Both are Netgate 6100s. A manual failover from primary to secondary by entering persistent maintenance mode on the primary works fine. Failing back from secondary to primary by exiting persistent maintenance mode on the primary doesn't work gracefully. CARP addresses migrate back to primary fine. In fact the whole CARP and master vs backup portion of this seems to work flawlessly in that each router is designated as master or backup when it should be. The problem is that the primary router loses the default route from the routing table, so no traffic can get out to the internet.
A simple action like editing the gateway on the primary in System -> Routing and saving it, without actually changing anything, then applying changes, causes the default route to be added back to the routing table on the primary and restores connectivity.
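For anyone wanting to confirm the symptom before and after the "edit and save" trick, here is a minimal sketch that checks a routing table dump for a default route. It assumes you paste in the IPv4 table as printed by something like `netstat -rn -f inet` on the firewall. The sample tables, the interface name `igc3` and the gateway 150.1.1.1 are made up for illustration and are not taken from the actual routers.

```python
def has_default_route(netstat_output: str) -> bool:
    """Return True if any row of the routing table has a default destination."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # The destination is the first column; the default route is
        # printed as "default" (or 0.0.0.0/0 on some systems).
        if fields and fields[0] in ("default", "0.0.0.0/0"):
            return True
    return False


# Hypothetical table from a healthy node (gateway and interface are placeholders).
HEALTHY = """\
Destination        Gateway            Flags     Netif Expire
default            150.1.1.1          UGS        igc3
10.3.253.0/30      link#4             U          igc3
192.168.1.0/24     link#9             U           ix0
"""

# Hypothetical table after failback: same routes, but the default row is gone.
BROKEN = """\
Destination        Gateway            Flags     Netif Expire
10.3.253.0/30      link#4             U          igc3
192.168.1.0/24     link#9             U           ix0
"""

print(has_default_route(HEALTHY))
print(has_default_route(BROKEN))
```

Running the check right after exiting maintenance mode, and again after re-saving the gateway, would show whether the default route really is the only thing that changes.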
I suspect this might be happening because we only have a single public IP address to work with. That public IP address is the CARP address, and the WAN ports on both routers use private IP addresses. I will give a detailed configuration below, but this setup is somewhat non-standard and I suspect I've missed a setting somewhere.
Physical connections:
Our ISP's demarc gives us a Cat6 handoff. This is attached to port 2 on a perimeter switch. The primary router's WAN port is attached to port 3 on the perimeter switch. The secondary router's WAN port is attached to port 4 on the perimeter switch.
On the LAN side, the primary router's LAN port (which uses one of the SFP ports) is attached to an SFP port on our internal switch. The secondary router's LAN port (the same SFP port as on the primary) is attached to another SFP port on the same internal switch. We are using a direct attach copper cable for this.
Router configurations:
Primary Router WAN address: 10.3.253.1/30
Secondary Router WAN address: 10.3.253.2/30
On both interfaces, the option to block private and loopback addresses is unchecked. The option to block bogons is still checked.
WAN CARP Address (changed for anonymity): 150.1.1.2/27
Primary LAN interface address: 192.168.1.253/24
Secondary LAN interface address: 192.168.1.254/24
LAN CARP address: 192.168.1.1/24
In System -> Routing the WAN gateway has this setting enabled: Use non-local gateway through interface specific route.
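To illustrate why the non-local gateway option is needed in this layout, here is a small sketch using Python's standard `ipaddress` module. The WAN and CARP addresses are the ones listed above. The ISP gateway address 150.1.1.1 is an assumption made up for the example, since the real one is not given.

```python
import ipaddress

# Addresses from the configuration above.
wan_primary = ipaddress.ip_interface("10.3.253.1/30")
wan_secondary = ipaddress.ip_interface("10.3.253.2/30")
carp_vip = ipaddress.ip_interface("150.1.1.2/27")

# Assumption: a placeholder ISP gateway inside the CARP subnet.
isp_gateway = ipaddress.ip_address("150.1.1.1")


def is_on_link(gateway, interface):
    """True if the gateway lies inside the subnet configured on the interface."""
    return gateway in interface.network


# Both WAN interfaces share the same private /30.
print(wan_primary.network == wan_secondary.network)

# The gateway is outside the WAN interface subnet, so the OS has no
# connected route to it unless an interface-specific route is added.
print(is_on_link(isp_gateway, wan_primary))

# The gateway is only "local" from the point of view of the CARP VIP subnet.
print(is_on_link(isp_gateway, carp_vip))
```

Under these assumptions the gateway is reachable only through the interface-specific route, which may be relevant if that route is not re-created on failback.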
In Firewall -> NAT -> Outbound NAT, we are set to Manual Outbound NAT. The auto generated rules were retained but modified so that the outbound public address is the CARP IP instead of the WAN interface address.
Our DHCP servers on the primary reference the secondary on its interface address as a failover peer (192.168.1.254). The DHCP servers on the secondary reference the primary as a failover peer on the primary's interface address (192.168.1.253).
In System -> Routing the default gateway is explicitly set to our ISP's gateway rather than to Automatic. Our WAN is statically addressed by us and does not receive a DHCP address from the ISP.
We are using pfSense Plus version 23.09.1.
Looking for a solution besides purchasing two additional static IP addresses. We will consider doing that in the long term, but I have some other locations where we will not be able to do that, so I need this kind of configuration to work smoothly.
-
@bp81
Did you configure a gateway group workaround that allows the backup node to access the internet?
If not and you don't need it, go to the WAN gateway settings and check "Disable Gateway Monitoring Action". See if it helps.
-
@viragomann
We don't have a gateway group workaround implemented right now. We also already have "Disable Gateway Monitoring Action" checked.
Two things occur to me.
- Should we take this a bit further and disable gateway monitoring entirely?
- Should we set the default gateway to Automatic instead of specifying our gateway as default? We only have a single WAN here so it's not as if this practically matters.
Also, I wasn't aware there was a workaround for the secondary to access the internet while it's the backup. I might look into this as well.
-
@bp81 said in HA failover / failback problem with primary router losing default route:
Should we take this a bit further and disable gateway monitoring entirely?
I think disabling the gateway monitoring action (so the gateway is always considered up) should be sufficient.
Should we set the default gateway to Automatic instead of specifying our gateway as default? We only have a single WAN here so it’s not as if this practically matters.
It might help.
Also I wasn’t aware there was a work around for the secondary to access the internet while it’s the Backup
Without this, the primary cannot access the internet (that is, use the default gateway). This could also be a reason for your issue.
-
@viragomann
Do you happen to have any information on how to set up the gateway group workaround? I've been searching for information on this and haven't found it yet.
-
@bp81
It might seem a bit complicated, and to be honest I have never set this up myself, so it's theory for me as well. But let's try.
The point is that the box in CARP backup state cannot access the internet, since you have only a single WAN IP, which is occupied by the master node.
The idea, however, is that the backup node can use the master's LAN address as its internet gateway during this time.
In case the master role swaps over to the other node, the second one should also be able to access the internet.
Primary LAN interface address: 192.168.1.253/24
Secondary LAN interface address: 192.168.1.254/24
Primary:
- System > High Availability: Disable the sync of "Static Route configuration".
- System > Routing > Gateways: Add a gateway with the secondary LAN IP 192.168.1.254. State a public monitoring IP (e.g. 1.1.1.1). Note that this adds a static route for the monitoring IP. Hence you cannot access this IP from behind the firewalls anymore.
- Enable monitoring of the WAN gateway.
- Go to the Gateway Groups and create a gateway group. Set the WAN gateway as Tier 1 and the secondary LAN as Tier 2.
- Go to the Gateways tab and select the gateway group as default gateway.
Secondary:
It's almost the same here:
- System > Routing > Gateways: Add a gateway with the primary LAN IP 192.168.1.253. State a public monitoring IP. Can be the same as on the primary.
- Enable monitoring of the WAN gateway.
- Go to the Gateway Groups and create a gateway group. Set the WAN gateway as Tier 1 and the primary LAN as Tier 2.
- Go to the Gateways tab and select the gateway group as default gateway.
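The intended behaviour of the two-tier group described above can be sketched roughly as follows. This is only a simplified model of "use the lowest tier that is up", written for illustration. It is not pfSense's actual selection code, and the gateway names and the WAN gateway address 150.1.1.1 are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gateway:
    name: str
    address: str
    tier: int
    up: bool  # result of gateway monitoring


def pick_gateway(group: List[Gateway]) -> Optional[Gateway]:
    """Return the online gateway with the lowest tier number, or None."""
    online = [gw for gw in group if gw.up]
    if not online:
        return None
    return min(online, key=lambda gw: gw.tier)


def secondary_group(wan_up: bool) -> List[Gateway]:
    """Gateway group as it would look on the secondary node."""
    return [
        # Placeholder ISP gateway; only reachable while this node holds the CARP VIP.
        Gateway("WAN_GW", "150.1.1.1", tier=1, up=wan_up),
        # The primary's LAN interface address, used as the fallback path.
        Gateway("PRIMARY_LAN", "192.168.1.253", tier=2, up=True),
    ]


# Secondary is CARP master: monitoring through WAN succeeds, Tier 1 is used.
print(pick_gateway(secondary_group(wan_up=True)).name)

# Secondary is CARP backup: WAN monitoring fails, so it falls back to the primary's LAN IP.
print(pick_gateway(secondary_group(wan_up=False)).name)
```

The primary would mirror this with 192.168.1.254 as its Tier 2 gateway. Note that the fallback only works if gateway monitoring is enabled, since the tier switch depends on the WAN gateway being detected as down.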
-
We just ended up adding additional IP addresses. It was easier to do that than to experiment on this after hours.