Minor issue with dpinger monitor in HA configuration
-
I have a pair of routers running in a cluster. All functions work as they should, with a single exception: the gateway monitor does not work on the secondary router until it becomes master. While the secondary is not master, it reports 100% packet loss. We only have a single WAN connection at this location and have the gateway monitoring action disabled, so it doesn't really matter in practice. The secondary, even with dpinger not working correctly, still has internet access and can install packages and pull in updates like normal. Basically everything works as you would expect, except dpinger.
I have another location with a cluster that has the exact same configuration in terms of outbound NAT rules and interface IP addressing and dpinger works there.
I have noticed if I add an outbound NAT rule on my secondary that is Source: This Firewall, any protocol, any destination, attached to WAN interface, and set to use WAN address (NOT the CARP VIP address), then dpinger starts working on the secondary.
This rule does not exist on my other cluster and dpinger works there without this special NAT rule. I'm not sure what else is different between the two clusters. I am including basic configuration info below. I am not including my actual WAN IP addresses. I am listing arbitrary private space IP addresses for my WAN addresses, but I do have actual public IPs there.
Cluster where dpinger is not working:
WAN CARP IP: 10.1.250.50/24
WAN Primary: 10.1.250.51/24
WAN Secondary: 10.1.250.52/24
WAN Gateway: 10.1.250.1/24
LAN CARP IP: 192.168.250.1/24
LAN Primary: 192.168.250.253/24
LAN Secondary: 192.168.250.254/24
Dpinger is configured to continuously ping the WAN gateway IP (10.1.250.1). This works on the primary and does not work on the secondary. It does work on the secondary if the secondary becomes master, and it stops working once the secondary returns to backup. If I add the aforementioned outbound NAT rule, dpinger works on the secondary while it is in backup state.
The configuration of my other cluster that doesn't have this problem is completely identical, except it does not need the aforementioned outbound NAT rule for dpinger to work on its secondary while in backup state.
It's odd to me that the aforementioned NAT rule seems to help. My understanding has always been (perhaps incorrect) that choosing This Firewall as the source in an outbound NAT rule is the same as localhost subnet (127.0.0.0/8). There are outbound NAT rules for that already; changing them to use the WAN address instead of the WAN CARP address doesn't solve the problem. This makes me think that the source "This Firewall" is NOT the same as the localhost subnet.
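To make that suspicion concrete, here is a minimal Python sketch of the two kinds of rule match. It assumes that "This Firewall" behaves like pf's `self` keyword, meaning every address configured on the firewall's interfaces, and that the CARP VIP counts as configured on the secondary even while it is in backup state. Both are assumptions on my part, and the function names and address set are purely illustrative.

```python
import ipaddress

LOOPBACK = ipaddress.ip_network("127.0.0.0/8")

# Assumed set of addresses configured on the secondary.
# These are the placeholder addresses from the post, not real ones.
SECONDARY_ADDRESSES = {
    ipaddress.ip_address("127.0.0.1"),
    ipaddress.ip_address("10.1.250.52"),      # WAN interface address
    ipaddress.ip_address("10.1.250.50"),      # WAN CARP VIP (assumed configured while backup)
    ipaddress.ip_address("192.168.250.254"),  # LAN interface address
    ipaddress.ip_address("192.168.250.1"),    # LAN CARP VIP
}


def matches_loopback_rule(source):
    """True if a rule with source 127.0.0.0/8 would match this packet source."""
    return source in LOOPBACK


def matches_this_firewall_rule(source):
    """True if a rule with source 'This Firewall' would match, under the
    assumption that it covers every address configured on the box."""
    return source in SECONDARY_ADDRESSES


vip = ipaddress.ip_address("10.1.250.50")
print(matches_loopback_rule(vip))        # the loopback rule never sees this packet
print(matches_this_firewall_rule(vip))   # the 'This Firewall' rule does
```

If that model is right, it would explain why editing the loopback rules changes nothing while the "This Firewall" rule does: a packet sourced from the CARP VIP is not in 127.0.0.0/8, so only the broader rule gets a chance to rewrite it to the WAN address.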
Just trying to understand what's different here and why I would need two different outbound NAT configurations on two different clusters that are identical in their configuration as far as I can tell.
-
Additional information
On my cluster where dpinger is failing, I added a floating rule attached to the WAN interface and set it to match. I then set the rule to apply to basically any traffic outbound from the WAN interface and I set it to log connections in the system log. Then I restarted dpinger. Sure enough, dpinger on the secondary is trying to use the CARP address as its source (which the secondary doesn't have right now). So that explains that.
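Why the source address matters can be sketched as a tiny Python model. It is a simplification, not a description of how CARP or dpinger is implemented: the assumption is that a node always receives traffic for its own interface address, but receives traffic for the CARP VIP only while it is master. The function name is hypothetical and the addresses are the placeholders from this post.

```python
import ipaddress

# Placeholder addresses from the post (the real ones are public IPs).
WAN_CARP_VIP = ipaddress.ip_address("10.1.250.50")
WAN_SECONDARY = ipaddress.ip_address("10.1.250.52")


def reply_returns_to_sender(source, own_address, is_master):
    """Simplified model of whether an echo reply addressed to `source`
    comes back to the node that sent the probe.

    Assumption: a node always receives traffic for its own interface
    address, but receives traffic for the CARP VIP only while master.
    """
    if source == own_address:
        return True
    if source == WAN_CARP_VIP:
        return is_master
    return False


# Secondary in backup state, probing from the VIP: the replies go to the master.
print(reply_returns_to_sender(WAN_CARP_VIP, WAN_SECONDARY, is_master=False))
# Same node probing from its own WAN address: the replies come back.
print(reply_returns_to_sender(WAN_SECONDARY, WAN_SECONDARY, is_master=False))
```

Under this model the 100% loss on the backup node is expected whenever the probes leave with the VIP as their source, and it clears up as soon as the node becomes master or the source is rewritten to the node's own WAN address.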
I performed the same test on my cluster where dpinger is working correctly. On that cluster, the rule recorded that the secondary used the interface IP address as its source, NOT the CARP VIP like my misbehaving cluster did. The outbound NAT rules on both clusters are indeed THE SAME.
As an experiment, I tried changing the NAT rules on the secondary that use localhost / loopback as their source so that they NAT to the WAN interface IP and not the CARP VIP. No change; the secondary is still trying to use the CARP VIP that it doesn't have.
When I added the aforementioned NAT rule to the secondary, with This Firewall (self) as the source and any non-private IP address as the destination, dpinger on the secondary started using the WAN address instead of the WAN CARP VIP. However, this behavior doesn't occur on the primary when we manually fail over to make the secondary the master.
The only difference I can find between the two setups is that the working cluster runs on Netgate appliances with pfSense Plus 23.09.1, while the misbehaving cluster runs pfSense CE 2.7.2 on third-party hardware. Possibly a bug in CE that's not present in pfSense Plus? Regardless of the cause, I can only conclude that dpinger is behaving differently in one of my clusters than in the other.