Not receiving down emails multi-wan in failover config in 24.03 SG1100

Mission-Ghost

@Mission-Ghost here is my current routing table; I’m confused how it’s working at all:


IPv4 Routes
default	100.64.0.1	UGS	7	1500	mvneta0.4090	
1.0.0.2	100.64.0.1	UGHS	25	1500	mvneta0.4090	
1.1.1.2	100.64.0.1	UGHS	25	1500	mvneta0.4090	
9.9.9.9	100.64.0.1	UGHS	25	1500	mvneta0.4090	
10.10.10.1	link#7	UH	22	16384	lo0	
34.120.255.244	link#12	UHS	4	1500	mvneta0.4092	
100.64.0.0/10	link#12	U	9	1500	mvneta0.4092	
100.68.183.215	link#7	UHS	1	16384	lo0	
100.117.61.17	link#7	UHS	3	16384	lo0	
127.0.0.1	link#7	UH	2	16384	lo0	
149.112.112.112	100.64.0.1	UGHS	25	1500	mvneta0.4090	
192.168.2.0/24	link#11	U	5	1500	mvneta0.4091	
192.168.2.1	link#7	UHS	12	16384	lo0	
192.168.10.0/24	link#13	U	15	1500	mvneta0.10	
192.168.10.1	link#7	UHS	11	16384	lo0	
192.168.20.0/24	link#14	U	10	1500	mvneta0.20	
192.168.20.1	link#7	UHS	13	16384	lo0	
192.168.30.0/24	link#15	U	14	1500	mvneta0.30	
192.168.30.1	link#7	UHS	17	16384	lo0	
192.168.40.0/24	link#16	U	16	1500	mvneta0.40	
192.168.40.1	link#7	UHS	19	16384	lo0	
192.168.50.0/24	link#17	U	18	1500	mvneta0.50	
192.168.50.1	link#7	UHS	21	16384	lo0	
192.168.60.0/24	link#18	U	20	1500	mvneta0.60	
192.168.60.1	link#7	UHS	23	16384	lo0	
192.168.100.0/24	link#12	U	8	1500	mvneta0.4092	
192.168.100.1	link#12	UHS	4	1500	mvneta0.4092	
192.168.100.2	link#7	UHS	3	16384	lo0	
206.214.239.195	link#10	UHS	6	1500	mvneta0.4090

stephenw10

Ah, yes, having the same gateway on both WANs is a problem. It can partially work for some things that are routed via a specific interface, but many things will not work since the system routes to the gateway IP.

I don't see the static routes to 8.8.8.8 or 8.8.4.4. Are those normally there? They should be unless you have deliberately checked the option not to add them.

But potentially gateway monitoring might not be using the correct link there.

Commonly, one WAN would be put behind the ISP router NATing the connection to work around this.
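
One way to check for the monitoring routes mentioned above is to scan the routing table for host routes to the monitor IPs. The following is only a minimal Python sketch. It assumes the table has been pasted as whitespace-separated text in the column order shown in this thread (destination, gateway, flags, a counter, MTU, interface). The function name `find_host_routes` and the sample rows are illustrative and not part of pfSense.

```python
def find_host_routes(table_text, monitor_ips):
    """Map each monitor IP to (gateway, interface), or None if no route is listed."""
    found = {ip: None for ip in monitor_ips}
    for line in table_text.splitlines():
        fields = line.split()
        # Assumed column order: destination, gateway, flags, counter, MTU, interface
        if len(fields) < 6:
            continue
        destination, gateway, interface = fields[0], fields[1], fields[5]
        if destination in found:
            found[destination] = (gateway, interface)
    return found


# Sample rows copied from the routing table posted earlier in the thread.
SAMPLE = """\
default\t100.64.0.1\tUGS\t7\t1500\tmvneta0.4090
1.0.0.2\t100.64.0.1\tUGHS\t25\t1500\tmvneta0.4090
9.9.9.9\t100.64.0.1\tUGHS\t25\t1500\tmvneta0.4090
"""

for ip, route in find_host_routes(SAMPLE, ["8.8.8.8", "8.8.4.4", "9.9.9.9"]).items():
    print(ip, "->", route if route else "no host route listed")
```

A monitor IP that maps to `None` has no host route in the pasted table, which is the situation described for 8.8.8.8 and 8.8.4.4.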

Mission-Ghost

@stephenw10 said in Not receiving down emails multi-wan in failover config in 24.03 SG1100:

Ah, yes, having the same gateway on both WANs is a problem. It can partially work for some things that are routed via a specific interface, but many things will not work since the system routes to the gateway IP.

I don't see the static routes to 8.8.8.8 or 8.8.4.4. Are those normally there? They should be unless you have deliberately checked the option not to add them.

But potentially gateway monitoring might not be using the correct link there.

Commonly, one WAN would be put behind the ISP router NATing the connection to work around this.

I don't know if the static routes to 8.8.8.8 and 8.8.4.4 are normally there. I've never had to know and haven't noted it previously. It just worked. (And it does appear to just work; monitoring seems to be functioning as expected.)

It seems to see when one gateway drops offline without mistaking it for the other.

I have made sure the System/Routing/Gateways/Edit Static Route "Do not" box is UNchecked on both gateways. I never deliberately checked it, as I saw no specific reason to do so.

Does the absence of static routes to the monitoring addresses mean something is not working properly?

Should I add a static route to the monitoring IPs given maybe it's not doing it as expected? Could this be a bug?

I've never (competently) set up static routes myself and don't know much about how to do it.

Unfortunately, I have no direct configuration or router control over the address Starlink assigns to the dish, which is the gateway. A quick search hasn't suggested that there is even a sketchy unsupported way to do it using just Starlink's hardware.

Is there a way to work around it in pfSense?

Both Starlink routers are already in bypass mode. This means their router is completely disabled and the traffic passes straight from the dish to the pfSense SG1100 via Ethernet without their router interfering.

Starlink uses CGNAT in their network (it's actually double-NATed [at least], but that hasn't seemed to be an issue with our LAN use case).
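
The CGNAT addressing is visible in the routing table above: the gateway 100.64.0.1 and the two local 100.x addresses all sit inside 100.64.0.0/10, the shared address space reserved for carrier-grade NAT (RFC 6598). Here is a small Python check using the standard `ipaddress` module. The labels are my own reading of the table (I am assuming the two 100.x entries on lo0 are the WAN addresses), so treat it as a sketch.

```python
import ipaddress

# RFC 6598 shared address space used for carrier-grade NAT.
CGNAT_SPACE = ipaddress.ip_network("100.64.0.0/10")


def in_cgnat_space(address):
    """Return True if the address falls inside 100.64.0.0/10."""
    return ipaddress.ip_address(address) in CGNAT_SPACE


# Labels are assumptions based on the routing table in this thread.
ADDRESSES = [
    ("shared gateway", "100.64.0.1"),
    ("local address 1", "100.68.183.215"),
    ("local address 2", "100.117.61.17"),
    ("dish management", "192.168.100.1"),
]

for label, addr in ADDRESSES:
    print(f"{label:16} {addr:16} in CGNAT space: {in_cgnat_space(addr)}")
```

Because both dishes hand out addresses from the same /10 with the same gateway, the firewall has no address-based way to tell the two upstream links apart.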

stephenw10

@Mission-Ghost said in Not receiving down emails multi-wan in failover config in 24.03 SG1100:

Both Starlink routers are already in bypass mode.

Exactly. If you take one of them out of bypass mode it will become the gateway for that WAN with a different local subnet and thus pfSense will see a different gateway IP.

Mission-Ghost

@stephenw10

I've tried an experiment with virtual IPs but admit I don't know what I'm doing. This level of networking exceeds my experience, most of which is 30 years old.

I based my efforts on Tech with Shae's video on gaining access to the Starlink stats pages on the dish's own subnet 192.168.100.0/24 (stats pages now apparently removed by Starlink)...but still, maybe it might help pfSense find a consistent way out.

I have virtual IPs for each dish interface (A=192.168.100.2; B=192.168.100.3):

Virtual IP Address
192.168.100.2/32 	STARLINKA  	IP Alias 	Starlink A dish management interface subnet access. 	
10.10.10.1/32 	Localhost  	IP Alias 	pfB DNSBL - DO NOT EDIT 	
192.168.100.3/32 	STARLINKB  	IP Alias 	Starlink B dish management interface subnet access.

I've then set up hybrid outbound NAT rules for both my network management .10 subnet and This Firewall (self), based on Shae's technique:

Mappings
		Interface 	Source 	Source Port 	Destination 	Destination Port 	NAT Address 	NAT Port 	Static Port 	Description 	Actions
		STARLINKA 	192.168.10.0/24 	* 	192.168.100.0/24 	* 	192.168.100.2 (Starlink A dish management interface subnet access.) 	* 		Starlink A dish management interface subnet access. 	
		STARLINKA 	This Firewall (self) 	* 	192.168.100.0/24 	* 	192.168.100.2 (Starlink A dish management interface subnet access.) 	* 		Starlink A dish management interface subnet access. 	
		STARLINKB 	192.168.10.0/24 	* 	192.168.100.0/24 	* 	192.168.100.3 (Starlink B dish management interface subnet access.) 	* 		Starlink B dish management interface subnet access. 	
		STARLINKB 	This Firewall (self) 	* 	192.168.100.0/24 	* 	192.168.100.3 (Starlink B dish management interface subnet access.) 	* 		Starlink B dish management interface subnet access. 	
		STARLINKA 	This Firewall (self) 	123 (NTP) 	* 	123 (NTP) 	STARLINKA address 	* 		NAT firewall NTP via Starlink A. Used to ensure NTP works w/o IPv6 errors. 	
		STARLINKB 	This Firewall (self) 	123 (NTP) 	* 	123 (NTP) 	STARLINKB address 	* 		NAT firewall NTP via Starlink B. Used to ensure NTP works w/o IPv6 errors. 	
		STARLINKA 	Addresses_All_VLANs 	Ports__VoIP_WiFi_Calling 	* 	Ports__VoIP_WiFi_Calling 	STARLINKA address 	* 		Static mapping WiFi-calling ports on Starlink. 	
		STARLINKA 	Addresses_Guest_Games_Network 	Ports__Games 	* 	Ports__Games 	STARLINKA address 	* 		Static mapping Games' ports on Starlink. 	
		STARLINKB 	Addresses_All_VLANs 	Ports__VoIP_WiFi_Calling 	* 	Ports__VoIP_WiFi_Calling 	STARLINKB address 	* 		Static mapping WiFi-calling ports on T-Mobile. 	
		STARLINKB 	Addresses_Guest_Games_Network 	Ports__Games 	* 	Ports__Games 	STARLINKB address 	* 		Static mapping Games' ports on T-Mobile.

What I've found is I now have a default route for each specific hardware interface for each dish (mvneta0.4092 and 4090) but the other monitoring and DNS routes appear to only be going out the A(lpha) Starlink dish on 4090, even the ones that should go out the B(ravo) dish on 4092:

IPv4 Routes
default	100.64.0.1	UGS	0	1500	mvneta0.4092	
default	100.64.0.1	UGS	0	1500	mvneta0.4090	
1.0.0.2	100.64.0.1	UGHS	29	1500	mvneta0.4090	
1.1.1.2	100.64.0.1	UGHS	29	1500	mvneta0.4090	
8.8.4.4	100.64.0.1	UGHS	29	1500	mvneta0.4090	
8.8.8.8	100.64.0.1	UGHS	29	1500	mvneta0.4090	
9.9.9.9	100.64.0.1	UGHS	29	1500	mvneta0.4090	
[...pfBlocker removed for brevity...]
34.120.255.244	link#12	UHS	4	1500	mvneta0.4092	
100.64.0.0/10	link#12	U	9	1500	mvneta0.4092	
100.68.183.215	link#7	UHS	7	16384	lo0	
100.117.61.17	link#7	UHS	3	16384	lo0	
127.0.0.1	link#7	UH	2	16384	lo0	
149.112.112.112	100.64.0.1	UGHS	29	1500	mvneta0.4090	
192.168.2.0/24	link#11	U	5	1500	mvneta0.4091	
192.168.2.1	link#7	UHS	12	16384	lo0	
[...VLANs removed for brevity...]
192.168.100.0/24	link#12	U	8	1500	mvneta0.4092	
192.168.100.1	link#12	UHS	4	1500	mvneta0.4092	
192.168.100.2	link#7	UHS	3	16384	lo0	
192.168.100.3	link#7	UH	30	16384	lo0	
206.214.239.195	link#10	UHS	24	1500	mvneta0.4090

This is confusing, as the monitoring for the Bravo dish seems to be working correctly...latency is different from the Alpha dish and so on.

So now I'm way out on a limb I don't understand and I don't know what's going on.

Any advice and insights would be appreciated. Thanks.
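
The conflict in the table above can be made explicit by grouping routes by gateway IP and listing the interfaces each gateway appears on. This is a minimal Python sketch. The tuples are hand-copied from the second routing table, and the function name `gateways_on_multiple_interfaces` is hypothetical.

```python
from collections import defaultdict


def gateways_on_multiple_interfaces(routes):
    """Return {gateway_ip: sorted interfaces} for gateways seen on more than one interface."""
    interfaces_by_gateway = defaultdict(set)
    for _destination, gateway, interface in routes:
        # "link#N" entries are directly connected routes, not gateway IPs.
        if gateway.startswith("link#"):
            continue
        interfaces_by_gateway[gateway].add(interface)
    return {
        gateway: sorted(interfaces)
        for gateway, interfaces in interfaces_by_gateway.items()
        if len(interfaces) > 1
    }


# (destination, gateway, interface) rows taken from the table above.
ROUTES = [
    ("default", "100.64.0.1", "mvneta0.4092"),
    ("default", "100.64.0.1", "mvneta0.4090"),
    ("8.8.8.8", "100.64.0.1", "mvneta0.4090"),
    ("9.9.9.9", "100.64.0.1", "mvneta0.4090"),
    ("192.168.100.0/24", "link#12", "mvneta0.4092"),
]

print(gateways_on_multiple_interfaces(ROUTES))
```

Any gateway reported here is reachable through more than one interface, so a route that names only the gateway IP does not say which link it will use.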

Mission-Ghost

@stephenw10 said in Not receiving down emails multi-wan in failover config in 24.03 SG1100:

@Mission-Ghost said in Not receiving down emails multi-wan in failover config in 24.03 SG1100:

Both Starlink routers are already in bypass mode.

Exactly. If you take one of them out of bypass mode it will become the gateway for that WAN with a different local subnet and thus pfSense will see a different gateway IP.

Unfortunately, then I would have the Starlink router blasting unneeded and unwanted WiFi all over the premises, and it won't send the Internet traffic into the pfSense router via the Ethernet adapter.

Bypass = Ethernet to pfSense
No-bypass = WiFi and no Ethernet

stephenw10

Wow there's no Ethernet when it's acting as a router? That sucks.

Some other router in between might be your only option then. Two gateways with the same address is a conflict and can never work correctly. The only exception to that is for PPPoE links because they are point-to-point connections. But even then some things will misbehave.

Mission-Ghost

@stephenw10 said in Not receiving down emails multi-wan in failover config in 24.03 SG1100:

Wow there's no Ethernet when it's acting as a router? That sucks.

Some other router in between might be your only option then. Two gateways with the same address is a conflict and can never work correctly. The only exception to that is for PPPoE links because they are point-to-point connections. But even then some things will misbehave.

I think I misspoke; the Ethernet apparently does work when their router is enabled, and from some more reading it appears the SL router's DHCP serves a different IP address range to the Ethernet port, in the 192.168.1.0/24 set. Ok, yay. In bypass mode, the SL router's DHCP server is disabled and the dish's own 192.168.100.1 address is served to pfSense directly from the dish, so pfSense gets the same IP from each dish.

It certainly does suck, though, because I still want the WiFi completely off. I have my own access points and the SL WiFi will pollute the airwaves with traffic that is useless to me. I'll have to keep looking to see if there's a way to shut off WiFi without bypass mode, so I can keep the SL DHCP server delivering different IP addresses than the dish range. So far it doesn't seem so.

Starlink as a company behaves as if it were founded by a control freak. Strange.

While they have built a groundbreaking and well-functioning service in many respects, their terrestrial consumer-facing engineering seems to be where they assign the unpaid summer interns.

I appreciate your help and attention. Best regards.

Mission-Ghost

@stephenw10 this is an interesting development which may suggest there's a bug. Note: this is now on a Netgate 4200; not the 1100 I previously used.

Following the global Starlink outage, I dropped my Starlink backup Internet on the Bravo gateway and adopted T-Mobile Home Internet (TMHI) as my backup.

To do this, I added a new interface Opt8, renaming it and setting it up for the TMHI. Once the TMHI router arrived, I disabled the Starlink (Bravo) and enabled the TMHI interface, made adjustments to rules and gateway Bravo and misc affected things, NOT including DNS, and it worked fine.

Except the multi-WAN 'down' notification emails stopped functioning again. When the TMHI interface (or RJ-45 jack, I tried both tests) was restored, I'd see the following message in the system log: "Error: Failed to connect to ssl://smtp.gmail.com:465 [SMTP: Failed to connect socket: php_network_getaddresses: getaddrinfo for smtp.gmail.com failed: Name does not resolve (code: -1, response: )]"

Ok, so it's a DNS problem.
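
The log message names `getaddrinfo`, so one way to reproduce the failure outside the notification code is to make the same kind of lookup through the system resolver. Below is a minimal Python sketch using the standard `socket` module. The function name `can_resolve` and the injectable `resolver` parameter are my own additions for illustration, and this assumes Python is available on whatever host you run it from.

```python
import socket


def can_resolve(hostname, port=465, resolver=socket.getaddrinfo):
    """Return (True, sorted addresses) if the name resolves, else (False, error text)."""
    try:
        results = resolver(hostname, port)
    except socket.gaierror as err:
        # This is the same class of failure reported in the system log.
        return False, str(err)
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in results})
    return True, addresses


if __name__ == "__main__":
    ok, detail = can_resolve("smtp.gmail.com")
    print("resolved" if ok else "lookup failed", detail)
```

Running it once with the backup WAN up and once with it disabled would show whether the system resolver itself loses name resolution during the failover window.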

I troubleshot the issue, checking various DNS-related configurations (which I had not changed during the migration from StarlinkB to TMHI), without success.

On System>General Setup, the DNS servers for the TMHI Bravo gateway were properly updated in the DNS Servers gateway pull-down to the new and correct TMHI Bravo Opt8 192.168.12.1 gateway.

I decided to test a hypothesis: that a bug somewhere prevented these automatically reconfigured DNS server entries from working for the down-gateway email DNS lookups (in this case for the Google SMTP server) when the gateway failed (in this case, I disabled it on the Interfaces page for the TMHI gateway), and that deleting the DNS entries and adding them back exactly as before would end the problem.

So I deleted the two Bravo DNS server entries from the System>General Setup, saved, then added them both back (9.9.9.9 and 1.1.1.2), saved, and then re-tried my test of disabling the TMHI interface, and my hypothesis was confirmed. I received the 'omitting' email I had not received since changing from Starlink to TMHI.

I enabled the TMHI interface and received the 'adding' email, also which I had not received since switching to TMHI.

Therefore, it appears deleting the DNS server records for the changed interface/gateway and adding them back in without change solved the problem.

This suggests that changing the interface assigned to a network port/jack and creating a new gateway on that interface fails to correctly adjust the DNS server records in System>General Setup during that process. It must handle them differently than when the DNS server records are added new. This seems to be a bug.

Note 1: prior to the above experiment I tried disabling the interface and using the Diagnostics>DNS Lookup function for smtp.google.com immediately after and it always properly returned the DNS records. Same after re-enabling. So the DNS lookup for the wan-down email was failing even though the Diagnostics>DNS Lookup worked.

Note 2: as an aside, after the global Starlink outage last week, once the two dishes I was running came back up, the Tier 1 dish was not selected as the default. The default was instead assigned to the Tier 2 dish.

I didn't know this because service came back up and I didn't sign on to pfSense to check that it worked properly.

The Tier 2 dish was on a backup plan limited to 10GB per month.

Because the Tier 1 dish was incorrectly never restored as the default, the Tier 2 dish plan was completely consumed in a day.

So something associated with both dishes coming back online within a short time of each other caused this erroneous selection of Tier 2 as the default even though Tier 1 was also back up and the Gateway Group in failover configuration was not honored. This too might be a bug.
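
For reference, the behavior I expected from the failover group can be stated in a few lines: among the gateways that are currently up, use the one with the lowest tier number. This Python sketch only illustrates that expectation. It is not pfSense's actual implementation, and the gateway names and dictionary layout are hypothetical.

```python
def pick_default_gateway(gateways):
    """Return the name of the online gateway with the lowest tier number, or None."""
    online = [gw for gw in gateways if gw["online"]]
    if not online:
        return None
    return min(online, key=lambda gw: gw["tier"])["name"]


# Hypothetical gateway group mirroring the two-dish setup described above.
GROUP = [
    {"name": "ALPHA", "tier": 1, "online": True},
    {"name": "BRAVO", "tier": 2, "online": True},
]

print(pick_default_gateway(GROUP))
```

Under this rule, once both dishes were back up the Tier 1 gateway should have become the default again, which is not what I observed.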

Thanks.

stephenw10

Hmm, you should be able to check that. When you add a server there it should be added to /etc/resolv.conf.

If it has a gateway set for it you should see a static route added for the server IP via that gateway in the routing table (Diag > Routes).
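
The two checks above can be combined: read the `nameserver` lines from /etc/resolv.conf and compare them against the destinations that have routes. This is a rough Python sketch under some assumptions. The sample file contents and the set of routed destinations are made up for illustration, and only servers that have a gateway set would be expected to have a static route.

```python
def nameservers(resolv_conf_text):
    """Return the IPs listed on 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop trailing '#' comments, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def missing_routes(servers, routed_destinations):
    """Return the servers that have no entry among the routed destinations."""
    return [ip for ip in servers if ip not in routed_destinations]


# Hypothetical file contents and route destinations, for illustration only.
SAMPLE_RESOLV_CONF = """\
nameserver 127.0.0.1
nameserver 9.9.9.9
nameserver 1.1.1.2
"""
SAMPLE_ROUTED = {"127.0.0.1", "9.9.9.9"}

servers = nameservers(SAMPLE_RESOLV_CONF)
print("configured:", servers)
print("no route found for:", missing_routes(servers, SAMPLE_ROUTED))
```

On a real system the text would come from /etc/resolv.conf and the destinations from Diag > Routes, before and after deleting and re-adding the DNS server entries.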