Multi WAN Failover - DNS Queries and Open States Causing Traffic to Failover WAN

SoCalLen

Hi everyone! I haven't posted here before, but I've probably read through well over 100 threads already. :) I searched the forums exhaustively without finding any answers to my current confusion, and finally decided to post something. Thus far I do love pfSense, but this has been making me crazy… :o

I recently built a pfSense system with 1 primary WAN NIC ("WAN"), 1 backup Failover NIC ("OPT1"), and 1 LAN NIC. The issue I'm having relates to too much bandwidth being used on the failover NIC ("OPT1") during non-Failover conditions, due to states that haven't cleared after a Failover, or DNS queries going from the DNS Resolver to the OPT1 failover NIC.

System Setup:

I followed the Multi WAN instructions, and I set up a Gateway Group named "Failover" that consists of the following 2 gateways:

WAN_DHCP (Tier 1 - Primary - Default gateway)
Cable Modem (Arris SB6183)
Getting IP via DHCP of 76.175.X.X

OPT1_DHCP (Tier 2 – Failover)
LTE Modem (Netgear LB1120), using "Ting" (a T-Mobile MVNO)
Set up as "Bridge Mode", getting IP via DHCP of 21.231.X.X

pfSense itself uses an IP of 192.168.1.1 on the LAN, with a subnet mask of 255.255.255.0.

System/General Setup/DNS Servers are using OpenDNS and Google Public DNS, set as 208.67.220.220 and 8.8.8.8 for WAN_DHCP, and 208.67.222.222 and 8.8.4.4 for OPT1_DHCP.

I am using the default "DNS Resolver" service in pfSense, which is currently set to "Enable Forwarding Mode". At first, I wasn't able to get DNS resolution to work during failover at all, even with 8.8.4.4 and 208.67.222.222 set as the DNS servers for the OPT1_DHCP gateway (under System/General Setup). Once I checked "Enable Forwarding Mode" under DNS Resolver Options, clients were finally able to resolve DNS queries during a Failover state. After doing this, however, I started seeing DNS queries being sent over the inactive gateway (OPT1) even when the regular Tier 1 WAN was "up"; more on that under "Issue # 2" below.

My LTE backup connection is intended to be used somewhat infrequently, and I didn't pay for much bandwidth. The LTE Modem continuously reports a small but steady stream of usage, even long after a Failover state has ended. Looking at Diagnostics/States in pfSense (filtered by the "OPT1" interface), there appear to be 2 separate issues that contribute to this:

Issue # 1 – Straggler States

There are A LOT of connections that remain open from various clients over the Failover WAN (OPT1) long after a Failover situation has ended, because pfSense isn't killing those states automatically after the primary Tier 1 "WAN_DHCP" gateway comes back online. The biggest data hog was a Cisco "automatic VPN" client on my work laptop, which remained connected via OPT1 long after the Failover condition ended, and it managed to send and receive well over 200 MB of data via the OPT1 gateway instead of the restored WAN gateway. There were other, less data-intensive states that remained open from various other clients on the network. Even the Tivo DVR had an "ESTABLISHED" state that had lasted for over 9 hours since the last Failover condition.

If I manually delete those states, they don't come back, but it's a manual process that I would have to remember to do. From what I've found in these forums, this seems to be a known issue, and normally we have to wait until those states end on their own. Is there any way to automatically kill the states on the failover OPT1 gateway once the WAN gateway comes back online, so that no unnecessary data goes over the failover link?

Issue # 2 – DNS Queries Being Sent over OPT1 instead of WAN

Even after I manually kill the straggler states from the clients on my LAN, new states keep showing up on OPT1, because pfSense keeps sending DNS queries over both the OPT1 interface (failover) and the WAN interface (primary). pfSense directs DNS queries to the two DNS servers that I have specified under System/General Setup/DNS Servers for the "OPT1_DHCP" failover gateway, even though that gateway should only be used during a failover situation. Each query is never more than a couple hundred bytes sent or received, but there are many of them and they add up. Over the course of my 30-day billing period, most of my bandwidth would be used up by DNS queries that are being directed over the failover OPT1 interface instead of the primary WAN.
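
To get a feel for how quickly small DNS exchanges add up, here is a rough back-of-the-envelope sketch. The query rate and the bytes per exchange are made-up illustration values, not measurements from this setup, and `monthly_dns_bytes` is just a hypothetical helper name.

```python
def monthly_dns_bytes(queries_per_minute: float,
                      bytes_per_exchange: int,
                      days: int = 30) -> int:
    """Estimate total DNS bytes over a billing period.

    Both inputs are assumptions; plug in values observed under
    Diagnostics/States or in a packet capture.
    """
    minutes = days * 24 * 60
    return int(queries_per_minute * minutes * bytes_per_exchange)


if __name__ == "__main__":
    # Hypothetical: 20 query/response exchanges per minute, 200 bytes each.
    total = monthly_dns_bytes(queries_per_minute=20, bytes_per_exchange=200)
    print(f"{total / 1_000_000:.1f} MB over 30 days")
```

With those assumed numbers the total lands well above 100 MB per billing period, which is significant on a small LTE data plan.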

I tried a couple of things to resolve this DNS issue:

Resolution Attempt # 1 – Tried changing DNS Server Gateways to "none"

In System/General Setup/DNS Server Settings, I tried disassociating the individual "WAN_DHCP" and "OPT1_DHCP" gateways from the 4 DNS Servers by changing the Gateway option to "none". When I filter Diagnostics/States for the 4 DNS servers, I see that queries to all 4 DNS servers are being sent via WAN instead of OPT1, which is great. During a Failover state, however, none of the clients on the LAN can query DNS, and there are no states present for any of the 4 DNS servers. Under Status/Interfaces, all 4 DNS servers are listed under "WAN", but none are listed under "OPT1".

Resolution Attempt # 2 – Floating Firewall Rule to route port 53 traffic to Gateway Group

Instead of changing the DNS servers to use "none" as their Gateway, I tried adding a Floating Firewall Rule, as follows:
Action: Pass
Interface: None selected (does this mean all?)
Direction: out
Address Family: IPv4
Protocol: TCP/UDP
Source: any
Destination Port Range: From DNS(53) To DNS (53)
Advanced Options / Gateway: Failover (this is the name of my Failover Gateway Group)

Then when I check the States for this Floating Firewall Rule, I see that it is reporting DNS queries are only being sent to the 2 DNS servers that are defined for the WAN interface. DNS queries are no longer being sent via OPT1 to the 2 DNS servers that are defined for the OPT1 interface. During a Failover situation, the DNS queries are being sent via OPT1 to the 2 DNS servers that are defined for the OPT1 interface, but not the 2 DNS servers that are defined for the WAN interface. Yes!!! 8)

Resolution Attempt # 3 – Combining #1 and #2

I tried both #1 and #2 together (setting the gateway for each DNS Server to "none", and using the Floating Firewall Rule to direct port 53 traffic to the Gateway Group). This still caused the same problems that I experienced with Resolution Attempt #1, where clients couldn't query DNS during a Failover situation.

It seems that I can at least solve the DNS issue by using Resolution Attempt #2 above (the Floating Firewall rule). The questions I just can't figure out are:

1) Why is this Floating Firewall Rule even needed?
I would think that DNS queries by default should be directed to the Gateway that is currently active when it is part of a Gateway Group, but without the Floating Firewall Rule, the DNS traffic goes to both the primary WAN and the failover OPT1 interfaces, even though OPT1 is supposed to be Tier2 failover only.
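
The behavior expected from a tiered Gateway Group can be sketched as follows. This is only a simplified model of the expectation described above, not pfSense's actual implementation; the `Gateway` class and `pick_active` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Gateway:
    name: str
    tier: int   # lower number = higher priority
    up: bool    # result of gateway monitoring (assumed to be known)


def pick_active(group: List[Gateway]) -> Optional[Gateway]:
    """Return the lowest-tier gateway that is up, or None if all are down."""
    candidates = [g for g in group if g.up]
    return min(candidates, key=lambda g: g.tier, default=None)


if __name__ == "__main__":
    failover = [
        Gateway("WAN_DHCP", tier=1, up=True),
        Gateway("OPT1_DHCP", tier=2, up=True),
    ]
    active = pick_active(failover)
    print(active.name if active else "no gateway available")
```

Under this model, OPT1_DHCP would carry traffic only while WAN_DHCP is marked down, which is the opposite of what the state table shows for DNS without the floating rule.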

2) Why does this Floating Firewall Rule actually work?
When I enable the Firewall Rule to route all port 53 traffic to the gateway group, why does it send DNS queries to the DNS Servers associated with the WAN (8.8.8.8 and 208.67.220.220), but not to the DNS Servers associated with OPT1 (8.8.4.4 and 208.67.222.222), and the reverse during a failover state? If I temporarily disable this Firewall Rule, DNS queries begin going to both the WAN and OPT1 again, despite the primary WAN still being "up" and no Failover situation existing. Don't get me wrong, it's doing exactly what I want it to do. But I thought this firewall rule would have just caused DNS queries to be sent to all 4 DNS Servers, instead of just the 2 that are associated with whichever connection is currently active. Oddly, under Status/Interfaces, all 4 DNS servers are still listed under "WAN", with none listed under "OPT1", even during a Failover situation. Weird, huh?

3) Is there any hope for an automated way to kill all active states that are using the Failover OPT1 gateway after pfSense switches back to the primary WAN when it's back online?
Perhaps someone has already created a cron-driven bash script, or something similar, that would periodically check whether any active states exist over the OPT1 connection while the WAN connection is "up", and if so kill the OPT1 states?
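
A minimal sketch of what such a periodic check might look like is below. It is untested on pfSense and rests on several assumptions: that the state listing from `pfctl -ss` puts the interface name in the first column, that `pfctl -i <iface> -F states` flushes only the states bound to that interface, and that some other mechanism supplies the "primary is up" flag. The interface names `igb0`/`igb1`, the sample addresses, and all function names are hypothetical. The sketch only builds the command and leaves running it to the caller.

```python
from typing import List


def states_on_interface(pfctl_output: str, iface: str) -> List[str]:
    """Return the state lines whose first column equals iface.

    Assumes each line starts with the interface name, e.g.
    'igb1 udp 192.0.2.10:40000 -> 198.51.100.53:53 MULTIPLE:SINGLE'.
    """
    matches = []
    for line in pfctl_output.splitlines():
        fields = line.split()
        if fields and fields[0] == iface:
            matches.append(line.strip())
    return matches


def flush_command(iface: str) -> List[str]:
    """Build (but do not run) a pfctl command to flush states on iface."""
    return ["pfctl", "-i", iface, "-F", "states"]


def plan(primary_up: bool, pfctl_output: str, backup_iface: str) -> List[str]:
    """Return the command to run, or [] when nothing should be flushed."""
    if primary_up and states_on_interface(pfctl_output, backup_iface):
        return flush_command(backup_iface)
    return []


if __name__ == "__main__":
    # Made-up sample of a state listing; real output would come from pfctl.
    sample = (
        "igb0 tcp 192.0.2.20:51000 -> 203.0.113.5:443 ESTABLISHED:ESTABLISHED\n"
        "igb1 udp 192.0.2.10:40000 -> 198.51.100.53:53 MULTIPLE:SINGLE\n"
    )
    print(plan(primary_up=True, pfctl_output=sample, backup_iface="igb1"))
```

Keeping the decision logic separate from command execution makes it easy to dry-run the check before letting cron actually flush anything.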

Any help, ideas or clarification would be appreciated! Thanks!

-Len

eng1tx

@SoCalLen:

Is there any hope for an automated way to kill all active states that are using the Failover OPT1 gateway after pfSense switches back to the primary WAN when it's back online?
Perhaps someone has already created a cron-driven bash script, or something similar, that would periodically check whether any active states exist over the OPT1 connection while the WAN connection is "up", and if so kill the OPT1 states?

Any help, ideas or clarification would be appreciated! Thanks!

-Len

That seems to be the million-dollar question regarding failover. I, too, have been searching for a solution to this. Please see the last post here:
https://forum.pfsense.org/index.php?topic=93998.60

The gentleman above my post has created a script that he says will solve the issue. The only caveat for me is that, in all the years I have been using pfSense, I have never worked with scripts, so getting it installed is my challenge.

Hopefully this post helps.

t__2

I know this post is a bit stale, but I have the exact same setup: an LB1120 in bridge mode with Ting as a backup internet connection. I had the same problem with excess traffic on the backup WAN while pfSense was using our main WAN, and I am also using the DNS Resolver. I used your floating rule (thank you very much) and the excess traffic on the backup WAN dropped to 1/10th of what it was. I also have no idea why your floating rule works. I reduced the traffic a bit more by logging into the LB1120 web page, going to Settings, Mobile, APN, and changing the PDP setting from IPV4/IPV6 to IPV4. I still see small traffic spikes.

Looking at this in more depth today. I turned on logging for that floating rule and then filtered the logs by the source IP of the Netgear modem. What appears to be happening is that the Netgear modem is sending UDP packets to seemingly random IPs on port 53 (DNS) out our main WAN! I have no idea why that would even happen. Anyway, I looked at the IPs and used whois to find out where they are going. Most of them are going to IPs owned by Microsoft, some to Amazon, and the rest to other large US companies and to foreign companies.
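
For anyone wanting to repeat this, tallying the destinations before running whois can be sketched as below. It assumes the firewall log entries have already been parsed into (source IP, destination IP, destination port) tuples; that record layout, the function name, and the sample addresses are illustrative assumptions, not a pfSense log format.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (source IP, destination IP, destination port), assumed pre-parsed
Record = Tuple[str, str, int]


def top_dns_destinations(records: Iterable[Record],
                         source_ip: str,
                         limit: int = 10) -> List[Tuple[str, int]]:
    """Count port-53 destinations contacted by source_ip, most frequent first."""
    counts = Counter(
        dst for src, dst, dport in records
        if src == source_ip and dport == 53
    )
    return counts.most_common(limit)


if __name__ == "__main__":
    # Made-up records using documentation address ranges.
    sample = [
        ("192.0.2.1", "198.51.100.7", 53),
        ("192.0.2.1", "198.51.100.7", 53),
        ("192.0.2.1", "203.0.113.9", 53),
        ("192.0.2.1", "203.0.113.9", 443),   # not DNS, ignored
        ("192.0.2.50", "198.51.100.7", 53),  # different source, ignored
    ]
    for dst, hits in top_dns_destinations(sample, "192.0.2.1"):
        print(dst, hits)
```

The most frequent destinations are then the ones worth looking up with whois first.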

I also disabled the floating rule and did a packet capture of the higher traffic that results. I can see it still making DNS queries to large companies; that part did not change. However, the amount of data going back and forth on those queries is 10 to 100 times what it is when the floating rule is active.

Just my observations so far. No idea why this is happening. I am open to any suggestions on how to track down what is happening however.

LC

@t__2 said in Multi WAN Failover - DNS Queries and Open States Causing Traffic to Failover WAN:

Looking at this in more depth today. I turned on logging for that floating rule and then filtered the logs by the source IP of the Netgear modem. What appears to be happening is that the Netgear modem is sending UDP packets to seemingly random IPs on port 53 (DNS) out our main WAN! I have no idea why that would even happen. Anyway, I looked at the IPs and used whois to find out where they are going. Most of them are going to IPs owned by Microsoft, some to Amazon, and the rest to other large US companies and to foreign companies.
I also disabled the floating rule and did a packet capture of the higher traffic that results. I can see it still making DNS queries to large companies.

I recently hit the same issue on a brand new MR5200. Putting on my tinfoil hat here, it's probably some tracking code in the firmware; what for or why is anyone's guess.

https://community.netgear.com/t5/Mobile-Routers-Hotspots-Modems/Netgear-Nighthawk-M5-MR5200-WAN-issue/m-p/2175323/highlight/true#M20286