PFsense WAN gateway randomly goes down
-
Re: PFsense random loss of WAN gateway
Hello!
Over the last couple of months especially, I have been experiencing random drops with my WAN. I have tested everything on the hardware side, and am pretty sure it is an issue relating to my ISP and not my hardware. The posts here seem to have a lot in common with my issue, and they may be possible solutions, but I wanted to elaborate and ask.
Network = Fiber (Metronet) > ISP modem (Motorola ONT) > Ethernet into Intel quad-port NIC / Dell R210 II pfSense box > Switch > PC or AP.
Basically, at random times every day or two, my home network gateway (modem) is lost and the pfSense device thinks there is no internet connection. If I reset either the modem -OR- the pfSense box, everything works fine. I can also fix the problem by manually unplugging the WAN Ethernet cable from the pfSense box and plugging it back in.
As a possible solution, I disabled gateway monitoring, but even with that disabled the internet went down a few minutes before this post. Once again, the solution was simply to restart the pfSense device.
It's not convenient to keep restarting the device, checking it myself, or physically unplugging cables or rebooting on site. I'd love to just run some sort of rule or command that restarts the WAN gateway/firewall every time it thinks it's down, while keeping gateway monitoring enabled.
One post talked about the option of using a cron job to reboot the pfSense device every day, but this could still leave the connection down for hours, and it drains the battery of portable devices that rely on the network as they keep trying to reconnect.
After reading on (mostly) Reddit and these posts:
- https://forum.netgate.com/topic/167206/gateway-drops-and-never-comes-back
- https://forum.netgate.com/topic/182286/supersede-dhcp-server-identifier-255-255-255-255-not-working
- https://forum.netgate.com/topic/174919/pfsense-random-loss-of-wan-gateway (@johnnyf1ve)
- https://www.reddit.com/r/PFSENSE/s/pzQ47coYzd
A kind Redditor (@dudeman2009) who replied to that post (link 3) seems to have come up with the reason this happens and a proper solution to the root cause. The OP in link 3 seems to have had almost exactly the same issue as me, and reported that the Redditor suggested this fix:
I quote:
"Turn gateway monitoring back on. Your issue is not with that. It's with Metronet DHCP relays not responding to unicast renewals, the logs just confirm my suspicions. Perform the following. Goto interfaces > WAN and under DHCP client configuration check the box "Advanced configuration" and under presets select FreeBSD default. Then further down under Lease requirements and requests in the box "Option Modifiers" enter the following supersede dhcp-server-identifier 255.255.255.255
This will ensure you are using the proper renewal timing requests and not an older saved config and also forces PFsense to request WAN lease renewal on broadcast. Since you are behind CGNAT you request your external IP from a DHCP relay that controls your assigned subnet and sends the request to their central server that manages entire groups of subnets. This is a requirement for CGNAT to work with the topology they have setup and they just don't know what they are doing to properly route a unicast request up to the DHCP servers, they just have really basic routing for broadcasts on port 67 I think."
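The mechanism described in the quote can be illustrated with a toy model. This is only a sketch of the general DHCP client behaviour (renewals go by unicast to the server identifier stored in the lease, while initial requests and rebinding go by broadcast), not the actual dhclient code. The function name and state labels are made up for illustration, and the server address 10.207.250.30 is taken from the logs posted later in this thread.

```python
BROADCAST = "255.255.255.255"

def request_destination(state, server_identifier, supersede=None):
    """Return where a DHCPREQUEST is sent in this simplified model.

    state             -- hypothetical label: "RENEWING", "REBINDING" or "INIT"
    server_identifier -- the server address recorded in the lease
    supersede         -- optional value that replaces the server identifier,
                         mimicking a 'supersede dhcp-server-identifier' modifier
    """
    # A superseded value replaces whatever the server put in the lease.
    effective_id = supersede if supersede is not None else server_identifier
    if state == "RENEWING":
        # Renewal is normally a unicast to the recorded server identifier.
        return effective_id
    # Initial requests and rebinding are sent to the broadcast address.
    return BROADCAST

# Without the modifier the renewal is unicast to the ISP's DHCP server.
print(request_destination("RENEWING", "10.207.250.30"))
# With the modifier the renewal is forced onto the broadcast address.
print(request_destination("RENEWING", "10.207.250.30", supersede=BROADCAST))
# Rebinding is a broadcast either way.
print(request_destination("REBINDING", "10.207.250.30"))
```

If the ISP's relays only handle broadcasts, as the quote suspects, only the second and third cases would get an answer.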
I am not a network engineer so this is a little tricky for me to understand, but I am curious about a couple of things.
First, will this fix work in my case as well?
Second, if so, can someone please help me understand, in somewhat simpler terms, what the problem is and why this fix works? I am genuinely curious why this seems to be a common issue for Metronet.
My most recent outage happened on 1/5/2024 around 0:00 and persisted until I manually rebooted the system around 0:08. After the reboot, sure enough, everything was back up and working just fine... until the next outage happens.
These are my DHCP logs filtered to only 'dhclient' entries, in case they help with diagnosis:
1/5/2024 0:03 dhclient 12666 DHCPREQUEST on igb0 to 255.255.255.255 port 67
1/5/2024 0:08 dhclient 72115 PREINIT
1/5/2024 0:08 dhclient 13303 DHCPREQUEST on igb0 to 255.255.255.255 port 67
1/5/2024 0:08 dhclient 13303 DHCPREQUEST on igb0 to 255.255.255.255 port 67
1/5/2024 0:08 dhclient 13303 DHCPACK from 100.110.160.2
1/5/2024 0:08 dhclient 93142 REBOOT
1/5/2024 0:08 dhclient 94049 Starting add_new_address()
1/5/2024 0:08 dhclient 94874 ifconfig igb0 inet 100.110.168.88 netmask 255.255.224.0 broadcast 100.110.191.255
1/5/2024 0:08 dhclient 96090 New IP Address (igb0): 100.110.168.88
1/5/2024 0:08 dhclient 96751 New Subnet Mask (igb0): 255.255.224.0
1/5/2024 0:08 dhclient 97747 New Broadcast Address (igb0): 100.110.191.255
1/5/2024 0:08 dhclient 98417 New Routers (igb0): 100.110.160.1
1/5/2024 0:08 dhclient 99153 Adding new routes to interface: igb0
1/5/2024 0:08 dhclient 99942 Creating resolv.conf
1/5/2024 0:08 dhclient 13303 bound to 100.110.168.88 -- renewal in 28203 seconds.
Thanks in advance!!
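As a side check on the CGNAT remark in the quoted advice, the addresses in the log above can be tested against the shared address space 100.64.0.0/10 that RFC 6598 reserves for carrier-grade NAT. The sketch below uses only Python's standard `ipaddress` module and the values from the log; the variable names are illustrative.

```python
import ipaddress

# Values copied from the dhclient log above.
wan = ipaddress.ip_interface("100.110.168.88/255.255.224.0")
router = ipaddress.ip_address("100.110.160.1")
ack_source = ipaddress.ip_address("100.110.160.2")

# Shared address space reserved for carrier-grade NAT (RFC 6598).
cgnat = ipaddress.ip_network("100.64.0.0/10")

net = wan.network
print("WAN subnet:        ", net)
print("Broadcast address: ", net.broadcast_address)
print("WAN IP in CGNAT:   ", wan.ip in cgnat)
print("Router on subnet:  ", router in net)
print("ACK source on net: ", ack_source in net)
```

The subnet and broadcast address it prints can be compared with the `ifconfig` line in the log.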
-
-
Do you have the logs from before you rebooted, when it was failing to renew? It should be pretty easy to see it sending to a unicast address and seeing no replies.
It's pretty easy to set those dhcp client values as a test though.
-
@stephenw10 Hi,
Sorry about that, I thought I pasted in the entire dhcpd.log file. Here is the dhcpd log filtered to 'dhclient' from before and after I manually rebooted the pfSense box. The reboot event timestamp was Jan 5 12:08:59 AM.
https://pastebin.com/7ZvCWqqn
I misspoke before. I have set up pfSense to monitor 9.9.9.9 in the gateway settings, but I have the gateway monitoring action disabled. Here are the entries from the gateway log from the same time period:
1/3/2024 12:53:54 AM pfSense dpinger[92293]: WAN_DHCP 9.9.9.9: sendto error: 65
1/3/2024 12:53:55 AM pfSense dpinger[92293]: WAN_DHCP 9.9.9.9: sendto error: 65
1/3/2024 12:53:55 AM pfSense dpinger[92293]: WAN_DHCP 9.9.9.9: sendto error: 65
1/3/2024 12:53:56 AM pfSense dpinger[92293]: WAN_DHCP 9.9.9.9: sendto error: 65
1/3/2024 12:53:56 AM pfSense dpinger[92293]: WAN_DHCP 9.9.9.9: sendto error: 65
1/3/2024 12:53:58 AM pfSense dpinger[92293]: exiting on signal 15
1/3/2024 12:53:58 AM pfSense dpinger[4411]: send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% alarm_hold 10000ms dest_addr 9.9.9.9 bind_addr 100.110.168.88 identifier "WAN_DHCP "
1/4/2024 11:20:30 PM pfSense dpinger[4411]: WAN_DHCP 9.9.9.9: Alarm latency 12603us stddev 911us loss 21%
1/4/2024 11:58:39 PM pfSense dpinger[4411]: exiting on signal 15
1/4/2024 11:58:39 PM pfSense dpinger[36474]: send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% alarm_hold 10000ms dest_addr 9.9.9.9 bind_addr 100.110.168.88 identifier "WAN_DHCP "
1/4/2024 11:58:41 PM pfSense dpinger[36474]: WAN_DHCP 9.9.9.9: Alarm latency 0us stddev 0us loss 100%
1/5/2024 12:09:00 AM pfSense dpinger[13286]: send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% alarm_hold 10000ms dest_addr 9.9.9.9 bind_addr 100.110.168.88 identifier "WAN_DHCP "
1/5/2024 12:09:01 AM pfSense dpinger[13286]: exiting on signal 15
1/5/2024 12:09:01 AM pfSense dpinger[48237]: send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% alarm_hold 10000ms dest_addr 9.9.9.9 bind_addr 100.110.168.88 identifier "WAN_DHCP "
-
Hmm, well you can still try those values but you can see it's already broadcasting for a DHCP server before you restarted it:
Jan 4 11:48:05 PM pfSense dhclient[12666]: DHCPREQUEST on igb0 to 255.255.255.255 port 67
Jan 5 12:03:56 AM pfSense dhclient[12666]: DHCPREQUEST on igb0 to 255.255.255.255 port 67
Jan 5 12:08:57 AM pfSense dhclient[72115]: PREINIT
Jan 5 12:08:57 AM pfSense dhclient[13303]: DHCPREQUEST on igb0 to 255.255.255.255 port 67
Jan 5 12:08:59 AM pfSense dhclient[13303]: DHCPREQUEST on igb0 to 255.255.255.255 port 67
Jan 5 12:08:59 AM pfSense dhclient[13303]: DHCPACK from 100.110.160.2
So what exactly changed there? I'd try to get a packet capture across that event and see what, if anything, is different in the successful request.
How long had it been down for before that?
You can see on other occasions it immediately gets an ACK as soon as it gives up trying unicast and broadcasts:
Jan 5 12:07:53 PM pfSense dhclient[13303]: DHCPREQUEST on igb0 to 10.207.250.30 port 67
Jan 5 12:24:28 PM pfSense dhclient[13303]: DHCPREQUEST on igb0 to 10.207.250.30 port 67
Jan 5 12:58:34 PM pfSense dhclient[13303]: DHCPREQUEST on igb0 to 10.207.250.30 port 67
Jan 5 1:38:18 PM pfSense dhclient[13303]: DHCPREQUEST on igb0 to 10.207.250.30 port 67
Jan 5 3:12:27 PM pfSense dhclient[13303]: DHCPREQUEST on igb0 to 255.255.255.255 port 67
Jan 5 3:12:27 PM pfSense dhclient[13303]: DHCPACK from 100.110.160.2
Jan 5 10:12:27 AM pfSense dhclient[36533]: RENEW
Jan 5 10:12:27 AM pfSense dhclient[37231]: Creating resolv.conf
Jan 5 3:12:27 PM pfSense dhclient[13303]: bound to 100.110.168.88 -- renewal in 31997 seconds.
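The pattern in that excerpt (several unanswered unicast requests, then an ACK right after the first broadcast) can also be pulled out of a longer log mechanically. The following is a minimal sketch, assuming the line format shown in this thread; the function name `summarize` and the regular expressions are illustrative and may need adjusting for other log formats.

```python
import re

# Request and ACK lines copied from the log excerpt above.
SAMPLE_LOG = """\
Jan 5 12:07:53 PM pfSense dhclient[13303]: DHCPREQUEST on igb0 to 10.207.250.30 port 67
Jan 5 12:24:28 PM pfSense dhclient[13303]: DHCPREQUEST on igb0 to 10.207.250.30 port 67
Jan 5 12:58:34 PM pfSense dhclient[13303]: DHCPREQUEST on igb0 to 10.207.250.30 port 67
Jan 5 1:38:18 PM pfSense dhclient[13303]: DHCPREQUEST on igb0 to 10.207.250.30 port 67
Jan 5 3:12:27 PM pfSense dhclient[13303]: DHCPREQUEST on igb0 to 255.255.255.255 port 67
Jan 5 3:12:27 PM pfSense dhclient[13303]: DHCPACK from 100.110.160.2
"""

REQUEST = re.compile(r"DHCPREQUEST on \S+ to (\S+) port 67")
ACK = re.compile(r"DHCPACK from (\S+)")

def summarize(log_text):
    """Return a list of (requests_before_ack, ack_source) tuples."""
    results = []
    pending = []  # request kinds seen since the last ACK
    for line in log_text.splitlines():
        req = REQUEST.search(line)
        ack = ACK.search(line)
        if req:
            dst = req.group(1)
            pending.append("broadcast" if dst == "255.255.255.255" else "unicast")
        elif ack:
            results.append((pending, ack.group(1)))
            pending = []
    return results

for requests, server in summarize(SAMPLE_LOG):
    print(f"ACK from {server} after {requests.count('unicast')} unicast "
          f"and {requests.count('broadcast')} broadcast request(s)")
```

Running it over the full dhclient log would show whether every ACK arrives only after a broadcast, which is what the unicast-renewal theory predicts.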
The other thing is that the renewal time is quite high, ~9hrs. That implies a lease of nearly 18hrs, since pfSense will try to renew the lease at 50% of the lease time.
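The lease arithmetic can be worked through for both "renewal in ... seconds" values seen in this thread. This is a rough sketch that assumes the RFC 2131 default timers: renewal (T1) at 50% of the lease and rebinding (T2, when the client falls back to broadcast) at 87.5%. A server can send different timer values, so the results are estimates only; the function name `lease_timers` is made up for illustration.

```python
def lease_timers(renewal_s, t1_fraction=0.5, t2_fraction=0.875):
    """Estimate lease length and rebinding time from the renewal time.

    Assumes the default RFC 2131 fractions; a server may override them.
    Returns (lease_seconds, rebind_seconds), both measured from binding.
    """
    lease_s = renewal_s / t1_fraction
    rebind_s = lease_s * t2_fraction
    return lease_s, rebind_s

# Renewal times reported by dhclient in the logs in this thread.
for renewal in (28203, 31997):
    lease_s, rebind_s = lease_timers(renewal)
    print(f"renewal in {renewal}s: lease ~{lease_s / 3600:.1f}h, "
          f"broadcast rebind ~{rebind_s / 3600:.1f}h after binding")
```

If unicast renewals never get a reply, the gap between the renewal and rebinding times is how long the client keeps trying unicast before it broadcasts.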
-
Thank you so much for providing this information. Ever since Xfinity did their infrastructure upgrade in my area, I have had intermittent connectivity on one of my WANs. Currently, I am running two WANs (both Xfinity) in a load-balance configuration. When I initially set this up in pfSense, everything was working fine. After the Xfinity upgrade, the non-default WAN would intermittently lose connectivity and show 100% packet loss. The weird thing is that it would only drop the non-default gateway; the default gateway was always up. If I swapped the default, the packet loss would follow whichever gateway was now non-default, so I knew this wasn't a hardware problem. For the past 4 months I have been trying numerous troubleshooting steps, including a complete reconfiguration of my pfSense setup from scratch, and nothing worked, at least not until I added 'supersede dhcp-server-identifier 255.255.255.255' under "Option Modifiers".
Thanks again, this saved my sanity. :)
-
-