PfSense not recovering from WAN failure



  • Hi all

    I'm running the latetest pfSense on a PC Engines APU which is connected to a cable modem from UPC Cablecom in bridged mode (pfSense directly get's a dynamic public IP). Once in a while (way too often) the WAN fails. Already two times the WAN connection did not recover and a manual interaction was required. Manually renewing the DHCP Lease on the WAN interface is sufficient to bring WAN back to live. During the failure apinger considers the interface as down (which is logical).

    I suspect that the problem comes from a combination of the pfSense and the cable modem but I'm having difficulties to troubleshoot it or to find a work around. Has anyone else experienced similar problems? Is there a way to tell pfSense to "reset" the WAN interface if it goes down? Any ideas on how I can troubleshoot this? I'll gladly provide further information.

    Thanks in advance!



  • I had a similar issue. I inadvertently posted the solution in another thread, so pardon the double posts. Under Interfaces > WAN try these advanced options:

    Before this I would have to call the ISP and request a reboot of the modem after an outage if I was not on site renew the DHCP lease.



  • What IP, if any, does it have on WAN at the time? joako's guess is probably pretty likely, usually that's attributable to your cable modem handing out a private IP lease when it's lost upstream connectivity. If it has no IP at the time, your dhclient logs from Status>System logs, DHCP tab would be necessary to tell anything.



  • Thanks joako and cmb!  :D This provides an possible explanation of the issue.
    I haven't noticed that the Firewall would get a private IP from the cable modem. The issue hasn't appeared again for a while now (and I hope it will stay like that). If it happens again I will pay specific attention to the WAN IP the Firewall gets.



  • Hi all,

    It's happening again. Now I have more detailed information and logs. So hopefully someone here can help me troubleshoot.

    Here what happens

    • WAN connection shows problems and eventually goes down completely (confirmed from provider)

    • WAN comes back up after some time

    • pfSense still thinks WAN is "down" (monitoring shows 100% loss)

    • Manually releasing / renewing DHCP lease on the WAN interface brings the pfSense back online.

    Observations

    • pfSense has same public IP all the time (cable modem is configured in bridge mode). So the solution stated by joako seems not to apply unfortunately.

    • dpinger does log package loss but not the interface as being down (see logs below), but

    • Monitoring (Quality) shows 100% loss and,

    • the Gateway widget on the Dashboard shows the WAN as offline

    Questions
    Might there be something wrong with my configuration? Is there a way to have the pfSense recovering from such failures more gracefully and without the need of a manual interaction? A script which is executed and brings the interface up again?

    Logs
    Status > System Logs > System > Gateways

    
    Time	Process	PID	Message
    Jul 4 11:09:11 	dpinger 		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 77.xx.xx.1 bind_addr 77.xx.xy.129 identifier "WAN_DHCP "           #THIS IS AFTER DHCP RENEW (2\. time)
    Jul 4 11:09:08 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 11:09:08 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 11:09:07 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 11:09:07 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 11:09:06 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 11:09:06 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 11:09:05 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 11:09:05 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 11:09:04 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 11:09:04 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65     #THIS IS WHEN DHCP RELEASE WAS MANUALLY TRIGGERED (2.time)
    Jul 4 10:20:53 	dpinger 		WAN_DHCP 77.xx.xx.1: Alarm latency 8220us stddev 2679us loss 22%     #THIS IS WHEN THE ISSUE BEGAN (2\. time)
    Jul 4 08:58:01 	dpinger 		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 77.xx.xx.1 bind_addr 77.xx.xy.129 identifier "WAN_DHCP "       #THIS IS AFTER DHCP RENEW
    Jul 4 08:57:58 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 08:57:57 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 08:57:57 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 08:57:56 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 08:57:56 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 08:57:55 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 08:57:55 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65
    Jul 4 08:57:54 	dpinger 		WAN_DHCP 77.xx.xx.1: sendto error: 65     #THIS IS WHEN DHCP RELEASE WAS MANUALLY TRIGGERED
    Jul 4 08:22:44 	dpinger 		WAN_DHCP 77.xx.xx.1: Alarm latency 8248us stddev 2020us loss 22%     #THIS IS WHEN THE ISSUE BEGAN
    Jul 2 23:13:48 	dpinger 		WAN_DHCP 77.xx.xx.1: Clear latency 79384us stddev 269716us loss 11%
    
    

    Status > System Logs > DHCP (dhclient only)

    
    Time	Process	PID	Message
    Jul 4 11:09:09 	dhclient 	95303	bound to 77.xx.xy.129 -- renewal in 33465 seconds.
    Jul 4 11:09:09 	dhclient 		Creating resolv.conf
    Jul 4 11:09:09 	dhclient 		/sbin/route add default 77.xx.xx.1
    Jul 4 11:09:09 	dhclient 		Adding new routes to interface: re0
    Jul 4 11:09:09 	dhclient 		New Routers (re0): 77.xx.xx.1
    Jul 4 11:09:09 	dhclient 		New Broadcast Address (re0): 255.255.255.255
    Jul 4 11:09:09 	dhclient 		New Subnet Mask (re0): 255.255.240.0
    Jul 4 11:09:09 	dhclient 		New IP Address (re0): 77.xx.xy.129
    Jul 4 11:09:09 	dhclient 		ifconfig re0 inet 77.xx.xy.129 netmask 255.255.240.0 broadcast 255.255.255.255
    Jul 4 11:09:09 	dhclient 		Starting add_new_address()
    Jul 4 11:09:09 	dhclient 		REBOOT
    Jul 4 11:09:09 	dhclient 	95303	DHCPACK from 10.yz.yz.1
    Jul 4 11:09:09 	dhclient 	95303	DHCPREQUEST on re0 to 255.255.255.255 port 67
    Jul 4 11:09:09 	dhclient 		PREINIT
    Jul 4 11:09:02 	dhclient 	52332	exiting.
    Jul 4 11:09:02 	dhclient 	52332	connection closed
    Jul 4 08:57:58 	dhclient 	51480	bound to 77.xx.xy.129 -- renewal in 37400 seconds.
    Jul 4 08:57:58 	dhclient 		Creating resolv.conf
    Jul 4 08:57:58 	dhclient 		/sbin/route add default 77.xx.xx.1
    Jul 4 08:57:58 	dhclient 		Adding new routes to interface: re0
    Jul 4 08:57:58 	dhclient 		New Routers (re0): 77.xx.xx.1
    Jul 4 08:57:58 	dhclient 		New Broadcast Address (re0): 255.255.255.255
    Jul 4 08:57:58 	dhclient 		New Subnet Mask (re0): 255.255.240.0
    Jul 4 08:57:58 	dhclient 		New IP Address (re0): 77.xx.xy.129
    Jul 4 08:57:58 	dhclient 		ifconfig re0 inet 77.xx.xy.129 netmask 255.255.240.0 broadcast 255.255.255.255
    Jul 4 08:57:58 	dhclient 		Starting add_new_address()
    Jul 4 08:57:58 	dhclient 		REBOOT
    Jul 4 08:57:58 	dhclient 	51480	DHCPACK from 10.yz.yz.1
    Jul 4 08:57:58 	dhclient 	51480	DHCPREQUEST on re0 to 255.255.255.255 port 67
    Jul 4 08:57:58 	dhclient 		PREINIT
    Jul 4 08:57:53 	dhclient 	71337	exiting.
    Jul 4 08:57:53 	dhclient 	71337	connection closed
    Jul 4 05:44:39 	dhclient 		Creating resolv.conf
    Jul 4 05:44:39 	dhclient 		RENEW
    
    

    Status > System Logs > System > General (additional ARP messages removed)

    
    Jul 4 09:05:14 	kernel 		arp: 192.168.150.47 moved from f4:f9:51:d9:43:c9 to 70:73:cb:b4:c7:1f on re1_vlan2
    Jul 4 08:58:11 	radiusd 	35760	Ready to process requests.
    Jul 4 08:58:11 	radiusd 	35254	Loaded virtual server <default>Jul 4 08:58:10 	radiusd 	27655	Exiting normally.
    Jul 4 08:58:10 	radiusd 	27655	Signalled to terminate
    Jul 4 08:58:09 	radiusd 	27655	Ready to process requests.
    Jul 4 08:58:09 	radiusd 	27452	Loaded virtual server <default>Jul 4 08:58:09 	php-fpm 	59374	/rc.start_packages: Skipping STARTing packages process because previous/another instance is already running
    Jul 4 08:58:08 	radiusd 	35007	Exiting normally.
    Jul 4 08:58:08 	radiusd 	35007	Signalled to terminate
    Jul 4 08:58:08 	php-fpm 	41273	/rc.start_packages: Restarting/Starting all packages.
    Jul 4 08:58:08 	check_reload_status 		Starting packages
    Jul 4 08:58:08 	php-fpm 	59374	/rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 77.xx.xy.129 -> 77.xx.xy.129 - Restarting packages.
    Jul 4 08:58:07 	check_reload_status 		Starting packages
    Jul 4 08:58:07 	php-fpm 	41273	/rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - -> 10.0.8.1 - Restarting packages.
    Jul 4 08:58:07 	check_reload_status 		Reloading filter
    Jul 4 08:58:07 	php-fpm 	41273	/rc.newwanip: rc.newwanip called with empty interface.
    Jul 4 08:58:07 	php-fpm 	41273	/rc.newwanip: rc.newwanip: on (IP address: 10.0.8.1) (interface: []) (real interface: ovpns1).
    Jul 4 08:58:07 	php-fpm 	41273	/rc.newwanip: rc.newwanip: Info: starting on ovpns1.
    Jul 4 08:58:06 	check_reload_status 		rc.newwanip starting ovpns1
    Jul 4 08:58:05 	kernel 		ovpns1: link state changed to UP
    Jul 4 08:58:05 	php-fpm 	59374	/rc.newwanip: Creating rrd update script
    Jul 4 08:58:05 	check_reload_status 		Reloading filter
    Jul 4 08:58:05 	check_reload_status 		Reloading filter
    Jul 4 08:58:05 	kernel 		ovpns1: link state changed to DOWN
    Jul 4 08:58:05 	php-fpm 	59374	/rc.newwanip: Resyncing OpenVPN instances for interface WAN.
    Jul 4 08:58:04 	php-fpm 	59374	/rc.newwanip: phpDynDNS (): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry.
    Jul 4 08:58:01 	php-fpm 	59374	/rc.newwanip: ROUTING: setting default route to 77.xx.xx.1
    Jul 4 08:57:59 	php-fpm 	59374	/rc.newwanip: rc.newwanip: on (IP address: 77.xx.xy.129) (interface: WAN[wan]) (real interface: re0).
    Jul 4 08:57:59 	php-fpm 	59374	/rc.newwanip: rc.newwanip: Info: starting on re0.
    Jul 4 08:57:58 	check_reload_status 		rc.newwanip starting re0
    Jul 4 08:57:58 	kernel 		arpresolve: can't allocate llinfo for 77.xx.xx.1 on re0
    Jul 4 08:57:57 	kernel 		arpresolve: can't allocate llinfo for 77.xx.xx.1 on re0
    Jul 4 08:57:56 	kernel 		arpresolve: can't allocate llinfo for 77.xx.xx.1 on re0
    Jul 4 08:57:56 	kernel 		arpresolve: can't allocate llinfo for 77.xx.xx.1 on re0
    Jul 4 08:57:56 	kernel 		arpresolve: can't allocate llinfo for 77.xx.xx.1 on re0
    Jul 4 08:57:54 	kernel 		arpresolve: can't allocate llinfo for 77.xx.xx.1 on re0
    Jul 4 08:57:54 	kernel 		arpresolve: can't allocate llinfo for 77.xx.xx.1 on re0
    Jul 4 08:57:54 	kernel 		arpresolve: can't allocate llinfo for 77.xx.xx.1 on re0
    Jul 4 08:57:54 	kernel 		arpresolve: can't allocate llinfo for 77.xx.xx.1 on re0
    Jul 4 08:56:19 	kernel 		arp: 192.168.150.47 moved from 70:73:cb:b4:c7:1f to f4:f9:51:d9:43:c9 on re1_vlan2
    Jul 4 08:55:38 	check_reload_status 		Syncing firewall
    Jul 4 08:55:22 	check_reload_status 		Syncing firewall
    ....
    Jul 4 08:54:34 	php-fpm 	83885	/index.php: Successful login for user 'admin' from: 192.168.148.160
    Jul 4 08:53:19 	kernel 		arp: 192.168.150.47 moved from 70:73:cb:b4:c7:1f to f4:f9:51:d9:43:c9 on re1_vlan2
    .....
    Jul 4 08:23:19 	kernel 		arp: 192.168.150.47 moved from 70:73:cb:b4:c7:1f to f4:f9:51:d9:43:c9 on re1_vlan2
    Jul 4 08:22:46 	php-fpm 	83885	/rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN_DHCP.
    Jul 4 08:22:45 	php-fpm 	86536	/rc.dyndns.update: phpDynDNS (): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry.
    Jul 4 08:22:44 	check_reload_status 		Reloading filter
    Jul 4 08:22:44 	check_reload_status 		Restarting OpenVPN tunnels/interfaces
    Jul 4 08:22:44 	check_reload_status 		Restarting ipsec tunnels
    Jul 4 08:22:44 	check_reload_status 		updating dyndns WAN_DHCP
    Jul 4 08:21:58 	kernel 		arp: 192.168.150.47 moved from f4:f9:51:d9:43:c9 to 70:73:cb:b4:c7:1f on re1_vlan2</default></default> 
    


  • Bump for this. I have exact the same issue that was happened right after ISP remotely upgraded the modem firmware and rebooted.



  • I'm seeing this as well and would really like to fix it. dhclient seems to give up and stay with its current lease for over a day. That's not going to work for me. If I have to, I'll write a cron script to restart dhclient every 5 minutes when I can't ping google or similar. Or just all the time. I did try the timing options joako mentioned. Perhaps that will help, I added the modem IP to reject leases as well, but it's not getting one from there, it just seems like a good idea in general.



  • Still no solution?  I have SG-2440 attached to cable modem.
    I have similar problem reported by OP.

    Most of the time my connection comes back after 2-3 minutes.
    I have the 'honor' of being the last connection on the cable line.



  • I just got bit by this one myself. I've got it set to reject DHCP leases from the cable modem and when I came home I had no IPv4 address listed in the interface status. A quick renew fixed it but I would love to have something that retry DHCP lease every so often.



  • I may be having the same problem as FreeMinded
    Manually Saving Interfaces/WAN brings the connection back.
    My modem is part of a satellite service which in Australia means the connection (upstream of modem) is sometimes intermittent. (ie all too frequent need for manual intervention to restore service).
    Has anyone succeeded in working around this problem?

    I am on latest release:
    2.3.4-RELEASE (amd64)



  • I also have the same issue with UPC.
    But in my case the UPC DHCP server does not answer and pfSense then uses the recorded lease which does not work and just goes offline without trying to get a new lease. Is there a way to keep trying to get a new lease without a timeout?

    Even though the logs says that it is using the recorded IP, pfSense actually does not. In Status => Interfaces WAN Interface (wan, em0) the Status is up, but the interface has no IP.

    
    Jul 21 02:41:18 pfs dhclient[38369]: em0 link state up -> down
    Jul 21 02:41:19 pfs dhclient[32644]: connection closed
    Jul 21 02:41:19 pfs dhclient[32644]: exiting.
    Jul 21 02:41:57 pfs dhclient: PREINIT
    Jul 21 02:41:57 pfs dhclient[54351]: DHCPREQUEST on em0 to 255.255.255.255 port 67
    Jul 21 02:41:59 pfs dhclient[54351]: DHCPREQUEST on em0 to 255.255.255.255 port 67
    Jul 21 02:42:03 pfs dhclient[54351]: DHCPREQUEST on em0 to 255.255.255.255 port 67
    Jul 21 02:42:08 pfs dhclient[54351]: DHCPDISCOVER on em0 to 255.255.255.255 port 67 interval 2
    Jul 21 02:42:10 pfs dhclient[54351]: DHCPDISCOVER on em0 to 255.255.255.255 port 67 interval 3
    Jul 21 02:42:13 pfs dhclient[54351]: DHCPDISCOVER on em0 to 255.255.255.255 port 67 interval 8
    Jul 21 02:42:21 pfs dhclient[54351]: DHCPDISCOVER on em0 to 255.255.255.255 port 67 interval 8
    Jul 21 02:42:29 pfs dhclient[54351]: DHCPDISCOVER on em0 to 255.255.255.255 port 67 interval 15
    Jul 21 02:42:44 pfs dhclient[54351]: DHCPDISCOVER on em0 to 255.255.255.255 port 67 interval 9
    Jul 21 02:42:53 pfs dhclient[54351]: DHCPDISCOVER on em0 to 255.255.255.255 port 67 interval 13
    Jul 21 02:43:06 pfs dhclient[54351]: DHCPDISCOVER on em0 to 255.255.255.255 port 67 interval 3
    Jul 21 02:43:09 pfs dhclient[54351]: No DHCPOFFERS received.
    Jul 21 02:43:09 pfs dhclient[54351]: Trying recorded lease 84.112.2X4.XX
    Jul 21 02:43:09 pfs dhclient: TIMEOUT
    Jul 21 02:43:09 pfs dhclient: Starting add_new_address()
    Jul 21 02:43:09 pfs dhclient: ifconfig em0 inet 84.112.2X4.XX netmask 255.255.255.0 broadcast 255.255.255.255 
    Jul 21 02:43:09 pfs dhclient: New IP Address (em0): 84.112.2X4.XX
    Jul 21 02:43:09 pfs dhclient: New Subnet Mask (em0): 255.255.255.0
    Jul 21 02:43:09 pfs dhclient: New Broadcast Address (em0): 255.255.255.255
    Jul 21 02:43:09 pfs dhclient: New Routers (em0): 84.112.2X4.1
    Jul 21 02:43:10 pfs dhclient: New Routers (em0): 84.112.2X4.1
    Jul 21 02:43:11 pfs dhclient: Deleting old routes
    
    

    Also the protocol timing does not seem to work if the documentation is correct because it should retry every 5 minutes: https://www.freebsd.org/cgi/man.cgi?query=dhclient.conf&sektion=5#PROTOCOL_TIMING

    Thanks!



  • on actual isp outages pfsense has been fine for me (when it hasnt I believed it to be an isp issue).

    However yesterday I was testing failover by pulling the wan cable for about 30 seconds and then replacing it.

    After the test my ipv6 traffic was working fine, the wan ipv4 gateway continued to show 100% loss tho, and no action was taken by pfsense to recover the connection, so there was nothing in the logs for it to reinitialise the wan, was very odd.  Otherwise tho in normal use I havent had problems.

    I might play with the settings shown in this thread, I think wont work, but never know. :)



  • Hi I have the same or similar problem. My pfsense box keep dropping the wan ip e.g it set it to 0.0.0.0 and never gets out of this state until I reboot. I have two other devices connected to the same fiber switch and thy have nerver failed for the last 3 years (Dlink router and a alarm box).



  • @bunzilein:

    I also have the same issue with UPC.
    But in my case the UPC DHCP server does not answer and pfSense then uses the recorded lease which does not work and just goes offline without trying to get a new lease. Is there a way to keep trying to get a new lease without a timeout?

    Dear all,

    I have the same problem  :-[
    What i figured out is, if I unplug the WAN-cable from pfSense / cable-modem and plug it in again, than the gateway is „green“ and the Internet works again …

    Has someone solved the problem finally ?
    Is it a pfSense bug ?

    Thanks a lot in advance !!



  • Dear all,

    I have found a solution for my WAN failure problem with UPC and pfSense.

    Symptoms:
    WAN would randomly lose its connection and not recover.
    Status > System Logs > System > Gateways Log ist full of sendto error: 65
    Status > System Logs > System > DHCP Log shows:
    No DHCPOFFERS received.
    Trying recorded lease XXX.XXX.XXX.XXX
    Deleting old routes

    Explanation:
    The UPC router goes down for an unknown reason and might restart. (Not sure what the router is doing)
    When the links to the router comes back up PFS tries to get a lease, but by default times out after 60 seconds and then tries the recorded lease.
    That would in my case even work since I always have the same IP.
    Unfortunately at the end PFS deletes the old routes including the default route.
    This causes the sendto error: 65.
    You can check this by going to: Diagnostics > Routes and check if you still have a default route when WAN is down.
    In my opinion this is a bug.

    Solution:
    In my case it always takes about 3 minutes until the UPC dhcp server starts to respond again.
    So all you have to do is go to: Interfaces > WAN
    DHCP Client Configuration section
    check Options Advanced Configuration
    Set the Protocol timing Timeout higher.
    I chose 1800 which is 30 minutes, but I wanted to be on the safe side.

    Since then my WAN has recovered every time.
    I also attached a screenshot.

    ![Bildschirmfoto 2017-08-24 um 12.10.17.png](/public/imported_attachments/1/Bildschirmfoto 2017-08-24 um 12.10.17.png)
    ![Bildschirmfoto 2017-08-24 um 12.10.17.png_thumb](/public/imported_attachments/1/Bildschirmfoto 2017-08-24 um 12.10.17.png_thumb)



  • @bunzilein, thank you very much for sharing your experience and solution  :)

    Is there any possibility to inform the pf-Team on this issue to check whether it is a bug or not ?

    As newbie, may I ask you to explain or to share to me some link where I can study what each of the "Protocol timing" options mean and what the values are doing (causing), if you don't mind, please ?

    Thanks again a lot !!



  • The link is actually on the bottom of the DHCP client configuration section: https://www.freebsd.org/cgi/man.cgi?query=dhclient.conf&sektion=5#PROTOCOL_TIMING
    If this also solves your problem and you have the same behavior with default route being deleted then maybe it is a bug.
    Test it and if you find it to be a bug as well report it like this: https://doc.pfsense.org/index.php/Bug_reporting



  • thanks bunzilein will try your solution.

    I had an actual isp outage a couple of nights ago, the ipv6 came back up by itself but the ipv4 didnt, the unit just sat there not even showing activity in logs trying to reconnect.  When I manually disabled the WAN and reenabled it again it came back up fine.

    Also my isp is dhcp also, so is possibly perhaps a default settings issue.



  • @bunzilein, belated thanks for your posted tentative solution.
    I have set my porotocol timing timeout to 1800 seconds and will run with that from here on or until further indications arise.  Unfortunately it can be a month or more between the observed incidents so it will take a while to draw subjective conclusions.