Randomly lost link on WAN interface



  • Hello again everyone

    For a good week or so now, I have been experiencing an issue where my WAN interface spontaneously goes down and does not come back. It either works fine, works but constantly loses inbound packets, or doesn’t lose any packets but goes down and up again every five seconds. After checking the logs, I found the following sequence of entries constantly repeating itself:

    Gateway log:

    WAN_DHCP WAN Gateway IP: Alarm latency 414us stddev 32us loss 25% (only appears sometimes)

    WAN219_DHCP WAN Gateway IP: sendto error: 65 (repeats itself two or three times per second for 2-3 seconds)

    send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr WAN Gateway IP bind_addr IP in WAN network identifier "WAN219_DHCP "
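
    To make the numbers above easier to read: the alarm line reports 25% loss against a loss_alarm of 20%, while the 414us latency is far below the 500ms latency_alarm. Here is a rough sketch of that comparison. The function names are made up for illustration, and the strict "greater than" test is an assumption, not something taken from the gateway monitor itself.

```python
# Illustrative sketch only: function names and the strict ">" comparison
# are assumptions, not taken from the gateway monitor's source.

def probes_per_period(time_period_ms, send_interval_ms):
    """Number of probes sent during one averaging period."""
    return time_period_ms // send_interval_ms


def alarm_reasons(latency_us, loss_pct, latency_alarm_ms=500, loss_alarm_pct=20):
    """Return which thresholds from the logged settings are exceeded."""
    reasons = []
    if latency_us > latency_alarm_ms * 1000:  # convert ms threshold to us
        reasons.append("latency")
    if loss_pct > loss_alarm_pct:
        reasons.append("loss")
    return reasons


# Values from the log lines above.
print(probes_per_period(60000, 500))
print(alarm_reasons(latency_us=414, loss_pct=25))
```

    With the logged settings, 120 probes go out per 60-second period, and only the loss threshold is exceeded, so the alarm looks like it is driven by dropped probes and not by slow replies.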

    General log:

    kernel em0: link state changed to UP

    check_reload_status Linkup starting em0

    Some OpenVPN stuff

    check_reload_status Linkup starting em0

    kernel em0: link state changed to DOWN

    I also sometimes saw this message repeated several times per second:

    arpresolve: can't allocate llinfo for WAN Gateway IP on em0

    After a week of looking through the pfSense forums and other parts of the internet and swapping out a lot of hardware and software, I am pretty much helpless and don’t know what to do at this point. It seems to be a configuration issue, as everything works fine on factory defaults, at least for some time. A hardware issue is rather unlikely: I have tested four different, known-good cables and three different switches (the firewall WAN is connected to a switch), and I swapped the LAN and WAN interfaces to check whether any issues would occur when using the usual WAN interface as a LAN interface. The power supply also seems alright, as I checked the output voltage while connecting a dummy load with more than four times the power draw of the actual computer. A second power supply also produced the issue. Unfortunately, all of these tests came out negative: either everything seems to work just fine, or the same error appears.

    On the software side, I have also tried almost everything I can possibly imagine. The basic things like lowering MTU and MSS and disabling all hardware offloading aside, I have also reset all gateway settings, disabled the firewall completely, manually forced speed and duplex, increased the size of the state table, and toggled do-not-fragment compatibility. Some things actually worked for some time, like reducing the MTU to 1000 B, but after a few hours the issue started occurring again.

    Interestingly enough, it seems like other devices might have something to do with it, as the issue occurs sooner when there is another device on the WAN side. When plugging the WAN port into an unmanaged switch without any other device connected, the interface stays up for a good ten minutes or so. As soon as I connect another device, say a laptop, the link physically goes down and comes back, a cycle repeating every five seconds or so. It’s like every other device randomly sends a kill signal to the WAN port, telling it to turn itself off.

    The most confusing thing for me is that it does not always happen. On several occasions, the WAN interface worked perfectly fine for almost 20 hours and transported several hundred gigabytes of data and millions of packets before the issues started to arise. If I then play around with the settings a little, like resetting to factory defaults, the issue is gone. If I restore the configuration five minutes later, it starts again. If I reset to factory defaults again and wait a day or so, it starts again.

    The computer running pfSense itself is highly reliable from what I can tell (it’s been running for several weeks nonstop before without any issues).

    Some more information about my specific configuration: The only package I have installed is iperf, which is not running most of the time. The firewall is usually running at idle, with at most 10 to 20 percent CPU load. Temperatures are also well below 40 °C. The computer pfSense is running on has two Intel NICs, one i211 and one i219-V. The drivers are igb and em respectively. I have used the i219 as the WAN interface. There are usually six to seven devices connected to the firewall, all in their own VLAN and via a managed switch.

    So here’s a quick summary of what I have already tried with the result:

    Reboot -> No effect
    Disable and reenable Interface -> No effect
    Reinstall pfSense -> Temporary fix
    Reset to factory defaults -> Temporary fix
    Swap LAN and WAN NICs -> Temporary fix
    Disable VLANs -> No effect
    Change WAN MTU -> Temporary fix
    Change power supplies -> No effect
    Turn off all hardware offloading -> No effect
    Change WAN side switch -> No effect
    Change LAN side switch -> No effect
    Reset gateway settings -> No effect
    Forced gateway to be always considered up -> No effect
    Increased state table size -> No effect
    Disable firewall -> No effect
    Force Speed & Duplex -> No effect
    Change network cables -> No effect
    Toggled do not fragment compatibility -> No effect
    Toggled state killing on gateway failure -> No effect
    Toggled reset all state if WAN IP address changes -> No effect
    Disabled NTP, DHCP, DNS resolver and all OpenVPN clients -> No effect
    Wait for a new DHCP lease & IP on WAN -> No effect

    At this point, I really don’t have the slightest idea on how to solve this issue and I would be thankful for some ideas! Thanks.


  • Netgate Administrator

    @FrozenFiber said in Randomly lost link on WAN interface:

    sendto error: 65

    That error indicates there is no route to the gateway:
    https://docs.netgate.com/pfsense/en/latest/routing/gateway-monitoring-errors.html#sendto-error-65

    Since with a DHCP connection the gateway is almost always the next hop, it usually implies the WAN IP address has been lost. What does the system log show at or just before you see that error in the gateway log?
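
    For reference, the number in that message is the errno returned by the sendto call, and pfSense runs on FreeBSD, where 65 is EHOSTUNREACH. A small sketch for decoding such lines follows. The table is a hand-copied subset of FreeBSD errno values, and the helper name is hypothetical; Python's own errno module is deliberately not used because it reflects whatever OS the script runs on.

```python
import re

# Hand-copied subset of FreeBSD errno values. Python's errno module is not
# used because its numbers depend on the OS the script runs on.
FREEBSD_ERRNO = {
    50: ("ENETDOWN", "Network is down"),
    51: ("ENETUNREACH", "Network is unreachable"),
    55: ("ENOBUFS", "No buffer space available"),
    64: ("EHOSTDOWN", "Host is down"),
    65: ("EHOSTUNREACH", "No route to host"),
}

SENDTO_RE = re.compile(r"sendto error: (\d+)")


def explain_sendto(line):
    """Hypothetical helper: decode a 'sendto error: N' gateway log line."""
    match = SENDTO_RE.search(line)
    if match is None:
        return None  # not a sendto error line
    code = int(match.group(1))
    name, text = FREEBSD_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return f"{name} ({text})"


print(explain_sendto("WAN219_DHCP WAN Gateway IP: sendto error: 65"))
```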

    What is your WAN connection there exactly?

    Steve



  • Thanks Steve for your answer. I looked through my logs again, and found a good sample of repeating log entries, all related to the issue (otherwise the log is quiet for hours). The IP the DHCP gave me is 192.168.100.50, the gateway is on 192.168.100.100.

    As for my WAN connection, it’s a simple 100Mb Ethernet link leading to my nearest unmanaged switch, which then in turn directly connects to the gateway router, which connects to the internet. This creates a double NAT situation, which is anything but ideal, but that has worked fine for months with pfSense by now.

    error log.txt


  • Netgate Administrator

    Your WAN interface is actually losing link:

    5/15/2020 19:13	php-fpm	12911		/rc.linkup: DEVD Ethernet detached event for wan
    5/15/2020 19:13	check_reload_status			Linkup starting em0
    5/15/2020 19:13	kernel			em0: link state changed to DOWN
    

    Which, now that I check back, was in your first post!
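
    If you want to see how often the link actually drops, something like this could count the kernel link events in an exported log. It is only a sketch: it assumes tab-separated lines like the ones quoted above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel messages such as "em0: link state changed to DOWN".
LINK_RE = re.compile(r"(\w+): link state changed to (UP|DOWN)")


def count_link_events(lines):
    """Count link state changes per (interface, state) pair."""
    counts = Counter()
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# The three lines quoted above, assuming tab-separated columns.
sample = [
    "5/15/2020 19:13\tphp-fpm\t12911\t\t/rc.linkup: DEVD Ethernet detached event for wan",
    "5/15/2020 19:13\tcheck_reload_status\t\t\tLinkup starting em0",
    "5/15/2020 19:13\tkernel\t\t\tem0: link state changed to DOWN",
]

for (iface, state), n in count_link_events(sample).items():
    print(iface, state, n)
```

    Running it over the whole log file and comparing the DOWN count with the timestamps should show whether the drops cluster around other devices joining the WAN-side switch.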

    It looks like you've swapped out enough things to eliminate a bad NIC, cable or switch port. It still looks most likely to be one of those things though.

    It's linked at 100M? You could try setting the link speed to 100M-FD fixed if the switch will allow it. That might eliminate any link negotiation issues. It should normally always be set to auto-negotiate though.

    How far is it to the switch? Can you add another switch much closer as a test?

    Steve

