Randomly lost link on WAN interface

FrozenFiber

Hello again everyone

For a good week or so now, my WAN interface has been spontaneously going down and not coming back. It either works fine, works but constantly loses inbound packets, or doesn’t lose any packets but goes down and up again every five seconds. After checking the logs, I found the following sequence of entries constantly repeating:

Gateway log:

WAN_DHCP WAN Gateway IP: Alarm latency 414us stddev 32us loss 25% (only appears sometimes)

WAN219_DHCP WAN Gateway IP: sendto error: 65 (repeats itself two or three times per second for 2-3 seconds)

send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr WAN Gateway IP bind_addr IP in WAN network identifier "WAN219_DHCP "
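As a rough aid for reading the monitoring parameters above, this Python sketch works out how many probes fall into one averaging window and how many must be lost before the loss alarm is reached. It assumes loss is simply lost probes divided by probes sent within time_period, which is a simplification of what the gateway monitor really does. The function name is made up for illustration.

```python
from typing import Tuple


def loss_alarm_threshold(send_interval_ms: int,
                         time_period_ms: int,
                         loss_alarm_pct: float) -> Tuple[int, float]:
    """Return (probes per window, lost probes needed to reach the alarm).

    Simplifying assumption: loss = lost probes / probes sent in time_period.
    """
    probes = time_period_ms // send_interval_ms
    lost_needed = probes * loss_alarm_pct / 100
    return probes, lost_needed


# Values taken from the log line: send_interval 500ms, time_period 60000ms,
# loss_alarm 20%.
probes, lost_needed = loss_alarm_threshold(500, 60000, 20)
print(f"{probes} probes per window, alarm at {lost_needed:g} lost probes")
```

Under that assumption, the 25% loss in the alarm entry would correspond to about 30 of 120 probes going unanswered within one minute.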

General log:

kernel em0: link state changed to UP

check_reload_status Linkup starting em0

Some OpenVPN stuff

check_reload_status Linkup starting em0

kernel em0: link state changed to DOWN

I also sometimes saw this message repeated several times per second:

arpresolve: can't allocate llinfo for WAN Gateway IP on em0

After a week of looking through the pfSense forums and other parts of the internet and swapping out a lot of hardware and software, I am pretty much helpless and don’t know what to do at this point. It seems to be a configuration issue, as everything works fine on factory defaults, at least for some time. A hardware issue is rather unlikely: I have tested four different known-good cables and three different switches (the firewall WAN is connected to a switch), and I swapped the LAN and WAN interfaces to check whether any issues would occur when using the normal WAN interface as a LAN interface. The power supply also seems alright, as I checked the output voltage while connecting a dummy load with more than four times the power draw of the actual computer. A second power supply also produced the issue. Unfortunately, all of these tests came out negative: each piece of hardware either worked just fine or produced the same error.

On the software side, I have also tried almost everything I can possibly imagine. Besides the basic things like lowering MTU and MSS and disabling all hardware offloading, I have also reset all gateway settings, disabled the firewall completely, manually forced speed and duplex, increased the size of the state table and toggled do-not-fragment compatibility. Some things actually worked for a while, like reducing the MTU to 1000 bytes, but after a few hours the issue came back.

Interestingly enough, it seems like other devices might have something to do with it, as the issue occurs sooner when there is another device on the WAN side. When plugging the WAN port into an unmanaged switch without any other device connected, the interface stays up for a good ten minutes or so. As soon as I connect another device, say a laptop, the link physically goes down and comes back, a cycle repeating every five seconds or so. It’s like every other device randomly sends a kill signal to the WAN port, telling it to turn itself off.

The most confusing thing for me is that it does not always happen. On several occasions, the WAN interface worked perfectly fine for almost 20 hours and transported several hundred gigabytes of data and millions of packets before the issues started to arise. If I then play around with the settings a little, like resetting to factory defaults, the issue is gone. If I restore the configuration five minutes later, it starts again. If I reset to factory defaults again and wait a day or so, it also starts again.

The computer running pfSense itself is highly reliable from what I can tell (it’s been running for several weeks nonstop before without any issues).

Some more information about my specific configuration: The only package I have installed is iperf, which is not running most of the time. The firewall is usually idle, with at most 10 to 20 percent CPU load. Temperatures are well below 40 °C. The computer pfSense is running on has two Intel NICs, one i211 and one i219-V. The drivers are igb and em respectively. I use the i219 as the WAN interface. There are usually six to seven devices connected to the firewall, each in its own VLAN, via a managed switch.

So here’s a quick summary of what I have already tried with the result:

Reboot -> No effect
Disable and reenable Interface -> No effect
Reinstall pfSense -> Temporary fix
Reset to factory defaults -> Temporary fix
Swap LAN and WAN NICs -> Temporary fix
Disable VLANs -> No effect
Change WAN MTU -> Temporary fix
Change power supplies -> No effect
Turn off all hardware offloading -> No effect
Change WAN side switch -> No effect
Change LAN side switch -> No effect
Reset gateway settings -> No effect
Forced gateway to be always considered up -> No effect
Increased state table size -> No effect
Disable firewall -> No effect
Force Speed & Duplex -> No effect
Change network cables -> No effect
Toggled do not fragment compatibility -> No effect
Toggled state killing on gateway failure -> No effect
Toggled reset all state if WAN IP address changes -> No effect
Disabled NTP, DHCP, DNS resolver and all OpenVPN clients -> No effect
Wait for a new DHCP lease & IP on WAN -> No effect

At this point, I really don’t have the slightest idea how to solve this issue, and I would be thankful for some ideas! Thanks.

stephenw10

@FrozenFiber said in Randomly lost link on WAN interface:

sendto error: 65

That error indicates there is no route to the gateway:
https://docs.netgate.com/pfsense/en/latest/routing/gateway-monitoring-errors.html#sendto-error-65

Since with a DHCP connection the gateway is almost always the next hop, it usually implies the WAN IP address has been lost. What does the system log show at or just before you see that error in the gateway log?
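For reference, the number in a sendto error is a FreeBSD errno value, and 65 is EHOSTUNREACH ("No route to host"). Here is a minimal Python sketch for decoding such log lines. The table is hand-copied from FreeBSD's errno list and only covers a few network-related codes, and the helper name is made up for illustration. Python's own errno module is not used because it reports the numbers of whatever OS the script runs on, which differ on Linux.

```python
import re

# A few FreeBSD errno values (sys/errno.h). Not a complete list.
FREEBSD_ERRNO = {
    50: ("ENETDOWN", "Network is down"),
    51: ("ENETUNREACH", "Network is unreachable"),
    55: ("ENOBUFS", "No buffer space available"),
    64: ("EHOSTDOWN", "Host is down"),
    65: ("EHOSTUNREACH", "No route to host"),
}


def explain_sendto_error(log_line):
    """Return a readable description of a 'sendto error: N' log line, or None."""
    match = re.search(r"sendto error: (\d+)", log_line)
    if match is None:
        return None
    code = int(match.group(1))
    name, text = FREEBSD_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return f"{code} {name}: {text}"


print(explain_sendto_error("WAN219_DHCP WAN Gateway IP: sendto error: 65"))
```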

What is your WAN connection there exactly?

Steve

FrozenFiber

Thanks for your answer, Steve. I looked through my logs again and found a good sample of repeating log entries, all related to the issue (otherwise the log is quiet for hours). The IP address the DHCP server gave me is 192.168.100.50, and the gateway is at 192.168.100.100.

As for my WAN connection, it’s a simple 100Mb Ethernet link leading to my nearest unmanaged switch, which in turn connects directly to the gateway router, which connects to the internet. This creates a double NAT situation, which is anything but ideal, but it has worked fine with pfSense for months now.

error log.txt

stephenw10

Your WAN interface is actually losing link:

5/15/2020 19:13	php-fpm	12911		/rc.linkup: DEVD Ethernet detached event for wan
5/15/2020 19:13	check_reload_status			Linkup starting em0
5/15/2020 19:13	kernel			em0: link state changed to DOWN

Which, now that I check back, was in your first post!
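To see how often the link actually flaps, the kernel link state messages can be counted per interface. The following is a minimal Python sketch that assumes the log has been exported as text lines shaped like the excerpt above. The function name is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel messages such as "em0: link state changed to DOWN".
LINK_RE = re.compile(r"(\w+): link state changed to (UP|DOWN)")


def count_link_changes(lines):
    """Count link state changes, keyed by (interface, new state)."""
    counts = Counter()
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = [
    "5/15/2020 19:13\tphp-fpm\t12911\t\t/rc.linkup: DEVD Ethernet detached event for wan",
    "5/15/2020 19:13\tcheck_reload_status\t\t\tLinkup starting em0",
    "5/15/2020 19:13\tkernel\t\t\tem0: link state changed to DOWN",
]
for (iface, state), n in count_link_changes(sample).items():
    print(f"{iface} went {state} {n} time(s)")
```

Running it over a full export and comparing the counts before and after a change (a new cable, fixed speed, another switch) gives a more objective picture than watching the dashboard.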

It looks like you've swapped out enough things to eliminate a bad NIC, cable or switch port. It still looks most likely to be one of those things, though.

It's linked at 100M? You could try setting the link speed to a fixed 100M-FD, if the switch will allow it. That might eliminate any link negotiation issues. It should normally always be set to auto-negotiate though.

How far is it to the switch? Can you add another switch much closer as a test?

Steve

valdask

I think I have a very similar problem: the WAN randomly loses link.

I've put a switch in front of pfSense, and the issue seems to have disappeared. I'm using an i211 NIC and version 2.5.0.

I suspect this has something to do with the length of the cable to my ISP, as it's quite long. Maybe the signal is too weak, and having a switch regenerate it helps?

Or it could be something EEE related?

f.meunier

Hi all
I also experienced that kind of trouble (v2.5.0) last week:
*WAN interface is UP in dashboard
*gateway is up (connected to an "interconnect" dumb switch : if I connect a laptop it responds)
*gateway monitor tells me it's down
*ping from pfSense tells me it's down
*arp status tells me incomplete resolution for gateway

Had to reboot pfSense -> everything is OK

Note: pfSense is set up for HA, but the second firewall has been down (intentionally) for roughly a month. Maybe this has something to do with CARP or HA sync?

stephenw10

You can disable EEE on most NICs with a loader variable or sysctl. Worth trying if you think it might be that.

Steve

valdask

@stephenw10 Do you happen to know how to do this for an Intel NIC? I believe the move from FreeBSD 11 to 12 changed things, and I can't find any docs or guides on how to do it properly, as all the Google results are out of date.

stephenw10

What NIC is it?

valdask

@stephenw10 It should be I211 ( https://ark.intel.com/content/www/us/en/ark/products/64404/intel-ethernet-controller-i211-at.html )

$ pciconf -lv | grep -A1 -B3 network

igb0@pci0:1:0:0:        class=0x020000 card=0x00008086 chip=0x15398086 rev=0x03 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'I211 Gigabit Network Connection'
    class      = network
    subclass   = ethernet