pfSense 2.7.2 interface Tx underrun, restarting...

dansci

Hey, I have a problem on pfSense 2.7.2: the LAN interface restarts every few minutes. It started today; before that I didn't have this problem. On the screen connected to pfSense I see this message:

stge0: Tx underrun, restarting...
stge0: Tx underrun, restarting...
stge0: Tx underrun, restarting...
stge0: Tx underrun, restarting...
stge0: Tx underrun, restarting...
stge0: too many errors; not reporting any more

I don't know which log files contain the entries above.
In dmesg I see this interface going UP and DOWN many times, but I don't know where else in the logs to look.

Could I ask for some help?

EDIT: when I'm able to connect to the web interface there are non-zero error counters for all VLAN interfaces (created under the stge0) in the Interfaces Statistics

stephenw10

What changed today that might have triggered this? If anything.

There are only a few tunables available for that driver, but check what they are set to using: sysctl dev.stge

Is there anything else in the system log when that happens?

Steve

dansci

After a night the network stopped working altogether. I rebooted pfSense and now I no longer see these messages:
stge0: Tx underrun, restarting...
but the network is still unstable. I also lose access to the web interface from time to time.

I have Mikrotik switches and did in fact change the dot1x settings for two ports. On pfSense I have a FreeRADIUS authentication server that handles dot1x. Maybe that is the cause? For the moment I have disabled those ports completely on the switch, but the LAN is still unstable and the Errors Out counters keep increasing.

Here is the result
[2.7.2-RELEASE][hidden.host.name]/root: sysctl dev.stge
dev.stge.0.rxint_dmawait: 30
dev.stge.0.rxint_nframe: 8
dev.stge.0.%parent: pci4
dev.stge.0.%pnpinfo: vendor=0x13f0 device=0x1023 subvendor=0x1043 subdevice=0x8180 class=0x020000
dev.stge.0.%location: slot=2 function=0 dbsf=pci0:5:2:0
dev.stge.0.%driver: stge
dev.stge.0.%desc: Sundance ST-1023 Gigabit Ethernet
dev.stge.%parent:

I am also attaching the system log. It is full of entries like "check_reload_status 430 - - Could not connect to /var/run/php-fpm.socket", which looks quite worrying.

system.log.txt

-- Daniel

stephenw10

Hmm, yeah those logs aren't good! Is that before the reboot?

I can only imagine that's a stuck process, which we have seen with check_reload_status in the past. Check the process list.

The output from that NIC could just be a symptom of something else using all the CPU time.

dansci

The log is from after the reboot, and it keeps filling up non-stop with entries like that.

I attached the output of ps -awx -l
I don't yet know how to interpret it.
processes.txt

EDIT:
When my network crashes, the session in putty will freeze for a while. I took advantage of this to see what processes are most resource intensive at the time using top -P.
As my network crashes now I have this state in putty:

So php-fpm seems to be a pretty heavy-duty process.

stephenw10

Yeah something is churning scripts there. The CPU time on check_reload_status is huge.

I'd say you probably have a link flapping. It's hard to see from that log because of all the check_reload_status messages, but there are a bunch of hotplug messages like:

<27>1 2024-03-27T13:48:18.118187+01:00 hidden.host.name php-fpm 5904 - - /rc.linkup: DEVD Ethernet attached event for opt5
<27>1 2024-03-27T13:48:18.118319+01:00 hidden.host.name php-fpm 5904 - - /rc.linkup: HOTPLUG: Triggering address refresh on opt5 (stge0.80)
<13>1 2024-03-27T13:48:18.118635+01:00 hidden.host.name check_reload_status 430 - - rc.newwanip starting stge0.80
<27>1 2024-03-27T13:48:18.130454+01:00 hidden.host.name php-fpm 5904 - - /rc.linkup: Hotplug event detected for VLAN_71_LAB_INTERNET(opt8) static IP address (4: 192.168.71.1)
<30>1 2024-03-27T13:48:18.142627+01:00 hidden.host.name lighttpd_pfb 21772 - - [pfBlockerNG] DNSBL Webserver started
<27>1 2024-03-27T13:48:18.147659+01:00 hidden.host.name php-fpm 5904 - - /rc.linkup: DEVD Ethernet attached event for opt8
<27>1 2024-03-27T13:48:18.147841+01:00 hidden.host.name php-fpm 5904 - - /rc.linkup: HOTPLUG: Triggering address refresh on opt8 (stge0.71)
<13>1 2024-03-27T13:48:18.148123+01:00 hidden.host.name check_reload_status 430 - - rc.newwanip starting stge0.71
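To make events like these easier to spot among the check_reload_status noise, something along the following lines could be run on a workstation against the saved system.log.txt. This is only a rough sketch: the function name tally_hotplug is made up for illustration, and the pattern is based solely on the "HOTPLUG: Triggering address refresh on optN (ifname)" lines quoted above, so other pfSense versions may word them differently.

```python
import re
from collections import Counter

# Matches the rc.linkup lines quoted above, for example:
# "... /rc.linkup: HOTPLUG: Triggering address refresh on opt5 (stge0.80)"
HOTPLUG_RE = re.compile(
    r"/rc\.linkup: HOTPLUG: Triggering address refresh on (\S+) \(([^)]+)\)"
)


def tally_hotplug(lines):
    """Count hotplug address refreshes per (pfSense name, OS interface).

    `lines` is any iterable of log lines, e.g. an open file object.
    Lines that do not match (including the check_reload_status noise)
    are simply ignored.
    """
    counts = Counter()
    for line in lines:
        match = HOTPLUG_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Passing an open handle on system.log.txt to tally_hotplug() and printing the result's most_common() should show which interfaces are being refreshed most often.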

Do you see any actual kernel level link state changes?

dansci

@stephenw10 said in pfSense 2.7.2 interface Tx underrun, restarting...:

Do you see any actual kernel level link state changes?

Not sure how to check it. Attached current dmesg output with a lot of link state changes. dmesg.txt

stephenw10

Yeah that's a LOT of link state changes. All the php load and check_reload_status messages are probably coming from that. Each link state change triggers restarting services.
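For anyone wanting to quantify that from a saved dmesg.txt, a minimal sketch follows. It assumes the usual FreeBSD kernel wording "<ifname>: link state changed to UP" or "... DOWN"; the function name count_link_changes is hypothetical, and the script is meant to be run offline on a copy of the file rather than on the firewall itself.

```python
import re
from collections import Counter

# Assumed kernel message format: "stge0: link state changed to DOWN"
LINK_RE = re.compile(r"(\S+): link state changed to (UP|DOWN)")


def count_link_changes(lines):
    """Count UP and DOWN transitions per interface name.

    Returns a Counter keyed by (interface, state). VLAN children such
    as "stge0.80" are counted separately from the parent "stge0".
    """
    counts = Counter()
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

A large DOWN count on the parent interface, mirrored on every VLAN under it, would be consistent with the physical link flapping rather than a problem on a single VLAN.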

So is that NIC, stge0, actually losing link? What is it connected to?

dansci

It is connected to one of my Mikrotik switches. Maybe I should reboot it too... :)

EDIT:
I checked the logs on the Mikrotik. stge0 is connected to ether24. You can see that the link has disconnected there, but it is not logged as often as on the pfSense side. Those warnings on sfp are also puzzling; I didn't have them before.

stephenw10

Yes, try that. But also check any logs the switch has to see if it's seeing the link loss too.

If it's really flapping you might be able to set both sides to a fixed speed to prevent it.

That is not a common NIC though. Swapping it out for something Intel based would almost certainly solve this.

stephenw10

Also odd that it links at 100M then 10M then 1G.

dansci

I replaced the NIC with a Realtek one. Now my interface is re1 instead of stge0. So I replaced every stge0 with re1 in my backup config.xml and followed this article to restore the config from USB: https://linuxconfig.org/restore-pfsense-configuration-backup-from-console-using-usb-drive

But for some reason it is still trying to configure my VLANs as stge0.xx

Do you know what else I should delete so that it takes the new config into account?

EDIT:
I set a trap for myself. I had copied the previous config to the root directory of the USB drive as config.xml. I didn't know about ECL (External Configuration Locator): with that flash drive plugged in, ECL reloaded the old config from the USB on every reboot, despite my changes.

Ok, now I have a machine with a realtek card instead of the previous one.
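The search-and-replace step described above can be sketched as follows. This is an offline text edit on a copy of config.xml, not an official pfSense procedure; the function name rename_interface is invented for illustration, and it assumes interface names appear in the file as plain text such as stge0 or stge0.80.

```python
import re


def rename_interface(config_text, old, new):
    """Replace whole-name occurrences of an interface in config.xml text.

    "stge0" and VLAN names like "stge0.80" are rewritten, but a longer
    name that merely starts with the same characters (e.g. "stge01")
    is left untouched.
    """
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(old) + r"(?![A-Za-z0-9_])"
    )
    # A function is used as the replacement so `new` is inserted literally.
    return pattern.sub(lambda _match: new, config_text)
```

The result is best written to a new file so the original backup stays intact, and any USB drive holding an older config.xml should be unplugged before rebooting so ECL does not restore it.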

dansci

Almost 1h after startup, the error counter is still at 0, the logs no longer contain those strange entries, and the network is running stably. It seems the problem has been solved.

Thank you very much for your support! I appreciate it!

dansci

After 17h, however, I see that the error counter is no longer zero. I will honestly admit that I never checked it before. Is it normal for errors to appear on the Tx side of an interface?

stephenw10

It's common to have a few errors especially if it was recently disconnected. That looks high though. Check the cable.