CPU 100%, unbound and dhcpd restarting whenever the filter reloads
-
@Uglybrian said in CPU 100%, unbound and dhcpd restarting whenever the filter reloads:
I would try turning off watchdog. As I understand it, watchdog is used for developers not production.
Okay, I've uninstalled the Service Watchdog package. Re-testing to see if it changes the initial behavior. UPDATE: No change in behavior.
@Gertjan
The "Realtek is bad, don't use it" sentiment is understandable, but unhelpful. I've been using this hardware for 5 years without issue.
I am trying to find the root cause. And it's the NIC down-up events that seem to come AFTER the initial issue of a watchdog or some detection event that is triggering the down-up to happen.
Also, pfsense1.log.txt shows the same issue, but NOT showing the interfaces flapping. Just the newwanip script detecting something that makes it want to restart all packages.
-
@Uglybrian said in CPU 100%, unbound and dhcpd restarting whenever the filter reloads:
I would try turning off watchdog. As I understand it, watchdog is used for developers not production.
After uninstalling the Service Watchdog... the issue still happens, but while dhcpd comes back up... unbound stays off, and I need to manually restart the service.
Watchdog was NOT the root cause, but was working to recover.
-
@bmeeks said in CPU 100%, unbound and dhcpd restarting whenever the filter reloads:
@pfuser23984 said in CPU 100%, unbound and dhcpd restarting whenever the filter reloads:
Thanks. I installed realtek-re-kmod and rebooted, but no change in the behavior. It isn't random, it is being triggered. I'm trying to figure out the logic of what this "watchdog" is (it does not appear to be a "service" watchdog), and why it is timing out on the re1 LAN interface... which seems to be causing the cascade of hotplug event detections and link state changes.
pfsense2.log.txt:
Jan 23 07:16:15 kernel re1: watchdog timeout
It is also different behavior when this is triggered by
rc.newwanip starting ovpns2
(pfsense1.log.txt) when OpenVPN server is started. But similarly...
pfSense package system has detected an IP change or dynamic WAN reconnection - 10.0.10.1 -> 10.0.10.1 - Restarting packages.
is not right. Why would it decide to restart all packages because of an IP that HAS NOT changed?
I do not have any gateway monitoring actions enabled either.
Certain events detected by the pfSense subsystem automatically kick off scripts to adjust for the event. One of those is when the system believes the NIC has disconnected and then reconnected. The automatic assumption is the IP configuration is likely to have changed.
Your Realtek NIC is flapping, and that is causing pfSense to trigger its automatic scripts. You can't stop that. Instead, you must correct the flapping of the NIC interface.
The "watchdog" mentioned in the log message is not the Service Watchdog package someone referred to earlier. Instead, that is a built-in hardware thingy in the Realtek NIC. You must have a newer Realtek NIC that is not compatible with the FreeBSD version used in pfSense. You have a NIC driver compatibility issue, and the only solution is to change the NIC driver.
You can change the NIC driver most easily by replacing the Realtek NIC with something better supported like Intel. If you have space in the hardware, buy a cheap Intel NIC off Amazon or elsewhere and install that. Remove the Realtek NIC or disable it if it is an onboard option.
This still does not make any sense.
How can hardware be to blame for something that has been working for years?
I would understand if this was a new install problem. But it's not.
Also, again: the pfsense1.log.txt shows the same issue, but NOT showing the interfaces flapping until much later. Just the newwanip script detecting IP changes that makes it want to restart all packages.
Jan 23 17:46:25 kernel re1: link state changed to DOWN
Jan 23 17:46:25 kernel re1: watchdog timeout
Jan 23 17:46:25 check_reload_status 80356 Linkup starting re1
Jan 23 17:46:21 php-fpm 80230 /rc.newwanip: Removing static route for monitor 1.1.1.1 and adding a new route through x.x.x.x
Jan 23 17:46:18 php-fpm 80230 /rc.newwanip: rc.newwanip: on (IP address: 10.0.10.1) (interface: VPN[opt6]) (real interface: ovpns2).
Jan 23 17:46:18 php-fpm 80230 /rc.newwanip: rc.newwanip: Info: starting on ovpns2.
Jan 23 17:46:17 check_reload_status 80356 rc.newwanip starting ovpns2
Jan 23 17:46:17 kernel ovpns2: link state changed to UP
Jan 23 17:46:17 check_reload_status 80356 Reloading filter
Jan 23 17:46:17 php-fpm 80178 OpenVPN PID written: 59515
All I did was turn on the OpenVPN server..
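For anyone wanting to reproduce this comparison, here is a minimal sketch of a filter that pulls only the events discussed in this thread out of a syslog-style file. The function name `nic_events` is made up for illustration, and the default path `/var/log/system.log` is an assumption about where the firewall keeps its plain-text system log; pass a different file if yours lives elsewhere. The excerpts posted in this thread read newest-first, so the order on disk may be the reverse of what is shown here.

```sh
#!/bin/sh
# nic_events (hypothetical helper): print only the log lines relevant to
# this thread: NIC watchdog timeouts, link state changes, rc.newwanip
# runs and filter reloads.
# Usage: nic_events [logfile]
# The default path below is an assumption, not taken from the thread.
nic_events() {
    grep -E 'watchdog timeout|link state changed|rc\.newwanip|Reloading filter' \
        "${1:-/var/log/system.log}"
}
```

Running `nic_events pfsense1.log.txt` against a saved copy of the log would then show whether the filter reload and rc.newwanip lines carry earlier timestamps than the re1 watchdog line.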
Look at the timestamps. The newwanip stuff starts up seconds before the NIC link state changes. And it looks like the watchdog is CAUSING the link state changes to happen, not the other way around.
-
@pfuser23984 said in CPU 100%, unbound and dhcpd restarting whenever the filter reloads:
How can hardware be to blame for something that has been working for years?
pfSense upgrades the FreeBSD OS over time which could well include different drivers. Other than many comments here over the past 10 years or so about Realtek and FreeBSD I don't have much experience except on a CE install and that was fine for years unless we enabled Suricata Inline mode in which case (IIRC) port forwards would stop working after a time.
Jan 23 06:44:41 php-fpm 28016 /rc.newwanip: rc.newwanip: on (IP address: 10.0.10.1) (interface: VPN[opt6]) (real interface: ovpns2).
Jan 23 06:44:41 php-fpm 28016 /rc.newwanip: rc.newwanip: Info: starting on ovpns2.
...
Jan 23 06:44:41 php-cgi 20458 pfSsh.php: Configuration Change: (system): WAN to VPN
Jan 23 06:44:41 php-cgi 20458 pfSsh.php: New alert found: Enabling Rule - WAN to VPN for x.x.x.x on port 443
Jan 23 06:44:40 check_reload_status 28303 rc.newwanip starting ovpns2
Jan 23 06:44:40 kernel ovpns2: link state changed to UP
Is your VPN disconnecting/reconnecting? At least, at 6:44?
"rc.newwanip" is the name of a script and that name can be confusing because AFAIK it's run on any IP change (or add or subtract). Unbound, nginx, and other services may need to bind to the new IP so pfSense restarts many services. It might be clearer if it was renamed "rc.newipdetected" or similar.
"link state changed to DOWN" is the port detecting (reporting) no connection.
-
@SteveITS said in CPU 100%, unbound and dhcpd restarting whenever the filter reloads:
Is your VPN disconnecting/reconnecting? At least, at 6:44?
"rc.newwanip" is the name of a script and that name can be confusing because AFAIK it's run on any IP change (or add or subtract). Unbound, nginx, and other services may need to bind to the new IP so pfSense restarts many services. It might be clearer if it was renamed "rc.newipdetected" or similar.
"link state changed to DOWN" is the port detecting (reporting) no connection.
It is common for hammers to see only nails. I could tell this community desires to be very helpful, but for many, seeing anything to do with Realtek NICs may cause flashbacks and a deep focus on blaming them.
I am manually toggling the VPN. I have some automation that will do this, but it's just an example. That first log file was me using a pfSsh.php script to do it. But for this latest log post, I just tested it from the Services page in the UI.
But ANY change to the firewall configuration that causes a filter reload will usually trigger this cascade.
I've been on pfSense 2.7.2 since April or May 2024, no issues. This just started in December 2024.
I am going to check any packages or patches installed or updated around that time.
-
Jan 23 18:23:27 kernel re1.800: link state changed to DOWN
Jan 23 18:23:27 kernel re1: link state changed to DOWN
Jan 23 18:23:27 kernel re1: watchdog timeout
Jan 23 18:23:27 check_reload_status 79149 Linkup starting re1
Jan 23 18:23:20 check_reload_status 79149 Reloading filter
Jan 23 18:23:20 check_reload_status 79149 Syncing firewall
Jan 23 18:23:20 php-fpm 35087 /system_advanced_misc.php: Configuration Change: user@10.0.1.10 (RADIUS/FreeRADIUS): Miscellaneous Advanced Settings saved
Even benign changes to the configuration in the UI like,
Do NOT send Netgate Device ID with user agent
, trigger the reloading of the filter, which triggers the watchdog timeout.
My question is this...
If this were a Realtek-related issue, on Layer 2 or Layer 1... wouldn't I expect to see this happen randomly when I am not changing the firewall config or reloading the filter?
-
@pfuser23984
When you installed the latest kmod driver, did you follow the steps outlined in this post:
https://forum.netgate.com/topic/160529/realtek-nic-and-watchdog-timeout/13?
If not, the new driver is likely not being used. The current driver for pfSense 2.7.2 CE is, I believe, v1.98 instead of the v1.96 mentioned in the linked post.
Don't focus on the
rc.newwanip
script. That is getting triggered by your VPN interface coming up. As mentioned by another reply, anything that alters interface connectivity triggers that script so that all other firewall processes can be notified of a potential change (such as a newly configured interface, a change in IP address on an existing interface, deletion of an interface, etc.).
I found a number of Google search hits on Realtek NICs triggering their internal watchdog timeout when a VPN is brought online and is crossing the NIC. No, I don't have an immediate explanation for why it worked for 5 years as you say. But apparently you have either updated something on the firewall to cause this now or your NIC is glitching from some new hardware anomaly and heading towards failure.
Do you have more VLANs defined now than in the past on this port? Is the traffic across the NIC heavier than in the past (ISP speeds increased, more users, etc.)?
From the logs,
unbound
(the DNS Resolver) is not starting because it seems to already be running (or at least something is occupying the port it wants to use, as it is logging "port already in use"). That could be the result of the machine gun burst of "restart all packages" commands happening.
-
@pfuser23984 said in CPU 100%, unbound and dhcpd restarting whenever the filter reloads:
I've been on Pfsense 2.7.2 since April or May 2024, no issues. This just started in December 2024.
I am going to check any packages or patches installed or updated around that time.
If you can definitely pinpoint the start of the issue as December 2024, then certainly you need to start by looking at all changes made to the firewall since that date. You can examine the configuration changes by looking at the diffs of the automatic
config.xml
backups under DIAGNOSTICS.
Just don't automatically discount the NIC, though. As mentioned, the Realtek devices can work okay and then start to get flaky when traffic loads increase. Lots of Google search results detailing that.
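The same comparison can be sketched from a shell, assuming you have a saved backup from before the problem started. The helper name `config_diff` is hypothetical, and the default path `/cf/conf/config.xml` for the live configuration is an assumption about the install layout rather than something stated in this thread; pass both files explicitly if in doubt.

```sh
#!/bin/sh
# config_diff (hypothetical helper): show a unified diff between an older
# config.xml backup and a newer one.
# Usage: config_diff <older.xml> [newer.xml]
# The default for the second argument is an assumed location of the live
# configuration file.
config_diff() {
    older="$1"
    newer="${2:-/cf/conf/config.xml}"
    # diff exits with status 1 when the files differ; that is expected here.
    diff -u "$older" "$newer"
}
```

Lines prefixed with `-` exist only in the older file and lines prefixed with `+` only in the newer one, which makes package or setting changes made since a given backup easy to spot.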
-
@bmeeks said in CPU 100%, unbound and dhcpd restarting whenever the filter reloads:
When you installed the latest kmod driver, did you follow the steps outlined in this post:
https://forum.netgate.com/topic/160529/realtek-nic-and-watchdog-timeout/13?
SOMMOMMA!@##@!!
That did it.
I am used to Linux, where loading kernel drivers is easy to do and easy to verify. I did the install with
pkg install realtek-re-kmod
and rebooted... but the
echo 'if_re_load="YES"' >> /boot/loader.conf.local
was needed to load the new driver. Not really an intuitive process.
I ran through my tests, and the problem is gone now. I've even restored gateway monitoring, patches and watchdog. The rc.newwanip still does its thing, but the re1 NIC no longer flaps, the dhcpd / unbound services no longer crash, and the CPU no longer spikes, which had been making the system unusable until php-fpm was restarted.
Thank you so much.
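For anyone repeating this, note that running the `echo ... >>` command twice appends the line twice. A minimal sketch of an idempotent variant is below; the helper name `ensure_loader_line` is made up for illustration, and only the `if_re_load="YES"` line and the `/boot/loader.conf.local` path come from this thread.

```sh
#!/bin/sh
# ensure_loader_line (hypothetical helper): append a line to a loader
# config file only if that exact line is not already present.
# Usage: ensure_loader_line '<line>' [file]
ensure_loader_line() {
    line="$1"
    file="${2:-/boot/loader.conf.local}"
    # -x matches whole lines only, -F treats the pattern as a fixed string
    grep -qxF "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

Usage would be `ensure_loader_line 'if_re_load="YES"'` followed by a reboot. As a general FreeBSD expectation rather than something shown in this thread, `kldstat` should then list `if_re.ko` if the module was loaded at boot.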
-
@pfuser23984 said in CPU 100%, unbound and dhcpd restarting whenever the filter reloads:
That did it.
Thank you so much.
Glad that fixed it for you.