DNS/DHCP stop working suddenly
-
It looks like whatever is using that CPU core is preventing anything sending/receiving traffic or possible opening states. Causing problem for numerous services.
-
@stephenw10
Interesting and i love that theory. Seems very logical.
Would explain why almost all functionality related to the firewall is broken - web-configurator , dns , dhcp...
Considering the 6100 has 4x cores, its not odd that a single core choking the firewall breaks connectivity?edit: So whatever is using that core is related to traffic? High throughput? This 6100 can for sure push well over 10G from my testing but looking at the monitoring i see pretty low WAN throughput considering.
Tailscale seems like the smoking gun but checking my historical syslog it seems that once ix3 (WAN) goes down and cant renew a dhcp address, tailscale becomes very verbose. As i discovered today i am unable to reboot the router because it gets stuck on the tailscale process. High level but the trigger is losing the WAN.
If we're stable this weekend, i will install tailscale again and unplug the wan..see if i can reproduce. -
@michmoor said in DNS/DHCP stop working suddenly:
Considering the 6100 has 4x cores, its not odd that a single core choking the firewall breaks connectivity?
It is I agree. Though I'm assuming it's choking something in the network stack so it doesn't matter than then other cores are still available.
-
@michmoor said in DNS/DHCP stop working suddenly:
I went ahead and went for the reboot through console. It could not stop the Tailscale package. This forced me to pull the plug and plug it back in.
what's the longest you waited for the system to shutdown gracefully when it appeared to be 'stuck" stopping Tailscale package? conservative estimate in minutes...
-
@cyberconsultants
maybe 5mins or so.
Reboots generally do not take long@stephenw10 so far since the TS package removal things have been stable since. I’ll add the package tomorrow and let you know how things go.
-
@michmoor five minutes doesn't seem reasonable given documented precaution:
https://docs.netgate.com/pfsense/en/latest/diagnostics/system-halt.html
...and what you seemed to be dealing with more specifically.
abrupt and repeated power losses over time might be closer in line with your 'root cause' than any particular package. i personally think it'd be advisable to make not cutting the power until you've waited at least an unreasonable period of time given the specific reboot context (sysem updates, maintenance, system crash, etc.) a more regular practice.
pfSense is a software firewall that doesn't load its config strictly out of NVRAM like typical consumer devices often do. despite filesystem and backup/recovery/snapshot improvements over time, OS corruption is still a real concern with abrupt loss of system power.
-
In reality I don't think I've ever seen a correctly functioning install take more than 5 mins to shutdown.
-
@stephenw10 said in DNS/DHCP stop working suddenly:
In reality I don't think I've ever seen a correctly functioning install take more than 5 mins to shutdown.
so you would pull the physical power after five minutes of
Stopping package Tailscale...
in some kind of unstable system state? -
If the console was hung at that point then yes. I might try hitting
ctl+t
first to see what it's actually waiting for. That won't allow it to continue but might give a clue as to why it failed. -
@stephenw10 said in DNS/DHCP stop working suddenly:
If the console was hung at that point then yes. I might try hitting
ctl+t
first to see what it's actually waiting for. That won't allow it to continue but might give a clue as to why it failed.lol
-
@stephenw10
Hey Stephen,
It happened again. I collected some data to help diagnoseI initially SSH'd to my firewall and was able to run a top -aSH.
last pid: 37748; load averages: 5.19, 5.30, 5.34 up 5+14:27:25 06:36:49 536 threads: 10 running, 488 sleeping, 38 waiting CPU: 1.1% user, 0.0% nice, 26.6% system, 0.1% interrupt, 72.3% idle Mem: 471M Active, 1981M Inact, 1400M Wired, 3937M Free ARC: 341M Total, 106M MFU, 214M MRU, 2972K Anon, 2640K Header, 14M Other 267M Compressed, 972M Uncompressed, 3.64:1 Ratio Swap: 1024M Total, 1024M Free PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND 0 root -60 - 0B 2160K CPU3 3 260:17 99.91% [kernel{if_io_tqg_3}] 11 root 187 ki31 0B 64K CPU0 0 127.5H 96.51% [idle{idle: cpu0}] 11 root 187 ki31 0B 64K CPU1 1 126.0H 95.49% [idle{idle: cpu1}] 11 root 187 ki31 0B 64K RUN 2 126.7H 95.47% [idle{idle: cpu2}] 98424 unbound 23 0 179M 139M kqread 1 0:06 3.65% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound} 98424 unbound 21 0 179M 139M kqread 2 0:05 2.39% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
I noticed that my WAN IP has changed. Being on ATT fiber it has remained the same for over two years but its DHCP so now its time to change.
All logging stopped at the same time
-rw------- 1 root wheel 572K Mar 7 02:29 filter.log.1.bz2 -rw------- 1 root wheel 534K Mar 7 02:35 filter.log.0.bz2 -rw------- 1 root wheel 214K Mar 7 02:37 resolver.log.1.bz2 -rw------- 1 root wheel 149K Mar 7 02:39 resolver.log.0.bz2 -rw------- 1 root wheel 2.9M Mar 7 02:39 auth.log -rw------- 1 root wheel 938K Mar 7 02:39 dhcpd.log -rw------- 1 root wheel 3.1M Mar 7 02:39 gateways.log -rw------- 1 root wheel 544K Mar 7 02:39 ntpd.log -rw------- 1 root wheel 2.6M Mar 7 02:39 nginx.log -rw------- 1 root wheel 2.8M Mar 7 02:39 routing.log -rw------- 1 root wheel 2.5M Mar 7 02:39 ipsec.log -rw------- 1 root wheel 2.5M Mar 7 02:39 openvpn.log -rw------- 1 root wheel 6.5M Mar 7 02:39 system.log -rw------- 1 root wheel 8.4M Mar 7 02:39 filter.log -rw------- 1 root wheel 426K Mar 7 02:39 resolver.log -rw------- 1 freeradius freeradius 879K Mar 7 02:40 radius.log -rw-r--r-- 1 root wheel 591B Mar 7 06:36 utx.lastlogin -rw------- 1 root wheel 22K Mar 7 06:36 utx.log
I still couldn't get to the GUI on the LAN side as it was unresponsive so from SSH I went ahead with the reboot. It was stalling on stopping certain packages so I did ctrl t
Netgate pfSense Plus is rebooting now. Stopping package arpwatch...done. Stopping package freeradius3...done. Stopping package lldpd...done. Stopping package WireGuard... load: 5.71 cmd: php_wg 65403 [nanslp] 10.49r 0.68u 0.07s 3% 53036k done. Stopping package haproxy...done. Stopping package nut...done. Stopping package syslog-ng...done. Stopping package softflowd...done. Stopping package suricata... Stopping package suricata... load: 5.59 cmd: php-cgi 56784 [nanslp] 35.07r 0.54u 0.04s 0% 53360k
Waiting over an hour...........yes an hour...pfsense never came back from the reboot and within that hour I lost access to the shell. So I went in through the console and the console was flooded with the following logs.
arpresolve: can't allocate llinfo for 192.168.1.254 on ix3
I couldn't access anything else via console. Just the flooding of the arpresolve log tied up anything. Even disconnecting the ix3 interface from the ATT modem didn't matter..arpresolve still kept flooding console.
Finally, a reboot fixed it. Back online..
-
@michmoor sounds like a simple gateway alarm/action triggering a cascade of headache. you might review configuration of the packages whose services you're not allowing to stop, likely compounding the kludge each time you do. it'll only get longer and kludgier if you keep "resolving" it the way you are.
Suricata initialization (both start and stop) is what's causing those
arpresolve
kernel notices. -
Do you have the service watchdog installed? That can cause problems with stopping services when incorrectly used. It should only really be used for debugging. You would see that logged though.
-
@stephenw10
When @cyberconsultants mentioned suricata i did immediately go to my service watchdog and I have the following enabled.This is a set up Ive had for years now. I don't mind removing these services from monitoring but i don't see how it prevented a reboot. Ive upgraded/rebooted many times with these enabled.
edit: I would also say that its likely something environmentally changed here but I have no idea what it could be. Short of adding a firewall rule, the config is static. Goes down for an upgrade every few months. So whatever is causing a cascade of headaches for me (love the term) its proving difficult to isolate.
-
Well the service watchdog should log if it restarts anything so I'd expect to see that in the system logs. Unless it's not logging anything of course.
It's an easy test to remove those though. Unbound is the only thing I could imagine being an issue though.
-
Very interesting diagnosing here, indeed and Steve is bad to the bone!
-
Not sure about that. Come back to me if/when we find the root cause.
-
@stephenw10 it's pretty obviously, based on OP's detailed posts and a fair bit of 'reading between the lines,' triggered gateway monitoring action following either transiest latency/loss of connecitivty or DHCP lease renewal.
the rest of it is just system tailspin—which in fairness would, in all likelihood, eventually recover given enough time to do so. BSD and pfSense be solid like that.
-
Yes, exactly I don't expect to just get hung up in the interface/network like that. You certainly can end up with a lot of script 'churn' when an interface drops out. Especially if there are other interfaces on it like VPNs etc. But it will stop after a few minutes unless the interface changes again.
-
@michmoor said in DNS/DHCP stop working suddenly:
@stephenw10
When @cyberconsultants mentioned suricata i did immediately go to my service watchdog and I have the following enabled.This is a set up Ive had for years now. I don't mind removing these services from monitoring but i don't see how it prevented a reboot. Ive upgraded/rebooted many times with these enabled.
edit: I would also say that its likely something environmentally changed here but I have no idea what it could be. Short of adding a firewall rule, the config is static. Goes down for an upgrade every few months. So whatever is causing a cascade of headaches for me (love the term) its proving difficult to isolate.
One of the issues with Service Watchdog is that the package is not super smart. It simply looks for the presence or absence of the monitored daemon, and sends it a start command if missing. Lots of things can result in normal restarts of certain daemons --
unbound
being one of those daemons. Consider this scenario:unbound
can be normally restarted by several things:-
If you have DHCP hostname registration enabled (most don't, but if you do) it will restart
unbound
each time a DHCP lease renews. But Service Watchdog is not aware of that. It will simply blindly seeunbound
not running at the instant it checks and then send it a restart command without knowing the service is already in the middle of restarting. -
A temporary issue with your WAN (packet loss, for example) might trigger a gateway alarm. If that deteriorates into a down condition, pfSense will issue a command to restart a bunch of processes including
unbound
. Again, Service Watchdog is ignorant of this process. It will seeunbound
missing from the list of running daemons when it spot checks, and thus issue a restart command. But pfSense itself is already restartingunbound
. That can lead to problems. -
If you run pfBlockerNG, it can also issue restart commands to
unbound
in some update scenarios. Service Watchdog is ignorant of this process as well, and will just blindly issue a restart command when it fails to see a runningunbound
daemon.
So, looking at the scenarios above you can see that it would be possible for bad things to happen if two
unbound
instances are trying to both start simultaneously. And whether or not this might occur would be random as it would depend on the exact timing differences between the natural restart process and when Service Watchdog sends it own independent restart signal.As @stephenw10 mentioned, Service Watchdog is really only intended for use when debugging a problem or working around some known issue. It is not a great solution for routine monitoring - especially monitoring of processes that can be restarted for legitimate reasons. This is why I preach to never use it with the IDS/IPS packages. Service Watchdog just does not consider if a service is legitimately restarting and thus the missing daemon instance is expected (and to be fair, it can't know this). It just says "no daemon present at this instant, so restart!".
-