DNS/DHCP stop working suddenly
-
And nowhere near full?
-
@stephenw10 Its in good shape
[23.09.1-RELEASE][admin@GAFW]/root: df -H Filesystem Size Used Avail Capacity Mounted on pfSense/ROOT/UseThis_Current 93G 1.8G 91G 2% / devfs 1.0k 0B 1.0k 0% /dev pfSense/tmp 91G 602k 91G 0% /tmp pfSense/var 91G 42M 91G 0% /var pfSense/home 91G 111k 91G 0% /home pfSense 91G 98k 91G 0% /pfSense pfSense/var/tmp 91G 492k 91G 0% /var/tmp pfSense/var/log 91G 87M 91G 0% /var/log pfSense/var/db 93G 1.3G 91G 1% /var/db pfSense/var/cache 91G 9.7M 91G 0% /var/cache pfSense/reservation 103G 98k 103G 0% /pfSense/reservation pfSense/ROOT/UseThis_Current/cf 91G 7.5M 91G 0% /cf pfSense/ROOT/UseThis_Current/var_cache_pkg 91G 233M 91G 0% /var/cache/pkg pfSense/ROOT/UseThis_Current/var_db_pkg 91G 11M 91G 0% /var/db/pkg tmpfs 4.2M 332k 3.9M 8% /var/run /lib 93G 1.8G 91G 2% /var/unbound/lib devfs 1.0k 0B 1.0k 0% /var/unbound/dev /var/log/pfblockerng 91G 87M 91G 0% /var/unbound/var/log/pfblockerng /usr/local/share/GeoIP 93G 1.8G 91G 2% /var/unbound/usr/local/share/GeoIP /usr/local/bin 93G 1.8G 91G 2% /var/unbound/usr/local/bin /usr/local/lib 93G 1.8G 91G 2% /var/unbound/usr/local/lib devfs 1.0k 0B 1.0k 0% /var/dhcpd/dev
-
Yup no problem there. Hmm, hard to say then without any logs to check from the time. It really looks like a disk issue though from those symptoms.
-
@stephenw10
It's possible. Seeing how all logs get pumped to my syslog collector is there something i should search for?
Its also possible that because im missing .bz2 logs as i showed above they probably didnt make it to syslog. -
@stephenw10
There was also something i noticed. So to recover i did a hard reboot. Unplug and plug.
I didnt do any checkouts because all my alerts cleared but later on in the evening the wife couldnt connect to captive portal which is weird because there is no captive portal enabled on that interface.
-- if you recall i had a CP issue about a week ago and i was tshooting :)So it seems that pfsense loaded a previous config and not the latest config. It was simple to load the recent config from back up and restore but why did an old config load?
-
It will revert to the last good config if the current config file is unreadable. That could also indicate a drive issue.
-
@stephenw10 OK now im nervous.
What are my options to either replace the 6100 or replace the SSD? -
This is the first time you've seen this issue? It might not be the disk, I would want to see something conclusive. If it is you could just replace the disk though.
-
@stephenw10
It happened again..
Got some more data that may or may not help.- Internet suddenly goes out so i go straight into console.
- Nothing on the console screen. Usual output for selecting options.
WAN port loses IP address
Mar 1 15:22:59 GAFW dhclient[38626]: No working leases in persistent database - sleeping. Mar 1 15:22:59 GAFW dhclient[30526]: FAIL Mar 1 15:23:00 GAFW dhclient[38626]: connection closed Mar 1 15:23:00 GAFW dhclient[38626]: exiting. Mar 1 15:23:01 GAFW dhclient[41673]: PREINIT Mar 1 15:23:01 GAFW dhclient[43008]: EXPIRE Mar 1 15:23:01 GAFW dhclient[44238]: Deleting old routes Mar 1 15:23:01 GAFW dhclient[45556]: PREINIT Mar 1 15:23:01 GAFW dhclient[41096]: DHCPDISCOVER on ix3 to 255.255.255.255 port 67 interval 2 Mar 1 15:23:03 GAFW dhclient[41096]: DHCPDISCOVER on ix3 to 255.255.255.255 port 67 interval 5 Mar 1 15:23:08 GAFW dhclient[41096]: DHCPDISCOVER on ix3 to 255.255.255.255 port 67 interval 7
CPU usage using top -aSH command.
last pid: 16707; load averages: 4.63, 3.67, 2.21 up 0+15:43:45 15:11:13 519 threads: 9 running, 472 sleeping, 38 waiting CPU: 2.0% user, 0.0% nice, 26.4% system, 0.0% interrupt, 71.7% idle Mem: 398M Active, 1576M Inact, 1347M Wired, 4465M Free ARC: 319M Total, 119M MFU, 187M MRU, 875K Anon, 2154K Header, 10M Other 261M Compressed, 759M Uncompressed, 2.91:1 Ratio Swap: 1024M Total, 1024M Free PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND 0 root -60 - 0B 2128K CPU3 3 13:49 99.92% [kernel{if_io_tqg_3}] 11 root 187 ki31 0B 64K RUN 2 872:54 95.83% [idle{idle: cpu2}] 11 root 187 ki31 0B 64K CPU1 1 867:56 95.08% [idle{idle: cpu1}] 11 root 187 ki31 0B 64K CPU0 0 878:02 93.18% [idle{idle: cpu0}] 60326 root 68 0 114M 51M accept 1 0:02 8.47% php-fpm: pool nginx (php-fpm) 81924 www 20 0 43M 23M kqread 1 12:44 1.15% /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -x /tmp/haproxy.socket -sf 465 20240 unbound 20 0 192M 144M kqread 0 0:01 0.60% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound} 42858 root 20 0 13M 3084K select 2 4:35 0.57% /usr/sbin/syslogd -O rfc3164 -s -c -c -l /var/dhcpd/var/run/log -l /tmp/haproxy_chroot/var/run/log -P /var/run/sy 20240 unbound 20 0 192M 144M kqread 1 0:01 0.53% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound} 63614 root 20 0 1196M 733M uwait 1 4:34 0.48% /usr/local/bin/suricata --netmap -D -c /usr/local/etc/suricata/suricata_12791_igc1/suricata.yaml --pidfile /var/r 0 root -64 - 0B 2128K - 1 1:21 0.28% [kernel{dummynet}] 0 root -12 - 0B 2128K - 1 0:41 0.23% [kernel{z_wr_iss_0}] 48900 root 20 0 13M 3520K bpf 2 2:43 0.22% /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid 0 root -12 - 0B 2128K - 1 0:41 0.21% [kernel{z_wr_iss_1}]
-
What is also strange is that i could not resolve anything local. My own self hosted services would time out..
C:\Users\michm>nslookup syslog.xxxxxxx Server: GAFW.xxxxxxx Address: 192.168.50.254 DNS request timed out. timeout was 2 seconds. DNS request timed out. timeout was 2 seconds. *** Request to GAFW.xxxxxxxxx timed-out
I went ahead and went for the reboot through console. It could not stop the Tailscale package. This forced me to pull the plug and plug it back in.
Netgate pfSense Plus will reboot. This may take a few minutes, depending on your hardware. Do you want to proceed? Y/y: Reboot normally R/r: Reroot (Stop processes, remount disks, re-run startup sequence) S: Reboot into Single User Mode (requires console access!) Enter: Abort Enter an option: r Netgate pfSense Plus is rerooting now. Stopping package arpwatch...done. Stopping package freeradius3...done. Stopping package lldpd...done. Stopping package WireGuard...done. Stopping package haproxy...done. Stopping package nut...done. Stopping package syslog-ng...done. Stopping package Tailscale... <----- stuck here and could not go past
I eventually lost access to the web configurator and got the nginx timed out message.
-
OK interesting. Not the SSD then, so that's good.
One CPU core seems to be stuck at 100%. We can only see it's interface io traffic so something attempting (and seemingly failing) to send or receive. From the fact tailscale could not be stopped I'd guess it's that somehow. Are you able to disable that as a test?
-
@stephenw10
I have uninstalled Tailscale.
I’m glad I got some more data on this so thanks for the assistance.If this is a bug - let’s say it might be for sake of conversation - I don’t have much data to prove this isn’t just environmental to my setup. Any other data points I can gather?
Losing a WAN ip address does make Tailscale chatty but i can’t find the link to why dns breaks for local resolution. -
It looks like whatever is using that CPU core is preventing anything sending/receiving traffic or possible opening states. Causing problem for numerous services.
-
@stephenw10
Interesting and i love that theory. Seems very logical.
Would explain why almost all functionality related to the firewall is broken - web-configurator , dns , dhcp...
Considering the 6100 has 4x cores, its not odd that a single core choking the firewall breaks connectivity?edit: So whatever is using that core is related to traffic? High throughput? This 6100 can for sure push well over 10G from my testing but looking at the monitoring i see pretty low WAN throughput considering.
Tailscale seems like the smoking gun but checking my historical syslog it seems that once ix3 (WAN) goes down and cant renew a dhcp address, tailscale becomes very verbose. As i discovered today i am unable to reboot the router because it gets stuck on the tailscale process. High level but the trigger is losing the WAN.
If we're stable this weekend, i will install tailscale again and unplug the wan..see if i can reproduce. -
@michmoor said in DNS/DHCP stop working suddenly:
Considering the 6100 has 4x cores, its not odd that a single core choking the firewall breaks connectivity?
It is I agree. Though I'm assuming it's choking something in the network stack so it doesn't matter than then other cores are still available.
-
@michmoor said in DNS/DHCP stop working suddenly:
I went ahead and went for the reboot through console. It could not stop the Tailscale package. This forced me to pull the plug and plug it back in.
what's the longest you waited for the system to shutdown gracefully when it appeared to be 'stuck" stopping Tailscale package? conservative estimate in minutes...
-
@cyberconsultants
maybe 5mins or so.
Reboots generally do not take long@stephenw10 so far since the TS package removal things have been stable since. I’ll add the package tomorrow and let you know how things go.
-
@michmoor five minutes doesn't seem reasonable given documented precaution:
https://docs.netgate.com/pfsense/en/latest/diagnostics/system-halt.html
...and what you seemed to be dealing with more specifically.
abrupt and repeated power losses over time might be closer in line with your 'root cause' than any particular package. i personally think it'd be advisable to make not cutting the power until you've waited at least an unreasonable period of time given the specific reboot context (sysem updates, maintenance, system crash, etc.) a more regular practice.
pfSense is a software firewall that doesn't load its config strictly out of NVRAM like typical consumer devices often do. despite filesystem and backup/recovery/snapshot improvements over time, OS corruption is still a real concern with abrupt loss of system power.
-
In reality I don't think I've ever seen a correctly functioning install take more than 5 mins to shutdown.
-
@stephenw10 said in DNS/DHCP stop working suddenly:
In reality I don't think I've ever seen a correctly functioning install take more than 5 mins to shutdown.
so you would pull the physical power after five minutes of
Stopping package Tailscale...
in some kind of unstable system state?