pfSense became unresponsive, then no DNS resolution after reboot
-
I left my computer for 5 minutes and when I came back, there was no internet connection. pfSense did not respond to ping, I could not ssh into it, and the web interface timed out.
I had to press the power button on the Protectli hardware. I waited for it to shut down, then waited 30 seconds before pressing it again.
Eventually pfSense came back: it responds to pings, and ssh and the webUI work.
I immediately started searching on the internet using Brave search, and it resolved brave and reddit, but then nothing else.
In pfSense there is only 1 DNS server configured (besides 127.0.0.1), which is a locally hosted Pi-hole on a separate machine, running in a docker container. This Pi-hole uses unbound (also a docker container) as its upstream DNS server. In Pi-hole's logs I can see that it responded to the domain queries with SERVFAIL. I did not have much time to troubleshoot this as we needed internet, so I just restarted both the Pi-hole and unbound containers at the same time, and this solved the DNS issue. But I find it strange that after pfSense reboots, Pi-hole/unbound on another machine stop serving DNS.
What logs should I look at and how in pfSense? The top priority for me is to figure out why it stopped working in the first place. (There is still plenty of free disk space on it.)
It is maybe worth mentioning that this is the first time it did this, and it is a relatively fresh install, only 2 months old. Also yesterday was the first time we experienced a power outage, so all hardware was stopped abruptly, but then everything worked again after power came back.
Thank you.
-
Do you see anything logged in pfSense in the run up to the outage?
Are local clients set up to use the pihole directly?
I'd expect if pfSense stopped responding then the other devices behind it might lose dhcp leases for example. Or lose a route perhaps? Hard to say at this point but I'd check the logs on those hosts at the time too.
-
@stephenw10 I can't see anything outstanding in the logs. But I might be missing something. Maybe I should dig into logs via the command line and not just look at the GUI.
However, I found something that might be the cause. I am using HAProxy and I recently enabled the stats logging. Using ps aux I could see that HAProxy was consuming a lot of memory. After some time I checked again and its memory consumption had increased. So at this point my theory is that it caused an out of memory error. I disabled stats and since then the memory usage has stayed at the same level. Fingers crossed.
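A rough way to confirm that kind of growth is to sample the process's resident memory at intervals rather than eyeballing ps aux. The sketch below is only an illustration: the function names sum_rss and sample_rss are made up, and it assumes the process appears under the command name haproxy and that ps reports RSS in KiB.

```shell
#!/bin/sh
# Sum the resident set size of every process whose command name is $1.
# Reads "rss command" pairs on stdin so it can be checked without a live system.
sum_rss() {
    awk -v name="$1" '$2 == name { total += $1 } END { print total + 0 }'
}

# Take one sample from the running system (POSIX ps options).
sample_rss() {
    ps -A -o rss= -o comm= | sum_rss "$1"
}
```

Running sample_rss haproxy once a minute and appending the result with a timestamp to a file would give a trend to compare against the memory graph.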
BTW, all (known) devices on the network have static IPs. pfSense hands out pihole's IP to clients and I can confirm they are using pihole directly. Now that I think about it, the DNS resolution issue is even more mysterious, as pfSense should not be involved in DNS lookups.
-
Check the graphs in Status > Monitoring. If there was memory exhaustion it should have been recorded there.
-
@Sherwatt said in pfSense became unresponsive, then no DNS resolution after reboot:
was the first time we experienced a power outage
Next time you boot pfSense, watch the boot process from the console (the serial access with the small wire). You'll know right away if there is a problem.
Also : when pfSense doesn't seem to react : connect to serial (console) interface first.
Resetting or ripping out the power is like a Russian roulette "head shot".
SSH access is the next best, but it needs 'interfaces' to work. Not being able to ssh in is already a 'bad' sign by itself. See it like this : nearly every device on the planet depends on SSH, and it's pretty rock solid. SSH not working is a big red flag. It could be as simple as the "Login protection" has excluded you after several login (password) errors, but you better be sure right away = try logging in from another device.
Since pfSense normally handles DHCP, a working DHCP lease is also a good sign that some parts are still running. But if all your devices use static IP settings, you miss this check: run ipconfig /all on your PC, or check whether your device re-obtained a DHCP lease after disconnecting it for a short time.
When you install pfSense packages like HAProxy, it becomes important to check the system resources regularly. After all, when RAM fills up, pfSense can start swapping, and that's something you really do not want to happen, as the system might pick a process (typically the one using the most RAM) and kill it. This will most probably have an impact, as every process is essential. This "killing" will get signaled in the system log.
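When the FreeBSD kernel kills a process this way, the log line contains "was killed:" followed by the reason. A minimal sketch for searching for it, assuming the plain-text system log at /var/log/system.log; find_kills is just an illustrative helper name:

```shell
#!/bin/sh
# Print log lines where the kernel reports killing a process.
# Reads the files given as arguments, or stdin if none are given.
# Exits non-zero when nothing matches (standard grep behaviour).
find_kills() {
    grep -h 'was killed:' "$@"
}
```

For example, find_kills /var/log/system.log. No output means no such kill was recorded in that file; rotated copies, if present, would need to be searched separately.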
And yeah, a UPS can pay for itself without you knowing about it ;)
-
I'm just checking the Monitoring graph and memory consumption seems normal, nothing outstanding there.
However, the States started increasing 10 days ago. I am not even sure I understand what States are, but I guess I need to see what I changed 10 days ago and see if it is related.
EDIT: that big spike is NOT when the issue happened, that spike is actually 24 hours before that. So maybe it is not even States that caused it.
Thanks for the tips @Gertjan, I will try to remember to use the serial access first. But it also depends how quickly the household needs internet as two people are working from home here.
And yes, after this outage I definitely want to buy a UPS, I just need to do some research, because I have never used one.
-
Yeah, I doubt it's a states problem. 4000 states really isn't that much. Odd that it spiked like that though. Do you have any sort of content sharing applications running? BitTorrent creates a lot of states, for example.
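To see what makes up the state table, one option is to group the output of pfctl -ss by protocol. This is only a sketch: it assumes each state line starts with the interface followed by the protocol, and states_by_proto is an illustrative name.

```shell
#!/bin/sh
# Count state-table entries per protocol.
# Expects `pfctl -ss` style lines on stdin: "<iface> <proto> <src> -> <dst> <state>".
states_by_proto() {
    awk 'NF >= 2 { count[$2]++ } END { for (p in count) print p, count[p] }' | sort
}
```

Used as pfctl -ss | states_by_proto. A large udp count alongside a torrent client would be consistent with the spike.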
-
@stephenw10 Yes, I am running qBittorrent in a container.
Can I run some kind of error checking and fixing command on pfSense to look for potentially corrupted files on the disk? Maybe the outage caused corruption somewhere on the filesystem that is rarely accessed, but when it fails, the whole system crashes.
-
Is it UFS or ZFS?
-
@stephenw10 I asked ChatGPT the same question as in my previous post and after a short chat it turns out it is ZFS:
$ zpool status -v
  pool: pfSense
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        pfSense     ONLINE       0     0     0
          mmcsd0p4  ONLINE       0     0     0

errors: No known data errors
Is there anything else I could use to retroactively diagnose the problem? I already fed the boot log to ChatGPT to look for errors, but it didn't find anything scary. Should I share it with you and if yes, is pasting it in a post acceptable?
-
Then you can run a zfs pool scrub:
zpool scrub pfSense
https://docs.netgate.com/pfsense/en/latest/troubleshooting/filesystem-check.html
You can upload the logs here and I can look at them:
https://nc.netgate.com/nextcloud/s/zgpTGfKio3Fa5eb
-
@stephenw10 Thank you. I uploaded boot.txt.
[2.7.2-RELEASE][admin@pfSense.lan.mydomain.com]/root: zpool scrub pfSense
[2.7.2-RELEASE][admin@pfSense.lan.mydomain.com]/root: zpool status
  pool: pfSense
 state: ONLINE
  scan: scrub repaired 0B in 00:00:10 with 0 errors on Wed Mar 19 15:11:17 2025
config:

        NAME        STATE     READ WRITE CKSUM
        pfSense     ONLINE       0     0     0
          mmcsd0p4  ONLINE       0     0     0

errors: No known data errors
-
That's just the boot log from after the outage happened.
We need to see the system log covering the event. So from at least some hours before, up to and including the reboot.
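One way to cut out just that window before uploading is a small filter on the timestamp fields. This is a sketch, assuming the default plain-text syslog line format (month, day, HH:MM:SS, host, message) and a window that stays within a single day; log_window is a made-up helper name.

```shell
#!/bin/sh
# Print syslog lines for one day between two times (inclusive).
# $1 = month abbreviation, $2 = day of month, $3 = start HH:MM:SS, $4 = end HH:MM:SS
# Times are compared as strings, which works for zero-padded HH:MM:SS.
log_window() {
    awk -v mon="$1" -v day="$2" -v start="$3" -v end="$4" \
        '$1 == mon && $2 == day && $3 >= start && $3 <= end'
}
```

For example: log_window Mar 18 14:00:00 18:00:00 < /var/log/system.log > system-event.txt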
You should disable the on-board audio device though. It just uses resources and does nothing in pfSense.
hdacc0: <Intel Jasper Lake HDA CODEC> at cad 2 on hdac0
hdaa0: <Intel Jasper Lake Audio Function Group> at nid 1 on hdacc0
-
@stephenw10 Thank you for looking into my issue. I uploaded system.log twice, because I messed up the first one. I guess this is what I should be looking at, right? (from /var/log).
I think the issue happened around 17:45 (March 18). I left my computer around 17:40 and when I came back pfSense was dead.
I should disable the audio device in UEFI, right?
-
@Sherwatt said in pfSense became unresponsive, then no DNS resolution after reboot:
I should disable the audio device in UEFI, right?
Yup, somewhere in the EFI/BIOS setup you should be able to disable it completely.
-
Mmm, nothing really shown in the logs at all:
Mar 18 17:17:00 pfSense sshguard[62427]: Now monitoring attacks.
Mar 18 17:26:00 pfSense sshguard[62427]: Exiting on signal.
Mar 18 17:26:00 pfSense sshguard[44994]: Now monitoring attacks.
Mar 18 17:35:00 pfSense sshguard[44994]: Exiting on signal.
Mar 18 17:35:00 pfSense sshguard[31294]: Now monitoring attacks.
Mar 18 17:44:00 pfSense sshguard[31294]: Exiting on signal.
Mar 18 17:44:00 pfSense sshguard[11995]: Now monitoring attacks.
Mar 18 17:47:09 pfSense syslogd: exiting on signal 15
Mar 18 17:48:38 pfSense syslogd: kernel boot file is /boot/kernel/kernel
Mar 18 17:48:38 pfSense kernel: ---<<BOOT>>---
Mar 18 17:48:38 pfSense kernel: Copyright (c) 1992-2023 The FreeBSD Project.
Mar 18 17:48:38 pfSense kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Mar 18 17:48:38 pfSense kernel: The Regents of the University of California. All rights reserved.
Mar 18 17:48:38 pfSense kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.
Mar 18 17:48:38 pfSense kernel: FreeBSD 14.0-CURRENT amd64 1400094 #1 RELENG_2_7_2-n255948-8d2b56da39c: Wed Dec 6 20:45:47 UTC 2023
Mar 18 17:48:38 pfSense kernel: root@freebsd:/var/jenkins/workspace/pfSense-CE-snapshots-2_7_2-main/obj/amd64/StdASW5b/var/jenkins/workspace/pfSense-CE-snapshots-2_7_2-main/sources/FreeBSD-src-RELENG_2_7_2/amd64.amd64/sys/pfSense amd64
Mar 18 17:48:38 pfSense kernel: FreeBSD clang version 16.0.6 (https://github.com/llvm/llvm-project.git llvmorg-16.0.6-0-g7cbf1a259152)
If nothing is logged at reboot like that it can be a hardware issue.
I assume you didn't see a crash report after rebooting? It doesn't look like you have SWAP configured so you wouldn't see one if it panicked.
-
@stephenw10 Thank you for your time looking into the logs. I did not see any crash reports. Do you think I should configure swap in pfSense in case this happens again?
-
You would need to re-install to do so. But that would then give you a crash report if it was the result of a kernel panic.
-
@stephenw10 Then I'm just going to stick with my current setup and see if there is anything on the console the next time this happens, if it happens.
Thank you for your help, much appreciated!
-