Pfsense system crash
-
I have been using pfSense on the same hardware since late 2020, trouble-free.
My system info is...
BIOS Vendor: American Megatrends Inc.
Version: P1.30
Release Date: Fri May 4 2018
Version 22.01-RELEASE (amd64)
built on Mon Feb 07 16:37:59 UTC 2022
FreeBSD 12.3-STABLE
The system is on the latest version.
Version information updated at Fri Apr 1 18:01:44 CDT 2022
CPU Type Intel(R) Celeron(R) J4005 CPU @ 2.00GHz
Current: 1800 MHz, Max: 2001 MHz
2 CPUs: 1 package(s) x 2 core(s)
AES-NI CPU Crypto: Yes (active)
QAT Crypto: No
Hardware crypto AES-CBC,AES-CCM,AES-GCM,AES-ICM,AES-XTS,SHA1,SHA256
Kernel PTI Enabled
MDS Mitigation Inactive
Uptime 00 Hour 13 Minutes 58 Seconds
NIC HP NC364T (4 port)
In February of this year I upgraded to 2.6 and then to 22.01, and I did not seem to have any issues. Then I learned that ZFS was the default file system, so I reinstalled using ZFS instead of GPT (UFS). All was fine after the install.
Then, while on vacation, I found that I could no longer contact my home media server, and my daughter said there was no internet. When I arrived home I found that pfSense was just locked up. A hard reboot of the PC fixed it and all was working again.
Looking at the logs, I found that all the different logs just stopped at the time of the crash. Nothing was written to any log from the moment of the crash until the PC was rebooted. After the reboot pfSense continued working for a few days, then crashed again, with the logs once more stopping at the time of the crash.
I figured I had a hardware issue, maybe RAM or HD, but tests of both showed zero errors. This crashing has happened twice this week. The CPU runs at 34-37 deg C and at less than 5% load most of the time. Memory usage is 7% of 7772 MiB.
So I decided to go back to the GPT (UFS) file system, which seemed to work fine before, but I wish I had a better handle on what the problem is. I can't think of what else it could be.
I'm open to suggestions :)
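The "logs just stop at the time of the crash" observation can be checked mechanically. Below is a minimal Python sketch that scans log lines for long silent periods. It assumes BSD-style syslog timestamps at the start of each line (for example `Apr  1 18:01:44`), which may not match every pfSense log format setting; the function names `parse_stamp` and `find_gaps` and the 30-minute threshold are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def parse_stamp(line: str, year: int) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a BSD-style syslog line.

    Assumption: the stamp occupies the first 15 characters and carries no
    year, so the caller has to supply one.
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_gaps(
    lines: Iterable[str],
    year: int,
    threshold: timedelta = timedelta(minutes=30),
) -> List[Tuple[datetime, datetime]]:
    """Return (last_entry_before, first_entry_after) pairs for silent periods."""
    gaps = []
    previous = None
    for line in lines:
        try:
            current = parse_stamp(line, year)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and current - previous >= threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

Running this over each log file and comparing the start of the longest gap across files would show whether they all went quiet at the same moment, which would point to a whole-system stall rather than a single service failing.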
-
Anything appear on the console when this happens?
Does it respond to Ctrl+T?
Check the monitoring graphs after rebooting. Do they show the filesystem filling up or RAM being used up (a memory leak)?
Steve
-
@stephenw10
Thanks for the reply. When the system was in the hung state I connected a monitor and keyboard but could not see anything on the display. I tried Ctrl+Alt+Del but got no response. At that point I forced a hard reset.
After reboot I did not notice any high RAM or CPU usage on the graph.
I intend to let the system run as long as it will to see if it will fail again using the GPT file system. At the end of the week, if nothing happens, I may try loading pfSense using ZFS and see if I can get it to crash. I left the monitor and keyboard connected so I can try the command you suggested.
-
pfSense has been running for 4 days now without any issues using the GPT UFS file system. I'll probably go another 2 days and, if nothing fails, reinstall pfSense using the ZFS file system.
Assuming I do have a failure with ZFS again and the system locks up, do you have any suggestions for what to try after I press Ctrl+T?
-
If it shows anything there we can look at what was hanging.
If not, the best bet is to leave the console connected or log the output on a serial console. Some failures will show errors there and nowhere else. Disk/filesystem errors in particular may never make it into the logs, because the logs live on the disk that is failing.
Steve
-
@stephenw10 Thanks.
Went ahead and reinstalled pfSense using default ZFS except I bumped up the swap to 4G. Everything is configured as before. The only change in system resources so far is an increase in memory usage from 8% with GPT to 18% for ZFS.
So, now just wait and see. Monitor and keyboard are still connected.
-
With the latest pfSense Plus and a ZFS file system install I ran into a system freeze again. This time I noticed that remote desktop was acting sluggish, so I tried to log in to the pfSense web interface, but the browser said the site could not be reached. However, I was still able to use the console and performed a reboot from there. All was well after the reboot.
From a shell I ran the "top" command and discovered that ntopng normally used less than 1% CPU, but occasionally its CPU usage would jump to 65%.
I now suspect that ntopng has a memory leak, or maybe it just needs more horsepower than my CPU can provide at times and causes a crash.
So I disabled the ntopng service and am going to see if that fixes my system freezing.
-
That's a good test. You should be able to see a memory leak in ntop-ng in the top output though.
It could also be ntop struggling due to traffic from some other issue.
Steve
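One way to make the "visible in top" check concrete: note the resident memory (the RES column in top) of the suspect process at regular intervals and look at the trend. The Python sketch below is only an illustration under assumptions; the function names, the thresholds and the idea of sampling by hand are not from this thread. It fits a least-squares slope to the samples and flags a series that rises almost every time.

```python
from typing import Sequence


def rss_growth_per_sample(samples: Sequence[float]) -> float:
    """Least-squares slope of the samples, in KiB per sampling interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(
    samples: Sequence[float],
    min_slope_kib: float = 1.0,          # hypothetical threshold
    min_rising_fraction: float = 0.8,    # hypothetical threshold
) -> bool:
    """True if memory trends upward and rises on most consecutive samples."""
    rising = sum(1 for a, b in zip(samples, samples[1:]) if b > a)
    fraction = rising / (len(samples) - 1)
    return (
        rss_growth_per_sample(samples) >= min_slope_kib
        and fraction >= min_rising_fraction
    )
```

A process whose memory wobbles around a fixed value gives a slope near zero and is not flagged; one that climbs steadily between samples is. Occasional CPU spikes with flat memory would point away from a leak and toward load.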
-
@vcr58 Hi there. I have the same problem.
I'm also using a J4005 NUC and also tried everything on a J5005 NUC.
I can reproduce this behaviour by downloading a large file on one VLAN or by moving large files from one VLAN to another. I have 6 VLANs configured.
I've tried traffic shaping for bufferbloat, disabled hardware offloading, reinstalled and RECONFIGURED the firewall from scratch (without any services running), and also bought a new NUC just to be sure. Same problem.
This problem started occurring 3-5 weeks ago.
-
@fim said in Pfsense system crash:
This problem started occurring 3-5 weeks ago
Like spontaneously or after an upgrade? Some other change?
-
@stephenw10
Thinking back, the firewall did start behaving less reliably overall after the 2.5.0 upgrade. I thought the hardware was too weak, so I bought the J5005 NUC. Same problem. The turning point was adding 3 more OpenVPN servers. After that, the problem I described occurred after every stress test.
-
So you are also only seeing latency issues when running ZFS?
I assume you're running 2.6 now?
-
@stephenw10 I did some more testing with a clean configuration for the last 2 hours.
Hardware: Intel NUC J5005
Configuration:
- 3 VLANs (1-WAN PPPoE; 2-LAN; 3-OPT1)
- PPPoE is bridged
- I also tried to use the ISP modem (Zyxel XMG3927-B50A) as a router
It doesn't matter whether ZFS is used or not. Transferring a large file (30 GB) from LAN to OPT1 makes the firewall crash after a few seconds. It takes longer to crash if WAN has no config. There is nothing in the logs, as it is a full system crash that requires a hard reboot.
Yes, I'm running 2.6.
-
So it just appears to lock up? No response at the console? Even to Ctrl+T?
And there is no crash report shown when it reboots?
-
@stephenw10 said in Pfsense system crash:
That's a good test. You should be able to see a memory leak in ntop-ng in the top output though.
It could also be ntop struggling due to traffic from some other issue.
Steve
With ntop-ng disabled I have not had any issues so far. A YouTube video I watched recommended that ntop-ng not be left running all the time anyway, since it is somewhat of a resource hog.
Thanks.
-
@stephenw10 I have been running pfSense now for 22 days, but just now I am not being served any IP addresses and cannot log in. My cable modem says everything is fine, so it's pfSense that is not running. Ctrl+T does respond with:
"load: 0.00 cmd: login 53946 [tx->tx_sync_done_cv] 1881545.20r 0.00u 0.00s 0% 2708k"
Anything else I should do? I know a reboot will fix it for a while.
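For anyone comparing several of these Ctrl+T status lines, here is a small Python sketch that splits one into named fields. The regular expression is an assumption based only on the line quoted above (load average, command, PID, wait channel in brackets, real/user/system seconds, CPU percent, resident size in k); `parse_status_line` is a made-up helper name, and lines in other layouts will simply return `None`.

```python
import re
from typing import Optional

# Pattern inferred from the single status line quoted in this thread.
STATUS_RE = re.compile(
    r"load:\s*(?P<load>[\d.]+)\s+"
    r"cmd:\s*(?P<cmd>\S+)\s+(?P<pid>\d+)\s+"
    r"\[(?P<wchan>[^\]]*)\]\s+"
    r"(?P<real>[\d.]+)r\s+(?P<user>[\d.]+)u\s+(?P<sys>[\d.]+)s\s+"
    r"(?P<cpu>\d+)%\s+(?P<rss>\d+)k"
)


def parse_status_line(line: str) -> Optional[dict]:
    """Split a Ctrl+T status line into fields, or return None if it does not match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return {
        "load": float(m["load"]),
        "cmd": m["cmd"],
        "pid": int(m["pid"]),
        "wait_channel": m["wchan"],          # what the process is blocked on
        "real_seconds": float(m["real"]),
        "user_seconds": float(m["user"]),
        "sys_seconds": float(m["sys"]),
        "cpu_percent": int(m["cpu"]),
        "rss_kib": int(m["rss"]),
    }
```

Applied to the line above, the real time of 1881545.20 seconds works out to just under 22 days, which matches the reported uptime, and the field in brackets is `tx->tx_sync_done_cv`. By its name that looks like a wait on a ZFS transaction group sync, but treat that reading as an assumption rather than a diagnosis.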
-
Here is a pic of the monitor connected and the output before the crash.
-
@vcr58 said in Pfsense system crash:
Here is a pic of the monitor connected and the output before the crash.
Your issue is related to write I/O to the system disk. It seems your disk goes missing or stops responding. This would also explain why you get better stability without ntopng, as that package in particular does A LOT of write I/O.
Strange that it only happens with ZFS and not UFS. But ZFS uses a very different write strategy and is quite write-intensive in bursts compared to UFS. So it would seem your SSD/eMMC/HDD is the culprit. Please remember that eMMC and ntopng in particular are not a good match, as the write endurance could be worn out in a matter of a year or two.
-
Mmm, that looks like a bad/failing disk. It should never stop responding like that.
I would replace it and retest when you can.
Steve
-
@stephenw10 - I suppose it could be the SSD going bad, although I never get any errors when running a scan on it. What @keyser said makes sense to me as well.
I do have an older WD Green SSD, so I will try that one and see what happens.
Thanks