SG-3100 stops responding every 2 days on 24.03
-
We have a 3100 that was upgraded to 24.03 on May 28th. Since then it stops responding roughly every 2 days and has to be power-cycled to recover. This has happened 3 times now.
The last entries in the system log are
kernel: fq_codel_new_sched cannot allocate memory for fq_codel configuration parameters
kernel: si_new new_sched error
CoDel limiters were configured but yesterday I deleted the traffic limiters and floating firewall rules, which obviously didn't help.
The firewall is monitored using Zabbix, and the graphs show that memory usage climbs steadily from reboot until the firewall stops responding:
The free memory is above 1.2GB during this time:
Other graphs don't seem to indicate anything that would explain this (MBUFs and states remain relatively flat).
I've attached the other Zabbix graphs in case they are useful:
graphs.zip
After reboot, the dashboard had a PHP crash error log that contained 3 lines and then 8 lines of
PHP Fatal error: Unable to start pfSense module in Unknown on line 0
I also have the logs from the device, but I don't want to post them publicly as I'd need to go through and remove any sensitive information first. I can PM them to someone if they want to have a look.
-
Run
top -HaSP
and sort by memory usage. See what is growing.
-
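As a complement to the `top -HaSP` suggestion above, here is a minimal sketch for comparing two process snapshots taken some hours apart, to see which processes grew. The function name `rss_growth` and the snapshot file names are hypothetical; the snapshots are assumed to come from `ps -axo pid,rss,comm`, which reports resident memory (RSS) in KiB.

```sh
#!/bin/sh
# Print processes whose resident memory (RSS, in KiB) grew between two
# snapshots, largest growth first. Each snapshot is assumed to be made with:
#   ps -axo pid,rss,comm > snapshot.txt
# Output columns: growth_in_KiB command pid
rss_growth() {
    # $1 = earlier snapshot file, $2 = later snapshot file
    awk '
        FNR == 1 { next }                        # skip the ps header line
        NR == FNR { before[$1] = $2 + 0; next }  # first file: RSS per PID
        ($1 in before) && ($2 + 0 > before[$1]) {
            printf "%d %s %d\n", $2 - before[$1], $3, $1
        }
    ' "$1" "$2" | sort -rn
}
```

For example, `rss_growth before.txt after.txt | head` would list the biggest growers. If nothing stands out here while total usage keeps climbing, the growth may be in kernel memory, which this sketch does not capture.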
@stephenw10
I see that since the upgrade to 24.03, the system log is full of these messages every 10-30 seconds:
ugen1.2: <CPS ST Series> at usbus1
ugen1.2: <CPS ST Series> at usbus1 (disconnected)
From what I can tell, this is a Cyberpower UPS connected to the USB port. I've found some reports that some USB devices, including some Cyberpower units, will reset if they don't establish a connection within a certain amount of time.
Could this repeated USB connection & disconnection cause a memory leak of some kind?
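To put a number on how often the device is flapping, something like the following could be used. It is only a sketch: the function name `count_usb_flaps` is hypothetical, and it assumes the log is plain text with lines in the same `ugenX.Y: <name> at usbusN` form as the messages quoted above.

```sh
#!/bin/sh
# Count USB attach and disconnect events per device in log text read
# from stdin. Matches lines such as:
#   ugen1.2: <CPS ST Series> at usbus1
#   ugen1.2: <CPS ST Series> at usbus1 (disconnected)
count_usb_flaps() {
    awk '
        match($0, /ugen[0-9]+\.[0-9]+: <[^>]*> at usbus[0-9]+/) {
            dev = substr($0, RSTART, RLENGTH)
            sub(/:.*/, "", dev)                  # keep only "ugenX.Y"
            if ($0 ~ /\(disconnected\)/) detach[dev]++
            else attach[dev]++
            seen[dev] = 1
        }
        END {
            for (d in seen)
                printf "%s attach=%d detach=%d\n", d, attach[d], detach[d]
        }
    ' | sort
}
```

Usage would be along the lines of `count_usb_flaps < /var/log/system.log`, assuming that is where the system log lives on the unit. Comparing the counts before and after the upgrade date would show whether the flapping really started with 24.03.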
-
Potentially it could. Try unplugging it and see if the leak stops.
I assume you are running NUT? Do you have the current package version installed?
-
@stephenw10 The UPS is at a remote site, so I can't unplug the UPS.
I was able to disable the USB port using usbconfig, and that has stopped the log entries. I tried to set up NUT but it wouldn't connect to the UPS. I don't know if it couldn't connect because the USB connection kept resetting so frequently or if that is unrelated. I tried changing the polling settings to some suggestions I found, but that didn't help. NUT is currently uninstalled.
-
Hmm, OK, interesting. Well I guess you'll know in a few hours.
-
@andrew_cb I vaguely recall posts about Zabbix being a problem of some kind, but I don't use it and couldn't find it in a quick search. This may just be a red herring, and if so I apologize, but you could try disabling that for a while and just looking at pfSense's graphs.
-
Could be this if Zabbix is using SNMP: https://redmine.pfsense.org/issues/15481
-
We have Zabbix on 40 other Netgates without issue, including two SG-3100 running 24.03.
I don't think we're doing any SNMP monitoring, just Zabbix Agent (active) and Zabbix Proxy both running on the firewalls.
Memory usage is holding flat (it's actually decreased slightly), so it might be that disabling USB to work around the UPS issue is the fix?
It will be interesting to see how it looks tomorrow morning.
-
Mmm, interesting indeed. It's not something we'd ever normally see so it could have a leak that's simply never been hit.
-
So disabling the "flapping" USB seems to have resolved the escalating memory usage. I don't know if the non-responsive issue is resolved though, as we replaced the affected unit earlier today with a 4100 because we couldn't "see how it goes" and risk any further interruptions and downtime at this customer.
Past 90 days. The memory usage on all 10 of our SG-3100 units was flat and nearly identical.
Past 11 days. The affected unit was upgraded on 05/28 and the USB ports were disabled on 06/04, and the memory usage remained flat afterward.
I should be able to play with the 3100 next week and will try to reproduce the issue on the test bench. Hopefully, that will shed light on what's happening and lead to identifying the root cause.
-
Interesting. How did you disable the USB exactly in that case?
-
I don't recall the exact command but it was something like
usbconfig -i ugen0.2 detach_kernel_driver
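For anyone trying to reproduce this: in usbconfig(8), `-d` selects the device and `-i` selects an interface index, so the command was probably closer to the form below. This is a guess at the shape, not the confirmed command. The helper name `usb_disable_cmds` is hypothetical, the interface index 0 is an assumption, and `ugen1.2` is simply the device name from the log messages earlier in the thread. The helper only prints the commands so they can be reviewed before being run on a remote box.

```sh
#!/bin/sh
# Print (not run) candidate usbconfig commands for a given USB device.
# Assumption: the device name (e.g. ugen1.2) comes from `usbconfig list`
# or from the ugenX.Y prefix seen in the system log.
usb_disable_cmds() {
    dev="$1"          # device name, e.g. ugen1.2
    iface="${2:-0}"   # interface index; 0 is an assumed default
    # Detach the kernel driver from one interface of the device
    printf 'usbconfig -d %s -i %s detach_kernel_driver\n' "$dev" "$iface"
    # Alternative: power the device off entirely
    printf 'usbconfig -d %s power_off\n' "$dev"
}
```

Running `usb_disable_cmds ugen1.2` prints both variants. Which of the two (if either) actually stops the attach/disconnect loop on the 3100 would need to be checked on the test bench.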