SG3100 keeps locking up after latest update
-
@stephenw10 I uploaded them there. They're the unfiltered logs (1 min 10 sec) leading up to the system lockup: 2023-09-06 09:16:50.000 until 2023-09-06 09:18:00.000.
-
Those are all firewall logs except:
2023-09-06T09:17:00.000-04:00 pfSense.tstdomain pfSense.tstdomain /usr/sbin/cron[13050] (root) CMD (/usr/sbin/newsyslog)
2023-09-06T09:17:00.000-04:00 pfSense.tstdomain pfSense.tstdomain /usr/sbin/cron[12708] (root) CMD (/usr/local/pkg/servicewatchdog_cron.php)
What do you have enabled in Service Watchdog? That can cause problems; it should only be used for testing.
Are those the only things in the system log?
Steve
-
@stephenw10 That service was added by George Phillips when we paid for Snort integration. After he added Snort, Snort kept crashing, so he added Service Watchdog (sometime between 16 and 20 June 2023) to make sure it restarted. Our lockups started back in April, so I didn't think this was related. Also, by the time we paid to have Snort added there had only been one other lockup, so there was no trend yet; otherwise I would have cancelled that project. We ran almost a month after that project was finished before our next lockup. Here are the only packages currently running on our box:
openvpn-client-export (added by me for our VPN users)
Status_Traffic_Totals (added by me for pfSense GUI traffic totals)
Service_Watchdog (added by George for snort)
snort (added by George for snort)
sudo (added by George for snort)
Those are the only things the system sent to our syslog server. I can add a larger range if it would be more helpful. Or I can download logs directly from pfSense, but I'm not 100% sure exactly what to download or whether that has to be done via SSH.
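If it does have to be done over SSH, I assume something like this would pull the system log off the box for offline review (SSH would need to be enabled under System > Advanced; the hostname is ours, adjust as needed):

# Copy the main system log off the firewall for offline review.
# Assumes SSH is enabled on the pfSense box and we can log in as admin;
# hostname and paths are from our environment, adjust as needed.
mkdir -p ./pfsense-logs
scp "admin@pfSense.tstdomain:/var/log/system.log" ./pfsense-logs/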
-
I just downloaded logs from Diagnostics > Command Prompt > Download > /var/log/system.log
I can upload that.
-
@stephenw10 I uploaded them to the last link
-
Mmm, so effectively nothing shown:
<38>1 2023-09-06T08:45:00.077941-04:00 pfSense.tstdomain sshguard 96081 - - Exiting on signal.
<38>1 2023-09-06T08:45:00.115084-04:00 pfSense.tstdomain sshguard 67726 - - Now monitoring attacks.
<43>1 2023-09-06T10:41:32.702804-04:00 pfSense.tstdomain syslogd - - - sendto: Network is unreachable
<6>1 2023-09-06T10:41:32.703510-04:00 pfSense.tstdomain syslogd - - - kernel boot file is /boot/kernel/kernel
<43>1 2023-09-06T10:41:32.703592-04:00 pfSense.tstdomain syslogd - - - sendto: Network is unreachable
<2>1 2023-09-06T10:41:32.703873-04:00 pfSense.tstdomain kernel - - - ---<<BOOT>>---
You manually power cycled it at 10:41?
Otherwise the only things that really jump out are some Snort warnings:
snort 28423 - - S5: Session exceeded configured max bytes to queue 1048576 using 1048890 bytes (client queue)
Those could probably be prevented with some tuning but they don't look to actually be causing a problem.
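If you do want to chase that particular warning, it comes from the Stream5 preprocessor's per-session byte queue. If I remember right, the limit is the "Max Queued Bytes" value in the interface's preprocessor settings in the Snort package GUI (the default is 1048576, which matches your log), and you can check what's currently in the generated config with something like this (the directory layout is from memory, adjust to your interface folder):

# Show the Stream5 queue limits in the per-interface snort.conf files the
# pfSense Snort package generates; adjust the path if your layout differs.
grep "max_queued_bytes" /usr/local/etc/snort/*/snort.conf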
Steve
-
@stephenw10 Yes, manual power cycle was at 10:41
The device referenced in that Snort log (28423) is not necessarily a trusted device, because I don't know which user it belongs to. Once we start registering a static IP and MAC for all devices, I'll have a profile to connect to each device, making it easier for me to monitor things like that.
-
@tuser11 said in SG3100 keeps locking up after latest update:
That service was added by George Phillips when we paid for Snort integration. After he added Snort, snort kept crashing. He added the service watchdog (sometime between 16-20 June 2023) to make sure it restarted.
Service Watchdog should NEVER be used with either Snort or Suricata. It does not understand how those two IDS/IPS packages work internally, and thus Service Watchdog will needlessly submit multiple service restart commands for the IDS/IPS even when the IDS/IPS is already in the middle of restarting.
The correct approach would be to figure out why Snort was failing and address that issue. Service Watchdog is not the correct approach.
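As a starting point (just a sketch, not a full diagnosis), the system log usually records why Snort exited, either as an error from Snort itself or as a kernel "exited on signal" line, so checking there is the first step:

# Show the most recent Snort-related lines in the main system log; the last
# few before a crash usually include the error or signal that took it down.
grep -i snort /var/log/system.log | tail -n 50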
-
@bmeeks I don't have enough knowledge to know when Service Watchdog should be deployed; I agree with "The correct approach would be to figure out why Snort was failing and address that issue".
My time is split between software design and development, software security, network security, and many other things IT. I know how to deploy Snort (from the docs and Google) and I've done it in a home lab. Yet we paid for someone whose time is dedicated to network security (the Netgate team) in the hope of having a more solid deployment. The more disciplines I cover, the easier it is for me to make mistakes. After I pointed out the problem to the assigned tech and he chose to deploy Service Watchdog instead of finding the root of the problem, it was too late for us. The bill was paid and I have to try again to find someone who prefers addressing the root problem.
The next move for us will be to remove the package (or do a fresh install and import the config without it) if we confirm it is causing problems. Right now we're just trying to figure out what is causing the lockups. They started before Snort was added.
-
I'm even entertaining the idea that the brand new SG-3100 box might have had a defective power supply in it. The first box, which was in production for years, had a bad internal disk. I replaced the disk, confirmed it was working, swapped that box with a new SG-3100 (purchased years ago as a cold spare), and then put both the old SG-3100 and the old power supply in a box on the shelf. Later I might disconnect the new power supply and connect the old one, but it seems extreme and possibly a waste of time. It's kinda frustrating because it takes about 1-3 weeks before another lockup.
-
Mmm, I would say it's at least statistically unlikely to be a PSU issue. We really don't see many issues with them.
-
@stephenw10 It froze again yesterday morning with relatively low traffic. Still nothing on the console, no response to any console keystrokes, and no errors in syslog. This time I had a script running to send additional system stats to our remote syslog server; a rough sketch of it is below.
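It's basically just a cron job that polls a few counters and logs one line via syslog, which the existing remote-logging config then forwards. This is a simplified sketch, not the exact production script; the temperature OID in particular is a placeholder, since it varies by platform:

#!/bin/sh
# Rough sketch of the stats script (not the exact one running in production).
# Cron runs it as root every ~15 seconds; it logs one line via syslog and the
# existing remote-logging config forwards it, so the last reading survives a
# hard lockup.

TEMP_OID="dev.cpu.0.temperature"  # placeholder: find your platform's OID with: sysctl -a | grep -i temperature

temp=$(sysctl -n "$TEMP_OID" 2>/dev/null)
load=$(sysctl -n vm.loadavg | awk '{ printf "%s, %s, %s", $2, $3, $4 }')
pages=$(sysctl -n vm.stats.vm.v_page_count)
free=$(sysctl -n vm.stats.vm.v_free_count)
mem=$(awk -v t="$pages" -v f="$free" 'BEGIN { printf "%.2f%%", (t - f) / t * 100 }')
mbuf=$(netstat -m | awk '/mbuf clusters in use/ { print $1 }')
states=$(pfctl -si | awk '/current entries/ { print $3 }')

logger -t sysstats "CPU Temperature: ${temp} | Load Average: ${load} | Memory Usage: ${mem} | MBUF Usage: ${mbuf} | State Table Size: ${states}"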
Its last report, 13 seconds before the lockup:
CPU Temperature: 76 | Load Average: 0.13, 0.20, 0.16 | Memory Usage: 20.00% | MBUF Usage: 1234/1296/2530/10035 | State Table Size: 1215
Here is what we've done:
- Swapped the SG-3100 unit and power supply after resolving the hard drive issue on the old SG-3100. Re-confirmed the power supply was swapped by checking that the asset ID written on the power supply matches the asset ID on the SG-3100
- Eliminated the possibility of a ground loop by confirming this system doesn't have a ground potential difference between the connected buildings, and re-confirming that the electrical architecture related to the network hasn't changed in the 7 years we've been using pfSense (2 years via ESXi, 5 years with the SG-3100); our issues started in April 2023
- Doesn't appear to be firmware related, considering I'm managing another SG-3100 at a different location with the same firmware and no problems
- I left the console connected so I could see output during a failure: no output, not even with Ctrl+T
- Doesn't appear to be a load or heat issue. Normal/average temperature and loads were reported 13 seconds before the lockup: CPU Temperature: 76 | Load Average: 0.13, 0.20, 0.16 | Memory Usage: 20.00% | MBUF Usage: 1234/1296/2530/10035 | State Table Size: 1215. We've had much higher loads and temps as high as 83C without issues.
- Removed non-essential packages (only leaving snort)
- No errors in /var/log/system.log
- I "think" we've eliminated the likelihood of the cause being a surge coming from the upstream 4-port ISP modem. Two Netgate boxes are connected to it side by side (the SG-3100 for the main office with a static WAN IP, and an SG-1100 for separate facilities with a dynamic WAN IP). The SG-1100 hasn't been affected and hasn't had a lockup since it was installed in April. I'm thinking it was a coincidence that it was installed the day after the first unexplained SG-3100 lockup.
I'm not sure what else to try or what I might be missing. My assumptions are:
- If the cause is traffic related (good, bad, malicious, etc.) from the LAN or WAN, no matter how low level, an error would propagate to /var/log/system.log.
- If the cause is heat related, the temp would not rise from normal, past warning, into fatal in under 15 seconds in a room where all other temperatures remain normal. That's why I'm comfortable with the load checks only running every 15 seconds.
- If the cause is power related, it's not likely to follow the unit to a new power supply that was never used before, especially when it's plugged into an APC UPS.
- If the cause is an Ethernet cable gone bad, we would also likely see traffic problems before a total failure.
Are my assumptions wrong? Do any ideas jump out as logical next steps?
-
I confess I did not read the entire thread, just skimmed it.
Are you using RAM disks? I had a similar issue some months ago when using RAM disks, see here,
but the interval between freezes was about 14 days or so, not within a few days like yours.
But just as an idea...
Regards
-
@FSC830 No, RAM disk is disabled in System > Advanced > Miscellaneous.
-
Mmm, I run ramdisks and am not seeing any issues in 23.05.1.
So to confirm: you're seeing this in every version of pfSense since 23.01?
If you reinstall 22.05, does it stop happening?
If you run one of those 3100s with 23.05.1 in a different location, does it still happen?
Steve
-
@stephenw10 Yes, in every version since 23.01. I've only run the 2 boxes at this location. At the other location I'm running a 3100 with the same software without problems. Both locations have very similar configurations and all the same packages installed. I have not tried taking one of the units from the location that has a primary and a backup and moving it to another location; I've just been swapping the primary and backup in the same office.
Yesterday I did a fresh install from a support image I had on file to the backup box, upgraded it to the latest version from there, then restored the configuration and put it back into production, but elevated the unit for more airflow. Even though my logs show the temp didn't break 76C 12 seconds before the last failure, I don't know how accurate the pfSense temp monitor is, or whether the assumptions outlined in my last post are correct.
I could swap it with one in my home lab, but I'd rather not, and instead do whatever troubleshooting steps (if any more exist) will confirm the issue is isolated. Moving boxes between buildings, instead of just using the 2 boxes at this one location, could potentially interrupt workflows somewhere else. The problem definitely seems isolated to this location, based on observing it with 2 boxes and 2 power supplies in the same building.
-
The biggest thing that makes me think it's something unique to that install/location is that there are a lot of 3100s running 23.05.1 and if this were common to all 23.0X installs we would be flooded with support tickets.
It almost has to be some combination of unusual things in that specific setup. Testing either of those units in a different location would confirm that.
One other thing we could try is using the debug kernel. If there is some issue it might throw some additional errors before it stops responding. I wouldn't really expect to see anything else when it stops though, as it logs nothing at all currently.
Steve
-
It's been over a month since the last lockup, after consistently locking up 2-4 times a month. There was also an unmanaged switch that was randomly failing (3 switches downstream of the router and only carrying traffic for 2 computers, so seemingly unrelated) and it magically stopped failing. I've only been troubleshooting networks for ~9 years and it's not my primary job, but I've never seen hardware problems go away on their own while usage remains the same.
The only things that have happened since the last lockup:
- Kicked everyone off the wifi networks (even the separate guest wifi) for a few days after the last failure. The networks are segregated and firewalled, but without high confidence in my log analysis this seemed like a fair step.
- Publicly made plans in the office, after getting the green light from the owner, to start locking down the network (every device would have to have a registered MAC and static IP pair before being allowed on the network) if the problem persisted.
- Re-allowed everyone to use wifi as usual, with the knowledge that the network will eventually be locked down (no more personal devices able to easily get on) to help isolate the mysterious problems if they keep occurring.
No changes were made, just plans for the next move. All hardware and general usage has remained unchanged, and in over a month, not one failure.
Any ideas about how the issues mysteriously went away (or at least haven't happened in ~38 days)?
-
Some general power issue maybe?
-
@stephenw10 Not sure, since power hasn't changed to my knowledge. And my hope would be that power issues at that scale would also affect other equipment. For example, the issue with the switch was only going on for about a month, whereas the router issues date back to before the summer.
My initial hunch was a power user or script kiddie, based on the environment and employee history.