SG-2440 Possible hardware failure (unknown)
I'm hoping to get some advice on where to look next to diagnose a strange issue we've been seeing with our SG-2440 (running pfSense 2.4.2, and later 2.4.4-RELEASE-p2, with the same behaviour).
In the past week or so we've had three separate incidents where we've lost access to all of the management interfaces (web interface, SSH, directly connected console): remote requests time out, and the local console will connect but returns either 'OK' or nothing. During this time the device seems to be routing traffic correctly, as all of our internal users are able to get out to the Internet; however, all of our VPN (IPsec and OpenVPN) links are dead.
The only way to get it back online has been to power cycle the device while holding the reset pin down and then restore from backup.
Based on the fact that we haven't seen any smoking guns in the logs we're thinking that it could be hardware related but that's more of a guess right now.
Please let me know if any of you have run into anything similar or know where I can be looking closer to determine a root cause for this issue. Any help is appreciated.
Hmm, it shows 'OK' at the serial console when you hit return but nothing else?
Ctl-c does nothing?
Ctl-t will often respond when nothing else does. It shows whatever the current process is, which can be useful if it's locked up on something. It also proves the system is running, though if it's still routing traffic it must be.
Can you open new connections at that point, or do only existing connections keep passing traffic?
The most likely way to get info, when there's nothing in the system logs, is to log the console output if you can. That may take some time if it's not easily replicable.
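As a rough illustration of logging the console output, here is a minimal Python sketch that copies lines from a stream to a log file with a timestamp on each line. It would run on a separate machine attached to the serial console, not on the firewall. The function name `log_console` and the device and file names in the comments are hypothetical, and it assumes the port speed has already been set by other means. Most terminal programs can also log a session on their own, so treat this as one option rather than the recommended method.

```python
import datetime


def log_console(src, dst, now=datetime.datetime.now):
    """Copy lines from src to dst, prefixing each with a timestamp.

    src yields bytes lines (for example a serial device opened in binary
    mode); dst is a text-mode file. Returns the number of lines written.
    """
    count = 0
    for raw in src:
        # Console output can contain stray bytes; do not let decoding abort the log.
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        dst.write(f"{now().isoformat(timespec='seconds')} {text}\n")
        # Flush after every line so the last messages before a hang reach the file.
        dst.flush()
        count += 1
    return count


# Hypothetical usage (device path and log name are assumptions):
#   with open("/dev/ttyUSB0", "rb", buffering=0) as src, open("console.log", "a") as dst:
#       log_console(src, dst)
```

Flushing on every line is deliberate: if the box locks up, the final lines before the hang are the ones that matter.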
Thanks @stephenw10 ,
I'll try entering ctrl+t next time to see if that works. ctrl+c does in fact do nothing, as the console appears to be locked up to all other input.
Once it gets into this state, no new connections, either local or remote, can be opened as far as I know. Requests to the web management page time out, and remote SSH connections appear to connect but show the same thing as the console output does (minus the "OK"). I verified that it does accept the connection by viewing the verbose output from SSH.
The part that really confuses me is how it continues to route traffic out the WAN interface. If it were failing, I would expect it to just up and die. But it sort of limps along with some functionality for a period of time. I'm not sure how long that would last as by then we're already back in the office having to reboot/rebuild it.
Thanks again for your help,
You can see odd behaviour like that if it exhausts the state table. You should see that in the monitoring graphs though, along with messages reflecting it in the system log.
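If you want to check state table usage by hand, `pfctl -si` reports the current number of state entries and `pfctl -sm` reports the hard limits. The sketch below parses saved text from those two commands and returns how full the table is. The function name `state_table_usage` is hypothetical, and the regular expressions assume the usual "current entries" and "states hard limit" lines, so adjust them if your output looks different.

```python
import re


def state_table_usage(info_text, limits_text):
    """Return (current, limit, fraction) from `pfctl -si` and `pfctl -sm` text."""
    # The first "current entries" line in `pfctl -si` belongs to the state table.
    cur = re.search(r"current entries\s+(\d+)", info_text)
    # `pfctl -sm` lists one limit per line; the state limit starts with "states".
    lim = re.search(r"^states\s+hard limit\s+(\d+)", limits_text, re.MULTILINE)
    if not cur or not lim:
        raise ValueError("could not find state counters in pfctl output")
    current, limit = int(cur.group(1)), int(lim.group(1))
    return current, limit, (current / limit if limit else 0.0)
```

A fraction that sits close to 1.0 around the time of an incident would point towards state exhaustion. A low value would suggest looking elsewhere.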
You can also see odd things if the system is unable to write to root. That can cause services to slowly drop out as they attempt to restart. And since logs cannot be written, it can be easy to miss. You would see error messages in the console though, if you are logging that output. So maybe the drive is completely full, or it stops responding for some reason but recovers after a power cycle. That usually only ever happens on an SSD. The eMMC, if it fails, usually just disappears forever.
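To test the "cannot write to root" theory while the box is still reachable, something like the following could be run periodically. It is only a sketch: it assumes a Python 3 interpreter is available on the firewall, which I have not verified for this pfSense version, and the function name `check_writable` is hypothetical. It reports free space and tries to write and sync a small temporary file. `df -h /` from a shell gives the free space figure without any scripting.

```python
import os
import shutil
import tempfile


def check_writable(path="/"):
    """Report free space on path and whether a small file can be written there."""
    usage = shutil.disk_usage(path)
    try:
        # Create, write, and fsync a temp file; a full or read-only
        # filesystem should raise OSError here.
        with tempfile.NamedTemporaryFile(dir=path) as tmp:
            tmp.write(b"write test")
            tmp.flush()
            os.fsync(tmp.fileno())
        writable = True
    except OSError:
        writable = False
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return {"free_bytes": usage.free, "used_fraction": used_fraction, "writable": writable}
```

Note that a drive which has stopped responding may make this call hang rather than fail, which is itself a useful data point.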