SG-3100 hangs every 1-2 days
-
Hi,
I have several Netgate SG-3100 units.
One of those units appears to lock up or hang every 1-2 days, which is quite strange.
It is running 2.5.0-devel (as are several other units), but I'm not clear if it's a hardware issue, or a software issue. I have done updates to the latest DEV builds, and the issue persists.
Symptoms:
- All LAN hosts lose connectivity.
- Unable to connect to the pfSense Web UI.
- If I connect via the Mini-USB console plug, the unit appears unresponsive.
- The front lights are flashing sequentially in blue, and the rear LAN activity lights are flashing.
I have excerpted /var/log/system.log below. The router appears to have hung sometime around 02:54 hrs; we then rebooted it at 06:10 (by pulling the power plug and reinserting it).
Oct 15 02:42:59 foobarrouter sshguard[90086]: Attack from "167.99.131.243" on service SSH with danger 2.
Oct 15 02:47:05 foobarrouter sshd[41617]: Did not receive identification string from 222.161.223.147 port 43260
Oct 15 02:47:05 foobarrouter sshguard[90086]: Attack from "222.161.223.147" on service SSH with danger 10.
Oct 15 02:48:30 foobarrouter sshd[16776]: Unable to negotiate with 218.92.0.185 port 37838: no matching key exchange method found. Their offer: diffie-hellman-group1-sha1,diffie-hellman-group14-sha1,diffie-hellman-group-exchange-sha1 [preauth]
Oct 15 02:48:30 foobarrouter sshguard[90086]: Attack from "218.92.0.185" on service SSH with danger 10.
Oct 15 02:50:46 foobarrouter sshd[93530]: Connection closed by 51.210.14.124 port 53144 [preauth]
Oct 15 02:50:46 foobarrouter sshguard[90086]: Attack from "51.210.14.124" on service SSH with danger 2.
Oct 15 02:52:20 foobarrouter sshd[33215]: Connection closed by 167.172.52.225 port 42462 [preauth]
Oct 15 02:52:20 foobarrouter sshguard[90086]: Attack from "167.172.52.225" on service SSH with danger 2.
Oct 15 02:53:05 foobarrouter sshd[69645]: Unable to negotiate with 122.194.229.122 port 59920: no matching key exchange method found. Their offer: diffie-hellman-group1-sha1,diffie-hellman-group14-sha1,diffie-hellman-group-exchange-sha1 [preauth]
Oct 15 02:53:05 foobarrouter sshguard[90086]: Attack from "122.194.229.122" on service SSH with danger 10.
Oct 15 02:54:14 foobarrouter rc.gateway_alarm[7799]: >>> Gateway alarm: OPT8GW (Addr:110.175.245.125 Alarm:1 RTT:14.945ms RTTsd:4.263ms Loss:22%)
Oct 15 02:54:14 foobarrouter check_reload_status[386]: updating dyndns OPT8GW
Oct 15 02:54:14 foobarrouter check_reload_status[386]: Restarting ipsec tunnels
Oct 15 02:54:14 foobarrouter check_reload_status[386]: Restarting OpenVPN tunnels/interfaces
Oct 15 02:54:14 foobarrouter check_reload_status[386]: Reloading filter
Oct 15 06:20:02 foobarrouter syslogd: kernel boot file is /boot/kernel/kernel
Oct 15 06:20:02 foobarrouter kernel: ---<<BOOT>>---
Oct 15 06:20:02 foobarrouter kernel: Copyright (c) 1992-2020 The FreeBSD Project.
Oct 15 06:20:02 foobarrouter kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Oct 15 06:20:02 foobarrouter kernel: The Regents of the University of California. All rights reserved.
Oct 15 06:20:02 foobarrouter kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.
Oct 15 06:20:02 foobarrouter kernel: FreeBSD 12.2-STABLE 56359d090cf(factory-devel-12) pfSense-SG-3100 arm
Oct 15 06:20:02 foobarrouter kernel: FreeBSD clang version 10.0.1 (git@github.com:llvm/llvm-project.git llvmorg-10.0.1-0-gef32c611aa2)
Oct 15 06:20:02 foobarrouter kernel: CPU: ARM Cortex-A9 r4p1 (ECO: 0x00000000)
Oct 15 06:20:02 foobarrouter kernel: CPU Features:
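In case it helps anyone narrow down the hang window in a longer log, here is a rough Python sketch that finds the longest silence between consecutive timestamped entries. The function name largest_gap is made up for illustration, and the year is an assumption (classic syslog stamps omit it; 2020 is taken from the kernel copyright line).

```python
from datetime import datetime, timedelta


def largest_gap(lines, year=2020):
    """Return (gap, last_line_before, first_line_after) for the longest silence."""
    best = (timedelta(0), None, None)
    prev_time, prev_line = None, None
    for line in lines:
        try:
            # Classic syslog stamps look like "Oct 15 02:54:14" (15 characters, no year).
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev_time is not None and stamp - prev_time > best[0]:
            best = (stamp - prev_time, prev_line, line)
        prev_time, prev_line = stamp, line
    return best


# Three entries copied from the excerpt above.
sample = [
    'Oct 15 02:53:05 foobarrouter sshguard[90086]: Attack from "122.194.229.122" on service SSH with danger 10.',
    "Oct 15 02:54:14 foobarrouter check_reload_status[386]: Reloading filter",
    "Oct 15 06:20:02 foobarrouter syslogd: kernel boot file is /boot/kernel/kernel",
]
gap, before, after = largest_gap(sample)
print("Longest silence:", gap)
print("Last entry before it:", before)
print("First entry after it:", after)
```

The last entry before the gap is only the last thing that reached the log, not necessarily the cause of the hang.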
-
@victorhooi said in SG-3100 hangs every 1-2 days:
(pulling the power plug and reinserting)
Edit: and give sshguard a break: remove all firewall "SSH" rules on WAN. Only an incoming VPN rule should be there.
-
Connect the serial console to something and log that if you can. That will show if something is throwing an error and causing the reboot.
Do you have the watchdog enabled in System > Advanced > Misc?
If you disable it, does it just hang rather than reboot? That might show something at the console if you can't log all its output.
And, yeah, you should lock down the WAN from all those drive-by SSH attempts!
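If you have nothing else to log the console with, a small script on any attached machine can do it. This is only a sketch: stamp_lines is a made-up helper, the device path is whatever your USB serial adapter shows up as, and it assumes the port speed has already been set by other means (for example with stty). It prefixes each console line with the logging host's wall-clock time, so the last output before a hang can be lined up against system.log.

```python
import sys
from datetime import datetime


def stamp_lines(src, dst, now=datetime.now):
    """Copy lines from a binary stream to a text stream, prefixing each with a timestamp."""
    count = 0
    for raw in src:
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        dst.write(f"{now().isoformat(timespec='seconds')} {text}\n")
        dst.flush()  # write through immediately so the last lines before a hang are kept
        count += 1
    return count


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical paths): python consolelog.py <serial-device> <logfile>
    with open(sys.argv[1], "rb", buffering=0) as port, open(sys.argv[2], "a") as log:
        stamp_lines(port, log)
```

If the log simply stops with no error before the next boot banner, that points away from a panic and toward a hard lock.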
Steve
-
I ran a filesystem check per the video link and the online documentation. Output seems to be clean from what I can see:
Enter full pathname of shell or RETURN for /bin/sh:
# fsck -fy /
** /dev/diskid/DISK-CEF032182700245s2a
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
SUMMARY INFORMATION BAD
SALVAGE? yes
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? yes
BLK(S) MISSING IN BIT MAPS
SALVAGE? yes
33074 files, 440466 used, 6993345 free (26281 frags, 870883 blocks, 0.4% fragmentation)
***** FILE SYSTEM IS CLEAN *****
***** FILE SYSTEM WAS MODIFIED *****
# fsck -fy /
** /dev/diskid/DISK-CEF032182700245s2a
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
33074 files, 440466 used, 6993345 free (26281 frags, 870883 blocks, 0.4% fragmentation)
***** FILE SYSTEM IS CLEAN *****
# fsck -fy /
** /dev/diskid/DISK-CEF032182700245s2a
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
33074 files, 440466 used, 6993345 free (26281 frags, 870883 blocks, 0.4% fragmentation)
***** FILE SYSTEM IS CLEAN *****
# fsck -fy /
** /dev/diskid/DISK-CEF032182700245s2a
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
33074 files, 440466 used, 6993345 free (26281 frags, 870883 blocks, 0.4% fragmentation)
***** FILE SYSTEM IS CLEAN *****
# fsck -fy /
** /dev/diskid/DISK-CEF032182700245s2a
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
33074 files, 440466 used, 6993345 free (26281 frags, 870883 blocks, 0.4% fragmentation)
***** FILE SYSTEM IS CLEAN *****
I have checked - I do have the watchdog enabled, and set to 128 seconds. (This appears to be the default, as I don't believe I've ever changed this setting.)
When the issue occurs, the SG-3100 doesn't reboot - it simply hangs, the activity lights keep flashing, but there's no longer any internet connectivity for LAN hosts, and you can't SSH or connect to the web interface.
I've plugged it into a console server to log the serial output - it doesn't seem to log anything when the issue happens. The next line after the pfSense console prompt is, I believe, from the startup (after we've power-cycled the unit):
WIFI (opt9) -> mvneta1.65 -> v4: 10.7.65.1/23
PRINTERS (opt10) -> mvneta1.148 -> v4: 10.7.148.1/24
HIKVISION_DEFAULT (opt11) -> mvneta1.99 -> v4: 192.168.1.1/24
 0) Logout (SSH only)                  9) pfTop
 1) Assign Interfaces                 10) Filter Logs
 2) Set interface(s) IP address       11) Restart webConfigurator
 3) Reset webConfigurator password    12) PHP shell + pfSense tools
 4) Reset to factory defaults         13) Update from console
 5) Reboot system                     14) Disable Secure Shell (sshd)
 6) Halt system                       15) Restore recent configuration
 7) Ping host                         16) Restart PHP-FPM
 8) Shell
Enter an option:
General initialization - Version: 1.0.0
AVS selection from EFUSE disabled (Skip reading EFUSE values)
Overriding default AVS value to: 0x23
Detected Device ID 6820
High speed PHY - Version: 2.0
Init Customer board board SerDes lanes topology details:
 | Lane # | Speed| Type |
 ------------------------------|
 | 0 | 3 | SATA0 |
 | 1 | 5 | PCIe0 |
 | 2 | 3 | SATA1 |
Is it an issue with the unit or something else?
-
With no output at all and the console itself unresponsive, it does start to look more like a hardware issue. That said, you are running a 2.5 snapshot, so you might be hitting a hard software lock somehow.
I would re-install 2.4.5p1 as a next step to test that.
Steve
-
Hmm, the issue is still happening on 2.4.5p1. It seems slightly less frequent (every 2 days or so), but that may just be coincidence.
If it is hardware - is this repairable at all? Or is there anything Netgate can do?
-
Open a ticket with the device details: https://go.netgate.com/
Steve