SG-3100 hangs every 1-2 days



  • Hi,

    I have several Netgate SG-3100 units.

    One of those units appears to lock up or hang every 1-2 days, which is quite strange.

    It is running 2.5.0-devel (as are several other units), but I'm not clear whether it's a hardware issue or a software issue. I have updated to the latest development builds, and the issue persists.

    Symptoms:

    1. All LAN hosts lose connectivity.
    2. Unable to connect to the pfSense Web UI.
    3. If I connect via the Mini-USB console plug, the unit appears unresponsive.
    4. The front lights are flashing sequentially in blue, and the rear LAN activity lights are flashing.

    I have excerpted /var/log/system.log below. The router appears to have hung sometime around 02:54, and we rebooted it at 06:10 (pulling the power plug and reinserting).

    Oct 15 02:42:59 foobarrouter sshguard[90086]: Attack from "167.99.131.243" on service SSH with danger 2.
    Oct 15 02:47:05 foobarrouter sshd[41617]: Did not receive identification string from 222.161.223.147 port 43260
    Oct 15 02:47:05 foobarrouter sshguard[90086]: Attack from "222.161.223.147" on service SSH with danger 10.
    Oct 15 02:48:30 foobarrouter sshd[16776]: Unable to negotiate with 218.92.0.185 port 37838: no matching key exchange method found. Their offer: diffie-hellman-group1-sha1,diffie-hellman-group14-sha1,diffie-hellman-group-exchange-sha1 [preauth]
    Oct 15 02:48:30 foobarrouter sshguard[90086]: Attack from "218.92.0.185" on service SSH with danger 10.
    Oct 15 02:50:46 foobarrouter sshd[93530]: Connection closed by 51.210.14.124 port 53144 [preauth]
    Oct 15 02:50:46 foobarrouter sshguard[90086]: Attack from "51.210.14.124" on service SSH with danger 2.
    Oct 15 02:52:20 foobarrouter sshd[33215]: Connection closed by 167.172.52.225 port 42462 [preauth]
    Oct 15 02:52:20 foobarrouter sshguard[90086]: Attack from "167.172.52.225" on service SSH with danger 2.
    Oct 15 02:53:05 foobarrouter sshd[69645]: Unable to negotiate with 122.194.229.122 port 59920: no matching key exchange method found. Their offer: diffie-hellman-group1-sha1,diffie-hellman-group14-sha1,diffie-hellman-group-exchange-sha1 [preauth]
    Oct 15 02:53:05 foobarrouter sshguard[90086]: Attack from "122.194.229.122" on service SSH with danger 10.
    Oct 15 02:54:14 foobarrouter rc.gateway_alarm[7799]: >>> Gateway alarm: OPT8GW (Addr:110.175.245.125 Alarm:1 RTT:14.945ms RTTsd:4.263ms Loss:22%)
    Oct 15 02:54:14 foobarrouter check_reload_status[386]: updating dyndns OPT8GW
    Oct 15 02:54:14 foobarrouter check_reload_status[386]: Restarting ipsec tunnels
    Oct 15 02:54:14 foobarrouter check_reload_status[386]: Restarting OpenVPN tunnels/interfaces
    Oct 15 02:54:14 foobarrouter check_reload_status[386]: Reloading filter
    Oct 15 06:20:02 foobarrouter syslogd: kernel boot file is /boot/kernel/kernel
    Oct 15 06:20:02 foobarrouter kernel: ---<<BOOT>>---
    Oct 15 06:20:02 foobarrouter kernel: Copyright (c) 1992-2020 The FreeBSD Project.
    Oct 15 06:20:02 foobarrouter kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    Oct 15 06:20:02 foobarrouter kernel:         The Regents of the University of California. All rights reserved.
    Oct 15 06:20:02 foobarrouter kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.
    Oct 15 06:20:02 foobarrouter kernel: FreeBSD 12.2-STABLE 56359d090cf(factory-devel-12) pfSense-SG-3100 arm
    Oct 15 06:20:02 foobarrouter kernel: FreeBSD clang version 10.0.1 (git@github.com:llvm/llvm-project.git llvmorg-10.0.1-0-gef32c611aa2)
    Oct 15 06:20:02 foobarrouter kernel: CPU: ARM Cortex-A9 r4p1 (ECO: 0x00000000)
    Oct 15 06:20:02 foobarrouter kernel: CPU Features: 
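
    One rough way to pin down when the unit stopped logging is to look for the longest silence between consecutive syslog timestamps. The sketch below does that in Python; the function names are hypothetical, and the year (2020, which syslog lines do not carry) is an assumption passed in as a default.

```python
from datetime import datetime, timedelta

def parse_stamp(line, year=2020):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp; None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def largest_gap(lines, year=2020):
    """Return (gap, last_line_before, first_line_after) for the longest silence."""
    best = (timedelta(0), None, None)
    prev_time, prev_line = None, None
    for raw in lines:
        line = raw.strip()
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue  # skip lines that do not start with a timestamp
        if prev_time is not None and stamp - prev_time > best[0]:
            best = (stamp - prev_time, prev_line, line)
        prev_time, prev_line = stamp, line
    return best

# Three lines taken from the excerpt above
sample = [
    "Oct 15 02:54:14 foobarrouter check_reload_status[386]: Restarting OpenVPN tunnels/interfaces",
    "Oct 15 02:54:14 foobarrouter check_reload_status[386]: Reloading filter",
    "Oct 15 06:20:02 foobarrouter syslogd: kernel boot file is /boot/kernel/kernel",
]
gap, before, after = largest_gap(sample)
print(gap)     # 3:25:48
print(before)
print(after)
```

    On the excerpt above, the longest silence runs from the "Reloading filter" entry at 02:54:14 to the first boot message at 06:20:02.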
    


  • @victorhooi said in SG-3100 hangs every 1-2 days:

    (pulling the power plug and reinserting)

    This step is advisable.

    Edit: and give sshguard a break. Remove all firewall "SSH" rules on WAN; only an incoming VPN rule should be there.
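
    To get a feel for how much SSH traffic is reaching the WAN, the sshguard lines in system.log can be tallied per source address. This is only a sketch: the regular expression is written against the log lines quoted in the first post, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   sshguard[90086]: Attack from "218.92.0.185" on service SSH with danger 10.
ATTACK = re.compile(
    r'sshguard\[\d+\]: Attack from "([^"]+)" on service (\S+) with danger (\d+)'
)

def tally_attacks(lines):
    """Sum the sshguard danger score per source address."""
    scores = Counter()
    for line in lines:
        match = ATTACK.search(line)
        if match:
            scores[match.group(1)] += int(match.group(3))
    return scores

sample = [
    'Oct 15 02:42:59 foobarrouter sshguard[90086]: Attack from "167.99.131.243" on service SSH with danger 2.',
    'Oct 15 02:47:05 foobarrouter sshd[41617]: Did not receive identification string from 222.161.223.147 port 43260',
    'Oct 15 02:47:05 foobarrouter sshguard[90086]: Attack from "222.161.223.147" on service SSH with danger 10.',
    'Oct 15 02:48:30 foobarrouter sshguard[90086]: Attack from "218.92.0.185" on service SSH with danger 10.',
]
for address, score in tally_attacks(sample).most_common():
    print(address, score)
```

    Once the WAN SSH rules are removed, the same tally should come back empty.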


  • Netgate Administrator

    Connect the serial console to something and log that if you can. That will show if something is throwing an error and causing the reboot.

    Do you have the watchdog enabled in System > Advanced > Misc?
    If you disable it, does it just hang rather than reboot? That might show something at the console if you can't log all its output.

    And, yeah, you should lock down the WAN from all those drive-by SSH attempts!

    Steve
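
    When logging the console, it helps to timestamp each line as it arrives, so that a silent hang still shows when the last output was seen. A minimal sketch in Python, assuming the console output is available as a line-oriented stream; `stamp_lines` is a hypothetical helper, not part of pfSense or any console tool.

```python
import io
from datetime import datetime

def stamp_lines(stream, out, now=datetime.now):
    """Copy each line from stream to out, prefixed with the time it arrived."""
    count = 0
    for line in stream:
        out.write(f"{now().isoformat(timespec='seconds')} {line.rstrip()}\n")
        out.flush()  # keep the log current even if the capture is interrupted
        count += 1
    return count

# Demonstration with an in-memory stream and a fixed clock
demo_in = io.StringIO("Enter an option:\nGeneral initialization - Version: 1.0.0\n")
demo_out = io.StringIO()
stamp_lines(demo_in, demo_out, now=lambda: datetime(2020, 10, 15, 6, 20, 2))
print(demo_out.getvalue(), end="")
```

    In real use, `stream` would be `sys.stdin` fed by whatever terminal program is attached to the console port, and `out` a file opened in append mode; both depend on the capture host's setup.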



  • I ran a filesystem check per the video link and the online documentation. Output seems to be clean from what I can see:

    Enter full pathname of shell or RETURN for /bin/sh: 
    # fsck -fy /
    ** /dev/diskid/DISK-CEF032182700245s2a
    ** Last Mounted on /
    ** Root file system
    ** Phase 1 - Check Blocks and Sizes
    ** Phase 2 - Check Pathnames
    ** Phase 3 - Check Connectivity
    ** Phase 4 - Check Reference Counts
    ** Phase 5 - Check Cyl groups
    SUMMARY INFORMATION BAD
    SALVAGE? yes
    
    FREE BLK COUNT(S) WRONG IN SUPERBLK
    SALVAGE? yes
    
    BLK(S) MISSING IN BIT MAPS
    SALVAGE? yes
    
    33074 files, 440466 used, 6993345 free (26281 frags, 870883 blocks, 0.4% fragmentation)
    
    ***** FILE SYSTEM IS CLEAN *****
    
    ***** FILE SYSTEM WAS MODIFIED *****
    # fsck -fy /
    ** /dev/diskid/DISK-CEF032182700245s2a
    ** Last Mounted on /
    ** Root file system
    ** Phase 1 - Check Blocks and Sizes
    ** Phase 2 - Check Pathnames
    ** Phase 3 - Check Connectivity
    ** Phase 4 - Check Reference Counts
    ** Phase 5 - Check Cyl groups
    33074 files, 440466 used, 6993345 free (26281 frags, 870883 blocks, 0.4% fragmentation)
    
    ***** FILE SYSTEM IS CLEAN *****
    # fsck -fy /
    ** /dev/diskid/DISK-CEF032182700245s2a
    ** Last Mounted on /
    ** Root file system
    ** Phase 1 - Check Blocks and Sizes
    ** Phase 2 - Check Pathnames
    ** Phase 3 - Check Connectivity
    ** Phase 4 - Check Reference Counts
    ** Phase 5 - Check Cyl groups
    33074 files, 440466 used, 6993345 free (26281 frags, 870883 blocks, 0.4% fragmentation)
    
    ***** FILE SYSTEM IS CLEAN *****
    # fsck -fy /
    ** /dev/diskid/DISK-CEF032182700245s2a
    ** Last Mounted on /
    ** Root file system
    ** Phase 1 - Check Blocks and Sizes
    ** Phase 2 - Check Pathnames
    ** Phase 3 - Check Connectivity
    ** Phase 4 - Check Reference Counts
    ** Phase 5 - Check Cyl groups
    33074 files, 440466 used, 6993345 free (26281 frags, 870883 blocks, 0.4% fragmentation)
    
    ***** FILE SYSTEM IS CLEAN *****
    # fsck -fy /
    ** /dev/diskid/DISK-CEF032182700245s2a
    ** Last Mounted on /
    ** Root file system
    ** Phase 1 - Check Blocks and Sizes
    ** Phase 2 - Check Pathnames
    ** Phase 3 - Check Connectivity
    ** Phase 4 - Check Reference Counts
    ** Phase 5 - Check Cyl groups
    33074 files, 440466 used, 6993345 free (26281 frags, 870883 blocks, 0.4% fragmentation)
    
    ***** FILE SYSTEM IS CLEAN *****
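
    For longer captures, the runs can be summarized mechanically. The sketch below (Python, with hypothetical function names) splits a console capture at each `# fsck` prompt and records whether that run ended clean and whether it modified the filesystem. It relies only on the marker strings visible in the output above.

```python
def summarize_fsck(text):
    """Split a console capture into fsck runs and report each run's outcome."""
    runs = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# fsck"):
            runs.append({"clean": False, "modified": False})
        elif runs and "FILE SYSTEM IS CLEAN" in line:
            runs[-1]["clean"] = True
        elif runs and "FILE SYSTEM WAS MODIFIED" in line:
            runs[-1]["modified"] = True
    return runs

def settled(runs):
    """True when the last run is clean and made no modifications."""
    return bool(runs) and runs[-1]["clean"] and not runs[-1]["modified"]

capture = """\
# fsck -fy /
** Phase 5 - Check Cyl groups
SUMMARY INFORMATION BAD
SALVAGE? yes
***** FILE SYSTEM IS CLEAN *****
***** FILE SYSTEM WAS MODIFIED *****
# fsck -fy /
** Phase 5 - Check Cyl groups
***** FILE SYSTEM IS CLEAN *****
"""
runs = summarize_fsck(capture)
print(runs)
print(settled(runs))
```

    Applied to the output above, the first run is the only one marked as modified; the four runs after it are clean with no changes.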
    

    I have checked: I do have the watchdog enabled, and it is set to 128 seconds. (This appears to be the default, as I don't believe I've ever changed this setting.)

    dddd3c14-d003-487e-8479-1c6da8beb790-image.png

    When the issue occurs, the SG-3100 doesn't reboot. It simply hangs: the activity lights keep flashing, but there's no longer any internet connectivity for LAN hosts, and you can't SSH in or connect to the web interface.

    I've plugged it into a console server to log the serial output, but it doesn't seem to log anything when the issue happens. The next line after the pfSense console prompt is, I believe, from the startup (after we've power-cycled the unit):

     WIFI (opt9)     -> mvneta1.65 -> v4: 10.7.65.1/23
     PRINTERS (opt10) -> mvneta1.148 -> v4: 10.7.148.1/24
     HIKVISION_DEFAULT (opt11) -> mvneta1.99 -> v4: 192.168.1.1/24
     0) Logout (SSH only)                  9) pfTop
     1) Assign Interfaces                 10) Filter Logs
     2) Set interface(s) IP address       11) Restart webConfigurator
     3) Reset webConfigurator password    12) PHP shell + pfSense tools
     4) Reset to factory defaults         13) Update from console
     5) Reboot system                     14) Disable Secure Shell (sshd)
     6) Halt system                       15) Restore recent configuration
     7) Ping host                         16) Restart PHP-FPM
     8) Shell
    Enter an option:
    General initialization - Version: 1.0.0
    AVS selection from EFUSE disabled (Skip reading EFUSE values)
    Overriding default AVS value to: 0x23
    Detected Device ID 6820
    High speed PHY - Version: 2.0
    Init Customer board board SerDes lanes topology details:
     | Lane # | Speed|    Type     |
     ------------------------------|
     |   0    |  3   |  SATA0      |
     |   1    |  5   |  PCIe0      |
     |   2    |  3   |  SATA1      |
    

    Is it an issue with the unit or something else?


  • Netgate Administrator

    With no output at all and the console itself unresponsive, it does start to look more like a hardware issue. However, you are running a 2.5 snapshot, so you might be hitting a hard software lock somehow.

    I would re-install 2.4.5p1 as a next step to test that.

    Steve

