Netgate Discussion Forum

    SG5100 halts every 10 days

    General pfSense Questions
      scratchydog

      Hi,

      I have an SG5100 running the latest 2.4.5_p1. The following issue also existed on the previous SW version.

      Approximately every 10 days, the system seems to halt/crash. Throughput for hosts on the networks stops, and hosts routing via the firewall generally fail to communicate. Pinging the firewall gives intermittent responses, cycling between 'host down' and some normal replies. Normal ping response from the host I test on is 0.2 ms, versus 2 or 3 ms when the firewall has this issue (so I'm assuming some functionality is still there).
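One way to put a timestamp on the moment the box enters this degraded state is to watch ping results from a LAN host and flag when replies go missing or the round-trip time climbs well above the normal ~0.2 ms. The following is a minimal Python sketch, not a pfSense tool: the function names, the 5x slowdown factor and the 10% loss threshold are illustrative assumptions, and the regex assumes the usual `time=0.213 ms` field printed by ping on BSD, Linux and macOS.

```python
import re
import statistics

# Matches the per-reply "time=0.213 ms" field printed by BSD/Linux/macOS ping.
RTT_RE = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_rtt(line):
    """Return the RTT in ms from one ping output line, or None if the line has no reply time."""
    match = RTT_RE.search(line)
    return float(match.group(1)) if match else None


def classify(samples, baseline_ms=0.2, slow_factor=5.0, max_loss=0.1):
    """Classify a window of ping samples (RTT in ms, or None for a lost probe).

    Thresholds are illustrative assumptions: 'degraded' means more than
    max_loss of the probes were lost, or the median RTT exceeds
    baseline_ms * slow_factor.
    """
    if not samples:
        raise ValueError("need at least one sample")
    replies = [s for s in samples if s is not None]
    if not replies:
        return "down"
    loss = 1 - len(replies) / len(samples)
    if loss > max_loss or statistics.median(replies) > baseline_ms * slow_factor:
        return "degraded"
    return "ok"
```

Feed `classify` only the per-probe lines (replies and timeouts), e.g. `classify([parse_rtt(l) for l in probe_lines])`; header and summary lines would otherwise be counted as lost probes.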

      Access to the GUI during this time either fails with 'host timed out', or the GUI page eventually comes up after a long time loading. Logging into the GUI can be attempted, but the login never finishes and the page hangs.

      Access via SSH is similar to the GUI. The login prompt loads and lets me enter the password, but after that the FW never finishes logging the user in, and I never get to the command menu.

      Access via serial is very similar to SSH. The serial console connects fine and shows the login prompt, but entering the username/password results in a hang and a blank screen in the vty session.

      At this point I basically run out of options and have to restart by holding down the power button, which successfully restarts the device. It boots fine, with no warnings visible on the console (if I watch the boot via serial) or in the logs when I examine them after boot via the GUI or SSH.

      I noticed this last time that a Suricata update was the last thing in the FW log. However, the FW kept working fine for about 8 hours after the last entry in the FW system log before it crashed as described above.
      Disk space sits at around 30%
      CPU 3%
      Memory 60%

      Obviously I can't see what it's doing when it crashes, as it were.

      Running Suricata, ntopNG, OpenVPN.

      To have any idea what the issue is, I'll have to find a way to capture logs during the crash, so ideas on how best to do this are most welcome. I'm also keen to know if there are any known bugs or other things to check.
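Since the box's own logs stop well before the hang, the most useful data is whatever leaves the box while it is still alive. pfSense can forward its own logs to a remote syslog server (remote logging options under Status > System Logs > Settings), which is the simplest first step. On top of that, a small heartbeat could send a one-line status summary off-box every minute, so the last heartbeat received marks roughly when the system stopped working. Below is a minimal Python sketch, assuming a Python 3 interpreter is available on the firewall; the collector address, port and function names are hypothetical.

```python
import logging
import logging.handlers
import os
import shutil
import socket
import time


def snapshot(path="/"):
    """One-line summary of disk usage and load average on this host."""
    usage = shutil.disk_usage(path)
    disk_pct = 100.0 * usage.used / usage.total
    load1, load5, load15 = os.getloadavg()  # Unix-only; FreeBSD and Linux provide it
    return (f"heartbeat host={socket.gethostname()} disk={disk_pct:.1f}% "
            f"load={load1:.2f},{load5:.2f},{load15:.2f}")


def make_logger(collector_host, port=514):
    """Logger that sends each record via UDP syslog to a collector on another machine."""
    logger = logging.getLogger("heartbeat")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        logger.addHandler(logging.handlers.SysLogHandler(address=(collector_host, port)))
    return logger


def run(collector_host, interval_s=60):
    """Send a heartbeat every interval_s seconds until the process is killed."""
    logger = make_logger(collector_host)
    while True:
        logger.info(snapshot())
        time.sleep(interval_s)
```

Usage would be something like `run("192.168.1.50")`, where that address stands in for a syslog collector on the LAN.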

      I could always just schedule a restart every 7 days, but that is just a workaround and will cause other issues.

      Thanks in advance for any comments / pointers.

        bmeeks @scratchydog

        The first thing I would do to troubleshoot is remove (or deactivate) Suricata and ntopNG. I assume OpenVPN is more critical to your operation, so leave that. You can certainly run for a few days without Suricata and ntopNG. See how the stability is then. If you still have crashes, then hardware is the most likely suspect. If things run smoothly, then add back just one of the packages and rinse and repeat. You need to isolate things to find the cause.
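With a crash roughly every 10 days, "a few days" without a crash says little on its own. If crashes are treated as random arrivals at a steady average rate (a Poisson-process assumption, not something established in this thread), the chance of seeing no crash in t days is exp(-t / 10). The sketch below puts numbers on that and lays out a one-package-at-a-time schedule; the package list, the cumulative re-adding and the 5% threshold are illustrative choices.

```python
import math


def p_no_crash(days, mean_interval_days):
    """Probability of zero crashes in `days`, assuming Poisson-distributed crashes."""
    return math.exp(-days / mean_interval_days)


def days_needed(mean_interval_days, alpha=0.05):
    """Crash-free days before 'no crash yet' would happen by luck with probability <= alpha."""
    return -mean_interval_days * math.log(alpha)


def plan(packages, mean_interval_days=10, alpha=0.05):
    """Baseline stage with no packages, then re-add one package per stage (cumulatively)."""
    stage_days = math.ceil(days_needed(mean_interval_days, alpha))
    stages = [("baseline (no packages)", stage_days)]
    installed = []
    for pkg in packages:
        installed.append(pkg)
        stages.append((" + ".join(installed), stage_days))
    return stages
```

Under these assumptions, three crash-free days would occur about 74% of the time even if nothing had changed, while roughly 30 crash-free days are needed before that outcome would be down to about a 5% fluke; `plan(["Suricata", "ntopng"])` yields three 30-day stages.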

          scratchydog @bmeeks

          @bmeeks
          I will certainly give that a go and remove the packages.
          I did this before, when I was getting this issue on the previous pfSense version, but it would still crash with just the firewall and OpenVPN running.

          I previously had a similar issue: the FW crashed and showed the red LED on the front, and the power button would not power down the device. Power cycling just resulted in the red light and a failed boot. I ended up restoring the OS via the serial cable.
          Since the OS reload and the update to the latest pfSense version, the current situation has been happening. It is very similar to the earlier crashes, but without the red LED, which from what I see in the manual possibly indicates a full system crash or the FW booting.

          I am wondering if anyone has experienced this type of thing where the cause turned out to be RAM? Of all the hardware items, that would be the easiest to replace. I also assume any replacement needs to be ECC.

            bmeeks @scratchydog

            Might be RAM or it could even be a capacitor on the board. Lots of failure points when you get to hardware. Since you say it failed on you in the past with a plain vanilla install and no packages, I would suspect either hardware or power issues. Is the box on a UPS? Are there (or could there have been) unexpected power interruptions or blips? The SG-5100 and any other appliance is, at its heart, a running computer. And any running computer is very picky about improper shutdowns (as in the power blinking out). That can leave the disk corrupted, even a solid-state disk.

              scratchydog @bmeeks

              @bmeeks
              I'll try with no packages installed again.

              I have the FW on a large UPS, so power is Rock Solid since the day I purchased the FW.
              The room temperature is controlled to about 23 °C as well, so it has never gotten hot to my knowledge.

              I'll try without the packages first, and then the RAM. If neither helps, it's probably time for the bin.
              Bit of a shame, as it should otherwise be a reasonably impressive device, with a reasonably impressive purchase price to match.

                gabacho4 Rebel Alliance @scratchydog

                @scratchydog another thing I might try is to reinstall pfSense and not restore your previous config. Obviously it's a bit of work to get things configured again, but perhaps something in your current config is causing the issue? I have had issues in the past myself with an SG-2440 just going unresponsive, and after convincing myself to bite the bullet and redo the config from scratch, I never had an issue again. I had brought my configuration over from a Protectli device, so I think either something about the config didn't agree with the 2440, or I screwed up the initial config on the Protectli device and the 2440 was merely crashing as a result of that error. Did you bring your configuration over from another pfSense device you ran before the SG-5100? I run two 5100s with no issues, but only use the pfBlocker and OpenVPN client export packages. BTW, which light on the 5100 is red? There are 3 LEDs, each corresponding to a different function, if I recall correctly.

                  gabacho4 Rebel Alliance @gabacho4

                  @gabacho4 btw, your issue is very similar to what I used to have: crappy internet speed but still working, no SSH, no web UI. Unfortunately, I would have to do a completely new install and configuration to get my router working again. A restart did nothing.

                    Gertjan @scratchydog

                    @scratchydog said in SG5100 halts every 10 days:

                    I have the FW on a large UPS, so power is Rock Solid since the day I purchased the FW.

                    Still, your power is messy: it's you rebooting the system the bad way, by removing the power.

                    This video is very important for anyone who uses devices with file systems. That makes us a couple of billion people, you included.
                    Every time you have to hard-reboot your pfSense, use console access to check whether the file system is clean. If it's dirty, you can have issues that show up right away, or later on.

                    @scratchydog said in SG5100 halts every 10 days:

                    a reasonably impressive device

                    I've seen pfSense installed on boxes as small as a packet of Marlboro, and on 19-inch rack equipment.
                    The size or price of the box doesn't determine pfSense's uptime. Both can run for weeks, months, probably years (which is bad: it means the admin isn't updating), or minutes. It depends mostly on the settings and what the box is used for.

                    When you log into pfSense, using https://your-pfsense.local.tld, you are redirected to the default page, this is normally the dashboard.
                    The dashboard shows information that is usually already available, in a file on the drive or from a process that is already running. But some information isn't cached, or the cached copy is considered 'old', so it gets refreshed. Some of it, for example the list of installed packages, is first reloaded from the package repository and compared with what you have installed.
                    This needs WAN access, and if the WAN is bad, building the dashboard page will take .... time.
                    An often-seen issue is that admins have nicely broken their DNS setup "because they have seen on the Internet that it has to be done like that", or are forwarding to DNS servers that are not viable. The result will be the same, and bad.
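A quick way to check the DNS side of this is to time name lookups, either from the firewall itself or from a LAN client that uses pfSense as its resolver. A minimal Python sketch, assuming Python 3 is available wherever it runs; the function name is hypothetical. A lookup for a well-known public name that takes seconds, or fails outright, points at the resolver/forwarder setup rather than at the GUI.

```python
import socket
import time


def time_lookup(hostname, port=443):
    """Resolve hostname with the system resolver; return (succeeded, seconds taken)."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, port)
        ok = True
    except socket.gaierror:
        ok = False
    return ok, time.monotonic() - start
```

For example, compare `time_lookup("localhost")` (answered locally) with `time_lookup("netgate.com")` (needs working upstream DNS).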

                    Test this access :
                    https://your-pfsense.local.tld/stats.php
                    or
                    https://your-pfsense.local.tld/status.php
                    as these only use local info.
                    There should be no delays related to WAN access.

                    @scratchydog said in SG5100 halts every 10 days:

                    the FW was working fine for about 8hrs after the last log entry in the FW system log before it crashed as per above.

                    That's very significant info!
                    What do you mean by "FW system log"? Only that log? Any log?
                    pfSense was running for 8 hours without a single log entry anywhere?? That's close to impossible. It says to me: pfSense was running, but it couldn't write anything any more.
                    Do this test: take your PC and fill its boot drive up to the last byte (there are free tools that can do just that in a couple of seconds). Run it, and admire the result. (Please check your backup status first, as you will need it.)
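The theory that pfSense kept forwarding but could no longer write can be tested directly rather than inferred from a silent log. The sketch below is an illustrative probe (not a pfSense tool; the function name is hypothetical) that could be run periodically against a log directory such as /var/log, ideally with its result also sent off-box so that a failure or a very slow write is recorded somewhere that survives the hang. It assumes Python 3 is available on the box.

```python
import os
import tempfile
import time


def write_probe(directory):
    """Create, write, fsync and delete a small file in `directory`.

    Returns (succeeded, seconds_taken). A False result, or a very slow
    fsync, suggests the filesystem can no longer accept writes normally.
    """
    start = time.monotonic()
    try:
        fd, path = tempfile.mkstemp(prefix=".write-probe-", dir=directory)
        try:
            os.write(fd, b"probe\n")
            os.fsync(fd)  # force the data to the device, not just the page cache
        finally:
            os.close(fd)
            os.remove(path)
    except OSError:
        return False, time.monotonic() - start
    return True, time.monotonic() - start
```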

                    @scratchydog said in SG5100 halts every 10 days:

                    Disk space sits at around 30%
                    CPU 3%
                    Memory 60%

                    Looks good; here's what I have:

                    [screenshot of my own figures]

                    Your memory usage is somewhat high, probably because you're using Suricata.

                    @scratchydog said in SG5100 halts every 10 days:

                    Obviously I can't see what its doing when it crashes as it where.

                    Yes, you can.

                    It's the 1 minute, 5 minutes, an hour or more before the 'shutdown' that need to be checked.
                    That's why there are all these logs.
                    Also go to Status > Monitoring, select System and Processor, and check for variations. Check other monitoring stats too, like WAN quality, system memory usage, etc.
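Once the logs are available as plain text (on 2.4.x they can be copied out of Status > System Logs), a short script can pull out the last few minutes before the hang and find the longest silence between entries, such as the 8-hour gap mentioned earlier in the thread. A minimal Python sketch; it assumes each line starts with a BSD-syslog style timestamp like `Jun 12 10:05:00`, takes the year as a parameter, ignores year rollover, and its function names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_ts(line, year):
    """Parse a leading 'Mon DD HH:MM:SS' timestamp (BSD syslog style, assumed)."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def parse_entries(lines, year):
    """Return (timestamp, line) pairs sorted by time, skipping lines without a timestamp."""
    entries = []
    for line in lines:
        try:
            entries.append((parse_ts(line, year), line))
        except ValueError:
            continue  # not a timestamped log line
    entries.sort(key=lambda entry: entry[0])  # allows merging several log files
    return entries


def largest_gap(entries):
    """Return (gap, line_before, line_after) for the longest silence between entries."""
    best = (timedelta(0), None, None)
    for (t0, before), (t1, after) in zip(entries, entries[1:]):
        if t1 - t0 > best[0]:
            best = (t1 - t0, before, after)
    return best


def tail_window(entries, minutes):
    """Lines logged within `minutes` before the final entry."""
    if not entries:
        return []
    cutoff = entries[-1][0] - timedelta(minutes=minutes)
    return [line for ts, line in entries if ts >= cutoff]
```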

                    I advise you to :
                    Use default settings (reset to defaults). Or even reinstall pfSense from scratch, as it is a nice 5-minute exercise (it's always good to practise the things you need to handle 'by the book' in emergency situations).
                    It's OK to add some static DHCP leases, OK to add users for VPN access, OK to fire up the VPN server (not client). OK to add the VPN export package (as it never runs anyway, only when the admin clicks "export a user").
                    If the system still behaves badly, reset again and do even less customisation.
                    If needed, only set up the WAN to make it work.
                    Still not OK? Chances are now that it is a hardware issue. Call support.

                    No "help me" PMs please. Use the forum, the community will thank you.
                    Edit: and where are the logs??

                      Gertjan @gabacho4

                      @gabacho4 said in SG5100 halts every 10 days:

                      Unfortunately I would have to do a completely new install

                      That's why you should keep it simple.
                      The combination of pfSense and humans is actually very unfortunate.
                      People like packages, gadgets, apps, etc., and keep adding them: we want to 'have' it.
                      And then the house of cards falls down and you have to rebuild it all again.

                      Edit: and if you need to have something, know how to maintain it and what it does. You should actually be able to do "by hand" what the package does. That way, if a package doesn't work, you'll be able to check it out and find the issue fast. Never ask scripts to do something you can't do yourself.

                      So, the rule was - and stays : keep it simple.
                      Thus easily reproducible.

                      Btw: setting up pfSense doesn't need tools, if you don't count the mouse and keyboard.
                      Get a daily copy of your config.xml (use this, for example).
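As one way to do this, here is a minimal Python sketch of a dated daily copy of config.xml with simple rotation. The source path /conf/config.xml is an assumption about where pfSense keeps its running configuration, the function names are hypothetical, and in practice the copies should end up on another machine so they survive a dead box.

```python
import shutil
from datetime import date
from pathlib import Path

# Assumed location of the running configuration on pfSense; adjust if different.
CONFIG_PATH = Path("/conf/config.xml")


def backup_config(dest_dir, src=CONFIG_PATH, day=None):
    """Copy the config to dest_dir/config-YYYY-MM-DD.xml (one file per day)."""
    day = day or date.today()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"config-{day.isoformat()}.xml"
    shutil.copy2(src, target)  # copies contents and timestamps
    return target


def prune_backups(dest_dir, keep=30):
    """Delete all but the newest `keep` daily copies; return the removed paths."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backups = sorted(Path(dest_dir).glob("config-*.xml"))  # ISO dates sort chronologically
    removed = backups[:-keep]
    for old in removed:
        old.unlink()
    return removed
```

Run once a day (from cron or similar), calling `backup_config(...)` followed by `prune_backups(..., keep=30)`.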

                        gabacho4 Rebel Alliance @Gertjan

                        @gertjan what I meant was that a simple solution like restarting the router didn't fix the problem. Something in the configuration went horribly wrong. I literally had to reinstall pfSense, then either configure it by hand or use one of my backups. Once I did it by hand, the issue never reoccurred, thus leading me to believe something about the way I configured the Protectli device was the problem.

                          Gertjan @gabacho4

                          @gabacho4 said in SG5100 halts every 10 days:

                          or use one of my backups.

                          Be careful with this one.

                          My pfSense and your pfSense are 100% identical,
                          because we use the same source. When copying bits, the result is known: your 1s and 0s are identical to mine.
                          What differs is your config.
                          So install it, assign WAN and LAN, set a password, make WAN work, and stop there. You've reached a save point. It should work for a long time.

                            scratchydog

                            Some very good suggestions and insight here.
                            Thanks to everyone who has taken the time to help with this. Much appreciated. This is a great community.

                            Netgate support also recommended running fsck from single-user mode to find and fix any errors (as was also mentioned here, many thanks). I completed this, although I'm not 100% sure it found any issues to fix.

                            ... yes, I agree: the forced power-downs when it freezes negate the benefit of having a UPS here. That could/will corrupt things a lot.

                            I will see if the fsck brings any success. If not, I think, as mentioned here, a full vanilla build from scratch may be needed. There is a big chance that keeping the same config across various base pfSense versions causes issues, and best practice would be to always rebuild the config from scratch on each upgrade. At the very least, you would want a config with zero packages as a vanilla base and work up from there.

                            Re RAM usage: I have set up RAM disks for /tmp and /var, so RAM usage is up a bit. The idea was that with ntopNG running, I/O on the SSD would be lowered, prolonging the disk's life. Not sure if the benefit is worth the effort?

                            Possibly I'm causing more issues than I'm solving.

                              scratchydog

                              One additional question: if I were to try some different RAM, I assume it's pointless to go for anything other than ECC RAM?

                                gabacho4 Rebel Alliance @scratchydog

                                @scratchydog ECC RAM is hard, if not nearly impossible, to find. I've seen posts from people saying they use non-ECC RAM just fine. If you search this forum or Google, you should find those threads.

                                Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.