Netgate Discussion Forum

    2x pfsense 24.11 hard crashes in under a week - Netgate 1537

    • J
      joekislo

      We have had two of our Netgate 1537s crash in under a week with the same symptoms. Posting here in case anybody else runs into something similar, since we have no explanation after >140 days of uptime on both units. Netgate support is stumped and has no explanation.

      We have two identical Netgate 1537s in an HA pair. A week ago our standby firewall became unpingable, both via its LACP connections and via the crossover cable between the two firewalls. The physical console (VGA connector/USB keyboard) showed months-old system messages but otherwise did not respond to keyboard input. Num Lock would toggle on/off when pressed. Control-Alt-Delete did not respond.

      When the firewall went offline our core switch suspended the port:
      Jul 23 21:01:18 sw-core sw-core-1: Jul 24 01:01:16.729: %EC-5-UNBUNDLE: STANDBY:Interface Te1/2/1 left the port-channel Po21
      Jul 23 21:01:19 sw-core sw-core-1: Jul 24 01:01:17.953: %EC-5-UNBUNDLE: STANDBY:Interface Te2/2/1 left the port-channel Po21
      Jul 23 21:01:19 sw-core sw-core-1: Jul 24 01:01:19.474: %EC-5-UNBUNDLE: Interface Te1/2/1 left the port-channel Po21
      Jul 23 21:01:21 sw-core sw-core-1: Jul 24 01:01:20.507: %EC-5-UNBUNDLE: Interface Te2/2/1 left the port-channel Po21
      Jul 23 21:01:25 sw-core sw-core-1: Jul 24 01:01:24.759: %EC-5-L3DONTBNDL2: Te1/2/1 suspended: LACP currently not enabled on the remote port.
      Jul 23 21:01:27 sw-core sw-core-1: Jul 24 01:01:26.059: %EC-5-L3DONTBNDL2: Te2/2/1 suspended: LACP currently not enabled on the remote port.
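Each switch line above carries two timestamps: one added by the syslog receiver and one stamped by the switch itself. The following is a minimal Python sketch for measuring the offset between them, which may help when lining these events up against the firewall's own log. It assumes the year is 2025 (the log lines omit it) and makes no claim about which time zone either clock uses.

```python
import re
from datetime import datetime

# Switch log excerpt copied from the post above.
SWITCH_LOG = """\
Jul 23 21:01:18 sw-core sw-core-1: Jul 24 01:01:16.729: %EC-5-UNBUNDLE: STANDBY:Interface Te1/2/1 left the port-channel Po21
Jul 23 21:01:19 sw-core sw-core-1: Jul 24 01:01:17.953: %EC-5-UNBUNDLE: STANDBY:Interface Te2/2/1 left the port-channel Po21
Jul 23 21:01:19 sw-core sw-core-1: Jul 24 01:01:19.474: %EC-5-UNBUNDLE: Interface Te1/2/1 left the port-channel Po21
Jul 23 21:01:21 sw-core sw-core-1: Jul 24 01:01:20.507: %EC-5-UNBUNDLE: Interface Te2/2/1 left the port-channel Po21
Jul 23 21:01:25 sw-core sw-core-1: Jul 24 01:01:24.759: %EC-5-L3DONTBNDL2: Te1/2/1 suspended: LACP currently not enabled on the remote port.
Jul 23 21:01:27 sw-core sw-core-1: Jul 24 01:01:26.059: %EC-5-L3DONTBNDL2: Te2/2/1 suspended: LACP currently not enabled on the remote port.
"""

# receiver timestamp, host, tag, device timestamp, mnemonic, free text
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ \S+: "
    r"(\w{3} +\d+ \d\d:\d\d:\d\d\.\d+): (%\S+): (.*)$"
)
YEAR = 2025  # assumption: syslog timestamps carry no year


def parse_switch_line(line):
    """Return (received_at, stamped_at, mnemonic, text) or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    received, stamped, mnemonic, text = m.groups()
    received_at = datetime.strptime(f"{YEAR} {received}", "%Y %b %d %H:%M:%S")
    stamped_at = datetime.strptime(f"{YEAR} {stamped}", "%Y %b %d %H:%M:%S.%f")
    return received_at, stamped_at, mnemonic, text


events = [e for e in map(parse_switch_line, SWITCH_LOG.splitlines()) if e]
# Offset between the two clocks, rounded to whole hours.
offsets_hours = [round((s - r).total_seconds() / 3600) for r, s, _, _ in events]
print(len(events), "events; clock offsets (h):", sorted(set(offsets_hours)))
```

Whichever clock is used as the reference, the same offset has to be applied before comparing these lines with timestamps from the firewall's syslog.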

      Although the console appeared unresponsive, it does seem like the system was partially alive. This was found in the syslog after we hard power cycled, and is consistent with the time period we had the crash cart plugged in mashing random keys/enter, before hitting the reset button:
      Jul 24 00:18:02 fw1 login[31213]: login on ttyv0 as root
      Jul 24 00:18:03 fw1 login[44763]: login on ttyv0 as root
      Jul 24 00:18:06 fw1 login[56358]: login on ttyv0 as root
      Jul 24 00:18:07 fw1 login[70702]: login on ttyv0 as root
      Jul 24 00:18:40 fw1 login[88262]: login on ttyv0 as root
      Jul 24 00:18:42 fw1 login[99619]: login on ttyv0 as root
      Jul 24 00:18:43 fw1 login[11870]: login on ttyv0 as root
      Jul 24 00:18:52 fw1 login[72783]: login on ttyv0 as root
      Jul 24 00:18:54 fw1 login[85355]: login on ttyv0 as root
      Jul 24 00:19:13 fw1 login[12470]: login on ttyv0 as root
      Jul 24 00:19:14 fw1 login[25801]: login on ttyv0 as root
      Jul 24 00:19:20 fw1 syslogd: exiting on signal 15
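To make the timeline in that excerpt easier to read, here is a small Python sketch that parses the lines and summarizes them. It is only an illustration: the year 2025 is an assumption (the log lines omit it), and the field names are made up for this example.

```python
import re
from datetime import datetime

# Console login excerpt copied from the post above.
CONSOLE_LOG = """\
Jul 24 00:18:02 fw1 login[31213]: login on ttyv0 as root
Jul 24 00:18:03 fw1 login[44763]: login on ttyv0 as root
Jul 24 00:18:06 fw1 login[56358]: login on ttyv0 as root
Jul 24 00:18:07 fw1 login[70702]: login on ttyv0 as root
Jul 24 00:18:40 fw1 login[88262]: login on ttyv0 as root
Jul 24 00:18:42 fw1 login[99619]: login on ttyv0 as root
Jul 24 00:18:43 fw1 login[11870]: login on ttyv0 as root
Jul 24 00:18:52 fw1 login[72783]: login on ttyv0 as root
Jul 24 00:18:54 fw1 login[85355]: login on ttyv0 as root
Jul 24 00:19:13 fw1 login[12470]: login on ttyv0 as root
Jul 24 00:19:20 fw1 login[25801]: placeholder
"""
# Restore the last two lines exactly as posted (kept separate for clarity).
CONSOLE_LOG = CONSOLE_LOG.replace(
    "Jul 24 00:19:20 fw1 login[25801]: placeholder\n",
    "Jul 24 00:19:14 fw1 login[25801]: login on ttyv0 as root\n"
    "Jul 24 00:19:20 fw1 syslogd: exiting on signal 15\n",
)

# timestamp, host, program, optional [pid], message
ENTRY_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) ([^\s\[:]+)(?:\[(\d+)\])?: (.*)$"
)
YEAR = 2025  # assumption: syslog timestamps carry no year


def parse_entry(line):
    """Parse one syslog line into a dict, or return None if it does not match."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    stamp, host, program, pid, message = m.groups()
    when = datetime.strptime(f"{YEAR} {stamp}", "%Y %b %d %H:%M:%S")
    return {"when": when, "host": host, "program": program,
            "pid": pid, "message": message}


entries = [e for e in map(parse_entry, CONSOLE_LOG.splitlines()) if e]
logins = [e for e in entries if e["program"] == "login"]
shutdown = next(e for e in entries
                if e["program"] == "syslogd" and "exiting on signal" in e["message"])

login_span_s = (logins[-1]["when"] - logins[0]["when"]).total_seconds()
gap_to_exit_s = (shutdown["when"] - logins[-1]["when"]).total_seconds()
distinct_pids = len({e["pid"] for e in logins})

print(f"{len(logins)} logins over {login_span_s:.0f}s, "
      f"{distinct_pids} distinct PIDs, "
      f"syslogd exit {gap_to_exit_s:.0f}s after the last login")
```

On this excerpt the summary is 11 root logins on ttyv0 within 72 seconds, each from a different PID, with syslogd exiting 6 seconds after the last one.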

      Uptime was 141 days and 144 days for the two units, presumably counted from when we upgraded to 24.11.

      The units are in a colo facility, and we see no temperature deviations during that time. The primary and secondary units are plugged into different power sources. We do not have IPMI configured; when we can get physical access to the facility we'll likely check whether there's anything in the IPMI logs. I doubt this is a hardware issue, though, given that the unit was still capable of logging to syslog/disk.

      The units successfully failed over; however, not having any explanation gives us concern about the reliability of pfSense and the ability to diagnose a crash. I can't think of a time when I've lost a Cisco firewall and didn't have enough info, working with Cisco TAC, to figure out the cause (mostly software bugs). We have another identically configured facility with similar uptimes that has not shown this issue yet.

      A possible shot in the dark: we've been bitten pretty badly by the bsnmpd memory leak bug. Supposedly 24.11 fixed this, but as we've reported to Netgate, the upgrade fixed the FD leak but not the memory leak. Both units probably OOMed at some point and ran out of swap before the kernel killed bsnmpd to survive. We have since automated a weekly bsnmpd restart. We would have deployed this auto-restart at least 3 months ago. Perhaps we should have rebooted the units before now, but they were solid for probably 6 months since we realized bsnmpd was still leaking.
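For anyone trying to judge whether a weekly restart leaves enough headroom for a leaking daemon, here is a rough Python sketch that fits a straight line through memory samples and projects when a limit would be reached. The sample values and the 400 MB limit are hypothetical and chosen only for illustration; they are not measurements from these units.

```python
def fit_slope(samples):
    """Least-squares slope of y over x for a list of (x, y) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    den = sum((x - mean_x) ** 2 for x, _ in samples)
    return num / den


def hours_until(limit_mb, samples):
    """Projected hours from the latest sample until limit_mb is reached.

    Returns None when usage is flat or shrinking (no projected exhaustion).
    """
    slope = fit_slope(samples)
    if slope <= 0:
        return None
    _, latest_mb = samples[-1]
    return max(0.0, (limit_mb - latest_mb) / slope)


# Hypothetical (hours since restart, resident memory in MB) samples.
rss_samples = [(0, 100.0), (24, 112.0), (48, 124.0), (72, 136.0)]

growth_mb_per_hour = fit_slope(rss_samples)
remaining_hours = hours_until(400.0, rss_samples)
print(f"growth: {growth_mb_per_hour:.2f} MB/h, "
      f"projected hours to limit: {remaining_hours:.0f}")
```

If the projected time to the limit is comfortably longer than the restart interval, the scheduled restart should keep the process below it, assuming the growth stays roughly linear.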

      Any help appreciated!

      Thanks,
      -Joe

      • GertjanG
        Gertjan @joekislo

        @joekislo said in 2x pfsense 24.11 hard crashes in under a week - Netgate 1537:

        Both units probably at some point OOMED and ran out of swap before the kernel killed bsnmpd to survive

        As long as disk (partition) space isn't an issue, these kinds of events can be found in the system log.
        For memory (RAM), as soon as the system is running again, look at Status > Monitoring and select System => Memory.

        No "help me" PM's please. Use the forum, the community will thank you.
        Edit : and where are the logs ??

          • J
            joekislo @Gertjan

            @Gertjan No evidence of OOMs before the event, and there was plenty of disk space as well. FWIW the OOMs were months ago, and I sent all the details to Netgate. Memory has been stable since we started the weekly bsnmpd restarts.

            The dip is the unit hard reset.

            9fe0628d-aaa2-46c0-9ce2-3b6dcf498be8-image.png

            • stephenw10S
              stephenw10 Netgate Administrator

              Did you try entering Ctrl+T at the console when it wasn't responding? That can sometimes show output when nothing else will.

              Has this only happened once? On each node?

              • S
                SteveITS Rebel Alliance @joekislo

                @joekislo said in 2x pfsense 24.11 hard crashes in under a week - Netgate 1537:

                Jul 24 00:19:20 fw1 syslogd: exiting on signal 15

                Just to confirm you think this entry is when you hit the Reset button to reboot?

                We have a 4200 that put itself in standby (according to its LED) on its own, and logged that at the time, like it shut itself down. Haven't seen that anywhere/anytime else, though.

                Only install packages for your version, or risk breaking it. Select your branch in System/Update/Update Settings.
                When upgrading, allow 10-15 minutes to reboot, or more depending on packages, and device or disk speed.
                Upvote 👍 helpful posts!

                • J
                  jcleaves @SteveITS

                  @SteveITS Was there anything in the logs regarding going to Standby? I didn't see any power event logged.

                  • S
                    SteveITS Rebel Alliance @jcleaves

                    @jcleaves Nothing mentioned standby, shutdown, etc., in fact nothing for the previous hour before the "exiting." However we were running a RAM disk on it which we normally do. But the 4200 has a standby LED pattern. Our occurrence was in early June. TAC had us reinstall it.

                    Edit: I understand our situation may not be relevant here.

                    • stephenw10S
                      stephenw10 Netgate Administrator

                      If you press the ATX power button that's what you would see logged:

                      Jul 28 21:57:35 	php-fpm 	8456 	/index.php: Successful login for user 'admin' from: 172.21.16.8 (Local Database)
                      Jul 29 00:21:11 	syslogd 		exiting on signal 15
                      Jul 29 00:23:07 	syslogd 		kernel boot file is /boot/kernel/kernel
                      Jul 29 00:23:07 	kernel 		---<<BOOT>>---
                      Jul 29 00:23:07 	kernel 		Copyright (c) 1992-2024 The FreeBSD Project.
                      Jul 29 00:23:07 	kernel 		Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 
                      

                      You can disable that by setting the sysctl hw.acpi.power_button_state=none
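The pattern in that log (a syslogd "exiting on signal 15" line shortly before the boot marker) can be checked for mechanically. Below is a hedged Python sketch that labels each boot in a log as "clean" or "abrupt". The labels and the rule are assumptions made for this example: a boot counts as clean only when the last entry before the boot lines is the syslogd exit message, and the second sample log is hypothetical (the first one with the exit line removed).

```python
BOOT_MARKER = "---<<BOOT>>---"
CLEAN_EXIT = "exiting on signal 15"
# syslogd logs this line on startup, just before the kernel boot marker.
BOOT_PREAMBLE = "kernel boot file is"


def classify_boots(lines):
    """Return one label per boot marker: 'clean' or 'abrupt'."""
    results = []
    clean_exit_pending = False
    for line in lines:
        if BOOT_MARKER in line:
            results.append("clean" if clean_exit_pending else "abrupt")
            clean_exit_pending = False
        elif "syslogd" in line and CLEAN_EXIT in line:
            clean_exit_pending = True
        elif BOOT_PREAMBLE in line:
            continue  # part of the new boot, not evidence either way
        elif line.strip():
            clean_exit_pending = False  # other activity after the exit line
    return results


# Lines from the post above (whitespace normalized).
power_button_log = [
    "Jul 28 21:57:35 php-fpm 8456 /index.php: Successful login for user 'admin' from: 172.21.16.8 (Local Database)",
    "Jul 29 00:21:11 syslogd exiting on signal 15",
    "Jul 29 00:23:07 syslogd kernel boot file is /boot/kernel/kernel",
    "Jul 29 00:23:07 kernel ---<<BOOT>>---",
    "Jul 29 00:23:07 kernel Copyright (c) 1992-2024 The FreeBSD Project.",
]

# Hypothetical variant: same log with the syslogd exit line removed.
no_exit_log = [l for l in power_button_log if CLEAN_EXIT not in l]

print(classify_boots(power_button_log))
print(classify_boots(no_exit_log))
```

A "clean" result only says that syslogd received a termination signal before the reboot; it does not identify what initiated the shutdown.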

                      • S
                        SteveITS Rebel Alliance @stephenw10

                        @stephenw10 In our case a button was not pressed, per the one person in the office. It would be nice if it logged a "button push" event.

                        • J
                          jcleaves @stephenw10

                          @stephenw10 This was definitely not a button push on ours either. Both units are in locked cabinets in a colo. Any access to the facility is logged.

                          @SteveITS As for it going to standby or hibernating, the person who went on site said the LEDs were normal. Nothing indicated a state change or issue.

                          Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.