Netgate Discussion Forum

    Fatal error - regularly

    • J
      JDK

      This is a duplicate of post 115893, reposted to change the post's ownership.

      First of all, thank you for all your effort; that a corporation like mine can rely on your products for two separate solutions is pretty awesome!

      Now, to

      My setup:

      I have a pair of pfSense boxes running on two Dell blades (iDRACs), PowerEdge R210 II. Each has a virtual bridge interface between WAN and LAN and functions as a bridging firewall. They are made redundant via STP, so the path through the secondary firewall is cut whenever the primary firewall is responding with BPDU packets.

      Hardware:

      igb0-3 (the bridged interfaces):
      Intel(R) PRO/1000 Network Connection version - 2.4.0
      Using MSIX interrupts with 5 vectors

      Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz
      Current: 3100 MHz, Max: 3101 MHz
      4 CPUs: 1 package(s) x 4 core(s)

      And my build:
      2.2.6-RELEASE (amd64)
      built on Mon Dec 21 14:50:08 CST 2015
      FreeBSD 10.1-RELEASE-p25

      The problem:

      Every second or third day, the primary firewall crashes and fails over to the secondary. I have attached a text file with a dump.
      I noticed the following message, although I am not 100% sure how to interpret it:

      Fatal trap 12: page fault while in kernel mode
      cpuid = 2; apic id = 04
      fault virtual address    = 0x1d
      fault code        = supervisor read data, page not present
      instruction pointer    = 0x20:0xffffffff80b904b7
      stack pointer            = 0x28:0xfffffe001a3d06c0
      frame pointer            = 0x28:0xfffffe001a3d0740
      code segment        = base 0x0, limit 0xfffff, type 0x1b
                  = DPL 0, pres 1, long 1, def32 0, gran 1
      processor eflags    = interrupt enabled, resume, IOPL = 0
      current process        = 12 (irq276: igb2:que 2)
      version.txt:
      FreeBSD 10.1-RELEASE-p25 #0 c39b63e(releng/10.1)-dirty: Mon Dec 21 15:20:13 CST 2015
          root@pfs22-amd64-builder:/usr/obj.RELENG_2_2.amd64/usr/pfSensesrc/src.RELENG_2_2/sys/pfSense_SMP.10
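
For anyone reading the trap output above, here is a small Python sketch that pulls the fields out of the text and flags two things worth noticing. It assumes the `key = value` layout shown in the dump. `parse_trap` and the 4096-byte threshold are illustrative choices for this sketch, not part of any pfSense or FreeBSD tool. A fault address this close to zero is often what a NULL pointer dereference at a small structure offset looks like, but that is a heuristic, not a diagnosis.

```python
import re

# Lines copied from the trap output quoted in this post.
TRAP = """\
Fatal trap 12: page fault while in kernel mode
fault virtual address    = 0x1d
fault code        = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff80b904b7
current process        = 12 (irq276: igb2:que 2)
"""

def parse_trap(text):
    """Split each 'key = value' line into a dictionary entry."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

fields = parse_trap(TRAP)

# Heuristic (an assumption of this sketch): an address inside the first
# 4096 bytes usually means a small offset from a NULL pointer.
fault_addr = int(fields["fault virtual address"], 16)
near_null = fault_addr < 4096

# Decode the interrupt thread name: IRQ number, driver, unit and queue.
m = re.search(r"irq(\d+): ([a-z]+)(\d+):que (\d+)", fields["current process"])
irq, driver, unit, queue = int(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4))

print(f"fault address {fault_addr:#x}, near NULL: {near_null}")
print(f"faulting thread: irq {irq}, {driver}{unit} queue {queue}")
```

The `current process` line decodes to the interrupt thread for queue 2 of igb2. That is the context the fault occurred in, which is not necessarily where the underlying bug lives.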

      Observations:

      I have monitored traffic on the inside (LAN) interfaces of the firewalls, and you can see two attached images of our primary and secondary firewalls.
      On the graphs, "outbound" means outbound from the firewall via the LAN-interface, i.e. from WAN to LAN.

      First, I have attached an image of what I believe to be a precursor:

      Normally, I expect equal amounts of traffic on both firewalls, as they function as bridges and simply pass on all packets (firewalled, of course). Packets are blocked by STP on a later switch on the WAN side. On the "precursor" graphs, we see a sudden spike in traffic on the primary firewall only, after which traffic flows unevenly. The spike is around 200 Mbit/s, a level also observed in other "precursors".

      Next, I have attached an image of the actual crash:

      About an hour or two later, everything looks fine, except that the primary firewall just "disappears" on the graphs all of a sudden. This is because of the kernel crash.

      Now I do not know if the spikes and the crashes are even related - they may not be. I just found it odd. Especially since this abnormality has been observed more than once. See the file "another-crash".

      Diagnosis?:

      Since the crash report says "current process        = 12 (irq276: igb2:que 2)", I suspect the cause may be that our TCP queue length on the WAN interface (igb2) is insufficient, and that an overfull queue triggers the crash. The queue length is at its default of 1000, which can be raised under heavy load. The poster in https://forum.pfsense.org/index.php?topic=68919.0 did something similar, although he does not experience crashes as we do.

      I would love any feedback on this, as it is hard for me to troubleshoot this.
      Remember, I am not sure my "precursor"-observations are even relevant. It just seems odd.

      Cheers! :)
      firewall-precursor.PNG
      firewall-crash.PNG
      another-crash.PNG
      fw-1-panick.txt

      • H
        heper

        Putting the entire crash report on pastebin (or similar) might provide more clues for some of the veteran members or developers.

        • J
          JDK

          Please see below pastebin:

          http://pastebin.com/wsiTU46i

          Thank you :)

          • H
            heper

            Try increasing your mbufs

            
            [zone: mbuf] kern.ipc.nmbufs limit reached
            [zone: mbuf] kern.ipc.nmbufs limit reached
            [zone: mbuf] kern.ipc.nmbufs limit reached
            [zone: mbuf] kern.ipc.nmbufs limit reached
            
            

            https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#mbuf_.2F_nmbclusters
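
A short Python sketch for tallying which limit the kernel messages quoted above actually name. The log text is the four lines from this post. The `limits_hit` helper and the regular expression are assumptions made for this sketch and only cover messages of the form `[zone: NAME] TUNABLE limit reached`.

```python
import re
from collections import Counter

# The messages quoted in this post.
LOG = """\
[zone: mbuf] kern.ipc.nmbufs limit reached
[zone: mbuf] kern.ipc.nmbufs limit reached
[zone: mbuf] kern.ipc.nmbufs limit reached
[zone: mbuf] kern.ipc.nmbufs limit reached
"""

# Matches "[zone: <name>] <tunable> limit reached".
PATTERN = re.compile(r"\[zone: ([^\]]+)\] (\S+) limit reached")

def limits_hit(text):
    """Count (zone, tunable) pairs across all matching log lines."""
    counts = Counter()
    for line in text.splitlines():
        m = PATTERN.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

hits = limits_hit(LOG)
for (zone, tunable), n in hits.items():
    print(f"{tunable} (zone {zone}): {n} message(s)")
```

Every message here names `kern.ipc.nmbufs`, so that is the tunable to look at first. `kern.ipc.nmbclusters` is a separate limit and does not appear in these lines.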

            • J
              JDK

              Thank you very much for your reply :)

              I made the following change to the system tunables:

              kern.ipc.nmbclusters="131072"

              This is actually down from the 1000000 we had it at before, because BlueKobold (https://forum.pfsense.org/index.php?topic=107217.0) suggests that very large mbuf settings can cause stability issues.

              Furthermore, I made changes to the boot loader configuration, because the firewalls run on Dell blades, and Dell hardware with Broadcom bce(x) interfaces has had stability problems related to mbuf size, TSO and MSI-X (https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#mbuf_.2F_nmbclusters):

              kern.ipc.nmbclusters="131072"
              hw.bce.tso_enable=0
              hw.pci.enable_msix=0

              Will keep you posted :)

              • J
                JDK

                This did not help.

                I still get the same report, "mbuf limit reached", even though I am nowhere near mbuf exhaustion (usage is about 7%) and I still have ~3 GB of RAM available.

                See attachments :)

                ![26-08-16 - mbuf.png](/public/imported_attachments/1/26-08-16 - mbuf.png)
                ![26-08-16 - memory.png](/public/imported_attachments/1/26-08-16 - memory.png)

                • D
                  divsys

                  Sorry, I'm not experienced enough to properly diagnose the dump log, but I have seen references to tunables affecting the igb interfaces you're using.

                  Perhaps some of the notes in https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards would be helpful?

                  -jfp

                  • H
                    Harvy66

                    Just a guess, but a seg fault about running out of memory when there is plenty available sounds like a hardware error, or possibly a driver bug. If you have ECC memory, is there a way you can check for memory errors?

                    • jimpJ
                      jimp Rebel Alliance Developer Netgate

                      Add a tunable for kern.ipc.nmbufs=1000000 and see if that helps.

                      Also post the output of "netstat -m" just after a reboot and then after running a day or so.
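
A minimal Python sketch of how two `netstat -m` captures (one just after reboot, one a day later) could be compared. The snapshot text follows the general shape of FreeBSD `netstat -m` output, but every number in it is invented for illustration and none come from this thread. `parse_netstat_m` and the `PATTERNS` table are hypothetical helpers, and measuring cluster usage as total against max is an assumption of this sketch.

```python
import re

# Illustrative snapshots only; the numbers are made up for this sketch.
AFTER_REBOOT = """\
1026/2394/3420 mbufs in use (current/cache/total)
1024/1232/2256/131072 mbuf clusters in use (current/cache/total/max)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
"""
AFTER_A_DAY = """\
9000/3000/12000 mbufs in use (current/cache/total)
8192/1024/9216/131072 mbuf clusters in use (current/cache/total/max)
37/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
"""

# One pattern per counter line of interest.
PATTERNS = {
    "mbufs": re.compile(r"^(\d+)/(\d+)/(\d+) mbufs in use"),
    "clusters": re.compile(r"^(\d+)/(\d+)/(\d+)/(\d+) mbuf clusters in use"),
    "denied": re.compile(r"^(\d+)/(\d+)/(\d+) requests for mbufs denied"),
}

def parse_netstat_m(text):
    """Return the slash-separated counters of each recognised line as ints."""
    stats = {}
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            m = pattern.match(line.strip())
            if m:
                stats[name] = tuple(int(g) for g in m.groups())
    return stats

before = parse_netstat_m(AFTER_REBOOT)
after = parse_netstat_m(AFTER_A_DAY)

# Growth in total mbufs, cluster usage against max, and newly denied requests.
mbuf_growth = after["mbufs"][2] - before["mbufs"][2]
cluster_pct = 100.0 * after["clusters"][2] / after["clusters"][3]
new_denied = after["denied"][0] - before["denied"][0]

print(f"mbufs grew by {mbuf_growth}, clusters at {cluster_pct:.1f}% of max, "
      f"{new_denied} new denied mbuf requests")
```

The point of comparing the two captures is that cluster usage can look low while the plain mbuf count keeps climbing, and a rising "denied" counter shows which of the two is actually running out.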

                      Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.