Netgate Discussion Forum

Fatal error - regularly

General pfSense Questions
9 Posts 5 Posters 1.7k Views
  • J
    JDK
    last edited by Aug 22, 2016, 9:08 AM

    This is a duplicate of post 115893, reposted to change the post's ownership.

    First of all, thank you for all your effort; that a corporation like mine can rely on your products for two separate solutions is pretty awesome!

    Now, to

    My setup:

    I have a couple of pfSense boxes located on two Dell blades (iDracs), PowerEdge R210 II. Each has a virtual bridged interface between WAN and LAN and functions as a bridged firewall. They are redundantly configured via STP, so that the connection to the secondary firewall is cut whenever the primary firewall is responding with BPDU packets.

    Hardware:

    igb0-3 (the bridged interfaces):
    Intel(R) PRO/1000 Network Connection version - 2.4.0
    Using MSIX interrupts with 5 vectors

    Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz
    Current: 3100 MHz, Max: 3101 MHz
    4 CPUs: 1 package(s) x 4 core(s)

    And my build:
    2.2.6-RELEASE (amd64)
    built on Mon Dec 21 14:50:08 CST 2015
    FreeBSD 10.1-RELEASE-p25

    The problem:

    Every other or third day, the primary firewall crashes, failing over to the secondary. I have attached a text-file with a dump.
    I take note of the following message, even though I am not 100% sure of how I should interpret it:

    Fatal trap 12: page fault while in kernel mode
    cpuid = 2; apic id = 04
    fault virtual address    = 0x1d
    fault code        = supervisor read data, page not present
    instruction pointer    = 0x20:0xffffffff80b904b7
    stack pointer            = 0x28:0xfffffe001a3d06c0
    frame pointer            = 0x28:0xfffffe001a3d0740
    code segment        = base 0x0, limit 0xfffff, type 0x1b
                = DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags    = interrupt enabled, resume, IOPL = 0
    current process        = 12 (irq276: igb2:que 2)
    version.txt: FreeBSD 10.1-RELEASE-p25 #0 c39b63e(releng/10.1)-dirty: Mon Dec 21 15:20:13 CST 2015
        root@pfs22-amd64-builder:/usr/obj.RELENG_2_2.amd64/usr/pfSensesrc/src.RELENG_2_2/sys/pfSense_SMP.10

    Observations:

    I have monitored traffic on the inside (LAN) interfaces of the firewalls, and you can see two attached images of our primary and secondary firewalls.
    On the graphs, "outbound" means outbound from the firewall via the LAN-interface, i.e. from WAN to LAN.

    Firstly, I have attached an image of what I believe to be a precursor:

    Normally, I expect equal amounts of traffic on both firewalls, as they function as bridges and simply pass on all packets (firewalled, of course). Packets are blocked by STP on a later switch on the WAN-side. On the "precursor-graphs", we see a sudden spike in traffic on only the primary firewall, after which traffic flows unevenly. The spike is around 200 Mbit, which is also observed in other "precursors".

    Next, I have attached an image of the actual crash:

    About an hour or two later, everything looks fine, except that the primary firewall just "disappears" on the graphs all of a sudden. This is because of the kernel crash.

    Now I do not know if the spikes and the crashes are even related - they may not be. I just found it odd. Especially since this abnormality has been observed more than once. See the file "another-crash".

    Diagnosis?:

    Since the crash report says "current process        = 12 (irq276: igb2:que 2)", I suspect the cause may be an insufficient TCP queue length on the WAN interface (igb2), so that a backlog larger than the queue can hold triggers the crash. The queue length is at its default of 1000, which can be raised in case of heavy load. This poster (https://forum.pfsense.org/index.php?topic=68919.0) has done something similar, although he doesn't experience crashes as we do.
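
When comparing several crash dumps, the "current process" line can be picked apart mechanically to see whether the same interface and queue show up every time. Below is a minimal Python sketch, assuming the line layout seen in the single dump quoted above. The function name `parse_current_process` and the regular expression are illustrative only and are not part of any pfSense tooling.

```python
import re

# Matches lines such as:
#   current process        = 12 (irq276: igb2:que 2)
# The layout is an assumption based on the one dump quoted in this thread.
_PATTERN = re.compile(
    r"current process\s*=\s*(?P<pid>\d+)\s*"
    r"\(irq(?P<irq>\d+):\s*(?P<nic>[a-z]+\d+):que\s*(?P<queue>\d+)\)"
)


def parse_current_process(line):
    """Return pid, irq, interface and queue from a panic line, or None."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "irq": int(match.group("irq")),
        "interface": match.group("nic"),
        "queue": int(match.group("queue")),
    }


if __name__ == "__main__":
    sample = "current process        = 12 (irq276: igb2:que 2)"
    print(parse_current_process(sample))
```

Running this over the matching line from each dump would show whether the crashes always land on igb2, which may help decide whether the traffic spikes are related.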

    I would love any feedback on this, as it is hard for me to troubleshoot this.
    Remember, I am not sure my "precursor"-observations are even relevant. It just seems odd.

    Cheers! :)
    firewall-precursor.PNG
    firewall-crash.PNG
    another-crash.PNG
    fw-1-panick.txt

    • H
      heper
      last edited by Aug 22, 2016, 9:29 AM

      putting up the entire crash report on pastebin (or similar) might provide more clues for some of the veteran members or developers.

      • J
        JDK
        last edited by Aug 22, 2016, 11:03 AM

        Please see below pastebin:

        http://pastebin.com/wsiTU46i

        Thank you :)

        • H
          heper
          last edited by Aug 22, 2016, 12:27 PM

          Try increasing your mbufs

          
          [zone: mbuf] kern.ipc.nmbufs limit reached
          [zone: mbuf] kern.ipc.nmbufs limit reached
          [zone: mbuf] kern.ipc.nmbufs limit reached
          [zone: mbuf] kern.ipc.nmbufs limit reached
          
          

          https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#mbuf_.2F_nmbclusters

          • J
            JDK
            last edited by Aug 24, 2016, 6:50 AM

            Thank you very much for your reply :)

            I made the following change to the system tunables:

            kern.ipc.nmbclusters="131072"

            This is actually down from the 1,000,000 we had before, because BlueKobold (https://forum.pfsense.org/index.php?topic=107217.0) suggests that very large mbuf settings can cause stability issues.

            Furthermore, I made changes to the boot loader configuration, because we run the firewalls on Dell blades, and Dell machines with Broadcom bce(x) interfaces have had stability problems related to mbuf size, TSO and MSI-X (https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#mbuf_.2F_nmbclusters):

            kern.ipc.nmbclusters="131072"
            hw.bce.tso_enable=0
            hw.pci.enable_msix=0

            Will keep you posted :)

            • J
              JDK
              last edited by Aug 26, 2016, 8:51 AM

              This did not help.

              I still get the same report: mbuf limit reached, even though I am nowhere near mbuf exhaustion (we're talking 7%) and I have ~3 GB of RAM still available.

              See attachments :)

              ![26-08-16 - mbuf.png](/public/imported_attachments/1/26-08-16 - mbuf.png)
              ![26-08-16 - memory.png](/public/imported_attachments/1/26-08-16 - memory.png)

              • D
                divsys
                last edited by Aug 26, 2016, 5:28 PM

                Sorry, I'm not good enough to properly diagnose the dump log, but I have seen references to tunables affecting the igb interfaces you're using.

                Perhaps some of the notes in https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards would be helpful?

                -jfp

                • H
                  Harvy66
                  last edited by Aug 26, 2016, 8:38 PM

                  Just taking a guess, but a fault complaining about not enough memory when there is plenty available sounds like a hardware error, or possibly a driver bug. If you have ECC memory, is there a way you can check for memory errors?

                  • J
                    jimp Rebel Alliance Developer Netgate
                    last edited by Aug 30, 2016, 3:54 PM

                    Add a tunable for kern.ipc.nmbufs=1000000 and see if that helps.

                    Also post the output of "netstat -m" just after a reboot and then after running a day or so.
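
To compare the two "netstat -m" snapshots, the usage figures can be pulled out with a short script. Below is a minimal Python sketch, assuming the FreeBSD 10.x line layout "current/cache/total mbufs in use" and "current/cache/total/max mbuf clusters in use". The function name `parse_netstat_m` is hypothetical and the numbers in the sample are made up for illustration, not taken from the firewall in this thread.

```python
import re


def parse_netstat_m(text):
    """Extract mbuf and mbuf-cluster counters from `netstat -m` output."""
    stats = {}

    # Assumed layout: "8196/2799/10995 mbufs in use (current/cache/total)"
    m = re.search(r"(\d+)/(\d+)/(\d+) mbufs in use", text)
    if m:
        stats["mbufs"] = dict(
            zip(("current", "cache", "total"), map(int, m.groups()))
        )

    # Assumed layout:
    # "8194/1760/9954/131072 mbuf clusters in use (current/cache/total/max)"
    m = re.search(r"(\d+)/(\d+)/(\d+)/(\d+) mbuf clusters in use", text)
    if m:
        clusters = dict(
            zip(("current", "cache", "total", "max"), map(int, m.groups()))
        )
        # Share of the cluster limit in use; None if the limit reads as 0.
        clusters["percent_of_max"] = (
            100.0 * clusters["total"] / clusters["max"]
            if clusters["max"]
            else None
        )
        stats["clusters"] = clusters

    return stats


if __name__ == "__main__":
    # Illustrative sample only.
    sample = (
        "8196/2799/10995 mbufs in use (current/cache/total)\n"
        "8194/1760/9954/131072 mbuf clusters in use (current/cache/total/max)\n"
    )
    print(parse_netstat_m(sample))
```

Note that the cluster percentage is measured against kern.ipc.nmbclusters, so by itself it says nothing about the separate kern.ipc.nmbufs limit named in the log message. Comparing the "mbufs in use" totals between the two snapshots is probably the more relevant check here.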

                    Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.