Netgate Discussion Forum

    PfSense 2.4.3-RELEASE hang/crash reboots - "Fatal trap 9:"

General pfSense Questions
    17 Posts 2 Posters 2.4k Views
breakaway

Ok, unfortunately it crashed again just now. I'm not convinced this is a pfSense/ESXi interop issue: I've got ~17 pfSense 2.4.2-RELEASE deployments, almost all of them running on VMware ESXi 5.5 U3 without issue. That's FreeBSD 11.1 as well, and none of those crash.

      db:0:kdb.enter.default>  bt
      Tracing pid 35907 tid 100123 td 0xfffff8000e19a5c0
      in_pcbfree() at in_pcbfree+0x143/frame 0xfffffe0000208940
      udp_detach() at udp_detach+0xa2/frame 0xfffffe0000208970
      sofree() at sofree+0x101/frame 0xfffffe00002089a0
      soclose() at soclose+0x366/frame 0xfffffe00002089f0
      closef() at closef+0x264/frame 0xfffffe0000208a80
      closefp() at closefp+0x9d/frame 0xfffffe0000208ac0
      amd64_syscall() at amd64_syscall+0xa4c/frame 0xfffffe0000208bf0
      fast_syscall_common() at fast_syscall_common+0x106/frame 0x7fffffffe810
      
jimp (Rebel Alliance Developer, Netgate)

Those last two do not appear to be related to IPsec. The fact that every panic is different would usually make me lean toward hardware, but since it's virtual that gets trickier.

They are vaguely like crashes we used to see a long time ago, back when the NIC queues had to be reduced to 1. You might try that and see if it helps: https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#Intel_igb.284.29_and_em.284.29_Cards
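For reference, the queue reduction that doc describes is done with a loader tunable in /boot/loader.conf. A sketch for the Intel igb(4) driver (this is the documented case; whether vmxnet honors any equivalent tunable on FreeBSD 11.x is a separate question):

```
# /boot/loader.conf -- force the Intel igb(4) driver down to a single queue
hw.igb.num_queues=1
```

Loader tunables only take effect after a reboot.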

        Remember: Upvote with the 👍 button for any user/post you find to be helpful, informative, or deserving of recognition!

        Need help fast? Netgate Global Support!

        Do not Chat/PM for help!

breakaway

That most recent crash happened with a vmxnet adapter (VMXNET 3 VMware paravirtual adapter).

Do you still think it is worthwhile reducing the queues to one? The page you linked does not mention vmxnet adapters at all. What line would I put into /boot/loader.conf to reduce the vmxnet adapter's queues to 1?

jimp (Rebel Alliance Developer, Netgate)

            I'm not sure there is actually a tunable for that, but you can try.

            We only saw those particular crashes with igb, not vmxnet* or em.

breakaway

This is still happening. It was fine for 4 days, then restarted twice in an hour. Each time the stack trace shows a different faulting component.

I have set up a brand new ESXi host running VMware ESXi 6.0 U3 (released Feb 2018). I have moved our router VM to this host and upgraded it to the latest virtual machine hardware version supported on ESXi 6.0 U3 (VM version 11)… let's see how this pans out.

One odd thing I've noticed is a message in the Web Client which says:

The configured guest OS (FreeBSD (64-bit)) for this virtual machine does not match the guest that is currently running (FreeBSD 11.1-RELEASE-p7). You should specify the correct guest OS to allow for guest-specific optimizations.

Not sure what that's all about…

jimp (Rebel Alliance Developer, Netgate)

That last bit is fixed in ESXi 6.7; it's harmless, though.

6.0 U3 still isn't technically compatible with FreeBSD 11.x. It may work, but VMware only claims FreeBSD 11.x support on ESXi 6.5 and later.
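The mismatch warning itself comes from the guest OS identifier stored in the VM's .vmx file. A sketch of the relevant line, assuming the generic 64-bit FreeBSD identifier used by pre-6.5 releases (the VM should be powered off before editing the .vmx):

```
# .vmx fragment -- guest OS identifier for a 64-bit FreeBSD guest
guestOS = "freebsd-64"
```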

breakaway

Ok, I am going to consider this "fixed" now. The system has been up for over 5 days, the longest it has stayed up since the rebooting issue started (it very rarely made it past 3-4 days).

Must be some sort of interop issue between FreeBSD 11.x and VMware ESXi 5.5 U3.

Note that I had actually neglected to check the VM version. It was running VM version 8, the default hardware version for ESXi 5.0, which is when this deployment was put in. Simply updating this to VM version 10, the latest supported on ESXi 5.5, could potentially have solved the problem; I just totally missed that this could be the cause of the issue.
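For anyone else on a standalone host, the hardware version can also be bumped from the ESXi shell with vim-cmd. A sketch, where the VM ID (10 here) is a placeholder you'd look up first, and the VM must be powered off:

```
# List VMs to find the target's ID, then upgrade its virtual hardware version
vim-cmd vmsvc/getallvms
vim-cmd vmsvc/upgrade 10 vmx-10
```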

breakaway

                    Spoke too soon, it just crashed and rebooted.

                    db:0:kdb.enter.default>  bt
                    Tracing pid 7 tid 100044 td 0xfffff8000342a5c0
                    kdb_enter() at kdb_enter+0x3b/frame 0xfffffe004cf361e0
                    vpanic() at vpanic+0x1a3/frame 0xfffffe004cf36260
                    panic() at panic+0x43/frame 0xfffffe004cf362c0
                    complete_jsegs() at complete_jsegs+0x854/frame 0xfffffe004cf36310
                    softdep_disk_write_complete() at softdep_disk_write_complete+0x42c/frame 0xfffffe004cf36370
                    bufdone_finish() at bufdone_finish+0x34/frame 0xfffffe004cf363e0
                    bufdone() at bufdone+0x87/frame 0xfffffe004cf36400
                    g_io_deliver() at g_io_deliver+0x205/frame 0xfffffe004cf36460
                    g_io_deliver() at g_io_deliver+0x205/frame 0xfffffe004cf364c0
                    g_io_deliver() at g_io_deliver+0x205/frame 0xfffffe004cf36520
                    g_disk_done() at g_disk_done+0x129/frame 0xfffffe004cf36570
                    dadone() at dadone+0x1826/frame 0xfffffe004cf36b20
                    xpt_done_process() at xpt_done_process+0x677/frame 0xfffffe004cf36b60
                    xpt_done_td() at xpt_done_td+0x196/frame 0xfffffe004cf36bb0
                    fork_exit() at fork_exit+0x85/frame 0xfffffe004cf36bf0
                    fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe004cf36bf0
                    --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
                    

Tonight, I am going to create a brand new VM with a brand new config file and disks, install pfSense, and restore the config to see if that stops this.

If that doesn't work I am out of ideas - I have other deployments that run this exact same setup (ESXi 5.5 U3, pfSense, IPsec, pfBlockerNG) and don't have this problem.

breakaway

I didn't end up rebuilding the router - that's way too much work. Instead, I decided to disable AES-NI on the system and switch all IPsec tunnels from AES-GCM to Blowfish to see if that helps.
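For the record, "disabling AES-NI" here means setting the pfSense Cryptographic Hardware option (System > Advanced > Miscellaneous) to None, which stops the aesni(4) kernel module from being loaded. A quick sanity check from the shell:

```
# Shows the module if it is loaded; errors out otherwise
kldstat -m aesni
```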

                      It has now been over 7 days, no reboots. I have never made it this long without a reboot before.

breakaway

Ok, I am calling this fixed. I've got an uptime of 14 days after disabling AES-NI on this machine; previously I couldn't make it past 4-5 days.
