    pfSense 2.4.3-RELEASE hang/crash reboots - "Fatal trap 9:"

    • B
      breakaway last edited by

      Version:

      • pfSense 2.4.3-RELEASE

      Hardware:

      • Virtual machine on top of VMware ESXi 5.5 Update 2 (build 5230635)
      • Machine has 4 vCPU and 1 GB RAM assigned to it along with a 20 GB thin provisioned vDisk
      • VM Version "vmx-09".
      • 2 x NICs (em0 and em1)
      • em0 is used for WAN connectivity
      • em1 is used as a VLAN trunk (I have about 10 VLANs for various networks)

      What this machine does:

      • Internet Connectivity
      • All our remote sites VPN back to this pfSense over IPSEC

      Packages Installed:

      • Open-vm-tools
      • pfBlockerNG
      • ACME (Let's Encrypt certificate renewal)

      The problem: pfSense reboots at random times (sometimes during heavy load, sometimes when there is no load). This virtual machine has run without issues for several years now only to start exhibiting random reboots from time to time in the past month or so. It happens once every 2-3 days on average (sometimes more frequently). If I log into pfSense just after the reboot, there is a "crash dump available" message at the top of the screen.

      At the very bottom of the crash dump (which is thousands of lines long, of course) is the following:

      Fatal trap 9: general protection fault while in kernel mode
      cpuid = 2; apic id = 02
      instruction pointer	= 0x20:0xffffffff80e81172
      stack pointer	        = 0x28:0xfffffe004ce829b0
      frame pointer	        = 0x28:0xfffffe004ce82a40
      code segment		= base 0x0, limit 0xfffff, type 0x1b
      			= DPL 0, pres 1, long 1, def32 0, gran 1
      processor eflags	= interrupt enabled, resume, IOPL = 0
      current process		= 12 (swi4: clock (0))
      FreeBSD 11.1-RELEASE-p7 #10 r313908+986837ba7e9(RELENG_2_4): Mon Mar 26 18:08:25 CDT 2018
          root@buildbot2.netgate.com:/builder/ce-243/tmp/obj/builder/ce-243/tmp/FreeBSD-src/sys/pfSense
      

      So far, I have done a reinstall of pfSense using the latest 2.4.3 media and by using the "rescue config.xml" option in the shell. Obviously this has not helped. What else can I do to figure out why this is happening? As far as I am aware, there were no changes made to our infrastructure when this issue started happening so not too sure what could be the root cause.

      • jimp
        jimp Rebel Alliance Developer Netgate last edited by

        That's not enough of the crash dump to say for sure what happened. Need to see the backtrace near the start of the dump. It should be keeping a copy of that crash dump and showing it to you in the GUI after it reboots.

        If not then perhaps your VM doesn't have any swap space configured.
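For anyone who wants to pull the backtrace out of a saved dump themselves, here is a minimal Python sketch. It assumes the dump is a FreeBSD textdump tar archive containing a `ddb.txt` member and that the backtrace follows a scripted `bt` command at a `db:...>` prompt, as in the traces posted in this thread. The function names are hypothetical, and the `show pcpu` and `ps` lines in the sample are illustrative filler, not taken from a real dump.

```python
import tarfile
from typing import List


def read_textdump_member(tar_path: str, member: str = "ddb.txt") -> str:
    """Return one text file from a textdump tar archive as a string."""
    with tarfile.open(tar_path) as archive:
        handle = archive.extractfile(member)
        if handle is None:
            raise ValueError(f"{member} is not a regular file in {tar_path}")
        return handle.read().decode("utf-8", errors="replace")


def backtrace_from_ddb(ddb_text: str) -> List[str]:
    """Return the lines printed by the first scripted `bt` command."""
    collected: List[str] = []
    in_backtrace = False
    for line in ddb_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("db:") and ">" in stripped:
            if in_backtrace:
                break  # the next scripted command ends the backtrace
            command = stripped.split(">", 1)[1].strip()
            in_backtrace = command == "bt"
        elif in_backtrace and stripped:
            collected.append(stripped)
    return collected


# Illustrative input: the bt lines come from this thread, the rest is filler.
SAMPLE_DDB = """\
db:0:kdb.enter.default>  show pcpu
cpuid        = 2
db:0:kdb.enter.default>  bt
Tracing pid 0 tid 100048 td 0xfffff8000365d000
key_addref() at key_addref+0x4/frame 0xfffffe004cf96640
ipsec_getpcbpolicy() at ipsec_getpcbpolicy+0x51/frame 0xfffffe004cf96680
db:0:kdb.enter.default>  ps
  pid  ppid  pgrp   uid   state   wmesg         wchan        cmd
"""

for line in backtrace_from_ddb(SAMPLE_DDB):
    print(line)
```

To use it on a real dump, pass the text returned by `read_textdump_member()` (with whatever archive path your system saved) to `backtrace_from_ddb()`. Review the output before posting it anywhere public.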

        Remember: Upvote with the 👍 button for any user/post you find to be helpful, informative, or deserving of recognition!

        Need help fast? Netgate Global Support!

        Do not Chat/PM for help!

        • B
          breakaway last edited by

          jimp, am I able to PM the crash dump to you? I'd rather not post that stuff on a public forum.

          • jimp
            jimp Rebel Alliance Developer Netgate last edited by

            Yes, that's fine.

            • B
              breakaway last edited by

              Thanks jimp, much appreciated. I have pm'ed the logs to you.

              • jimp
                jimp Rebel Alliance Developer Netgate last edited by

                OK, nothing private in the backtrace so that should be OK here (from the archive you sent):

                db:0:kdb.enter.default>  bt
                Tracing pid 0 tid 100048 td 0xfffff8000365d000
                key_addref() at key_addref+0x4/frame 0xfffffe004cf96640
                ipsec_getpcbpolicy() at ipsec_getpcbpolicy+0x51/frame 0xfffffe004cf96680
                ipsec4_getpolicy() at ipsec4_getpolicy+0x25/frame 0xfffffe004cf96720
                ipsec4_in_reject() at ipsec4_in_reject+0x1d/frame 0xfffffe004cf96750
                udp_append() at udp_append+0xaa/frame 0xfffffe004cf967c0
                udp_input() at udp_input+0x49b/frame 0xfffffe004cf96880
                ip_input() at ip_input+0x135/frame 0xfffffe004cf968e0
                netisr_dispatch_src() at netisr_dispatch_src+0xa0/frame 0xfffffe004cf96930
                ether_demux() at ether_demux+0x16d/frame 0xfffffe004cf96960
                ether_nh_input() at ether_nh_input+0x337/frame 0xfffffe004cf969c0
                netisr_dispatch_src() at netisr_dispatch_src+0xa0/frame 0xfffffe004cf96a10
                ether_input() at ether_input+0x26/frame 0xfffffe004cf96a30
                if_input() at if_input+0xa/frame 0xfffffe004cf96a40
                lem_rxeof() at lem_rxeof+0x3ef/frame 0xfffffe004cf96ae0
                lem_handle_rxtx() at lem_handle_rxtx+0x32/frame 0xfffffe004cf96b20
                taskqueue_run_locked() at taskqueue_run_locked+0x147/frame 0xfffffe004cf96b80
                taskqueue_thread_loop() at taskqueue_thread_loop+0xb8/frame 0xfffffe004cf96bb0
                fork_exit() at fork_exit+0x85/frame 0xfffffe004cf96bf0
                fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe004cf96bf0
                
                

                That appears to be a crash in IPsec key management, though I can't recall seeing a crash ever happen there.

                Is there any way you can try it on a current ESX like a fully patched 6.5/6.5U1/6.7?

                pfSense 2.4.3 is based on FreeBSD 11.1, and VMware claims that FreeBSD 11 is only compatible with ESXi 6.5 or newer. Though I know some people run it on 5.5, there is always a chance of compatibility issues.

                • B
                  breakaway last edited by

                  Hi,

                  Here's another crash. The one I sent you was from yesterday AM. This one is from Wednesday last week (18th)

                  db:0:kdb.enter.default>  bt
                  Tracing pid 12 tid 100008 td 0xfffff800032eb5c0
                  key_timehandler() at key_timehandler+0x732/frame 0xfffffe004ce82a40
                  softclock_call_cc() at softclock_call_cc+0x13b/frame 0xfffffe004ce82af0
                  softclock() at softclock+0xb9/frame 0xfffffe004ce82b20
                  intr_event_execute_handlers() at intr_event_execute_handlers+0xec/frame 0xfffffe004ce82b60
                  ithread_loop() at ithread_loop+0xd6/frame 0xfffffe004ce82bb0
                  fork_exit() at fork_exit+0x85/frame 0xfffffe004ce82bf0
                  fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe004ce82bf0
                  --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
                  

                  Does this look ipsec related too? I don't see any mention of ipsec in there.

                  Anyway, I am getting desperate to resolve this so last night I rebuilt the VM with vmxnet3 interfaces rather than e1000.

                  Changing VMware versions isn't really an option at this stage - we are part way through a migration to Proxmox.

                  • B
                    breakaway last edited by

                    Ok, unfortunately it crashed again just now. I'm not convinced this is a pfSense/ESXi interop issue. I've got ~17 pfSense 2.4.2-RELEASE installs, almost all of them running on VMware ESXi 5.5 U3 without issue. Those are FreeBSD 11.1 as well, and none of them crash.

                    db:0:kdb.enter.default>  bt
                    Tracing pid 35907 tid 100123 td 0xfffff8000e19a5c0
                    in_pcbfree() at in_pcbfree+0x143/frame 0xfffffe0000208940
                    udp_detach() at udp_detach+0xa2/frame 0xfffffe0000208970
                    sofree() at sofree+0x101/frame 0xfffffe00002089a0
                    soclose() at soclose+0x366/frame 0xfffffe00002089f0
                    closef() at closef+0x264/frame 0xfffffe0000208a80
                    closefp() at closefp+0x9d/frame 0xfffffe0000208ac0
                    amd64_syscall() at amd64_syscall+0xa4c/frame 0xfffffe0000208bf0
                    fast_syscall_common() at fast_syscall_common+0x106/frame 0x7fffffffe810
                    
                    • jimp
                      jimp Rebel Alliance Developer Netgate last edited by

                      Those last two do not appear to be related to IPsec. The fact that every panic is different would usually make me lean toward hardware, but since it's virtual that gets trickier.

                      They are vaguely like crashes we used to see a long time ago when the NIC queues had to be reduced to 1. You might try that, see if it helps. https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#Intel_igb.284.29_and_em.284.29_Cards
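To compare the panics side by side, a small Python sketch like the one below can list the innermost function of each backtrace. It assumes the frame format shown in the traces posted in this thread. The helper names are hypothetical, and treating `kdb_enter`, `vpanic`, and `panic` as uninformative frames is a convenience choice for this sketch, not an official rule. The traces are shortened copies of the four posted here.

```python
import re
from typing import List, Optional

# Matches one DDB backtrace frame, for example:
#   key_addref() at key_addref+0x4/frame 0xfffffe004cf96640
FRAME_RE = re.compile(r"^\s*(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+\s*$")

# Frames that only report the panic itself (assumption made for this sketch).
PANIC_PLUMBING = {"kdb_enter", "vpanic", "panic"}


def frames(backtrace: str) -> List[str]:
    """Return the function names in a DDB backtrace, innermost first."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


def faulting_function(backtrace: str) -> Optional[str]:
    """Return the innermost frame that is not panic plumbing, if any."""
    for name in frames(backtrace):
        if name not in PANIC_PLUMBING:
            return name
    return None


# Shortened copies of the backtraces posted in this thread.
TRACES = {
    "crash 1": """\
Tracing pid 0 tid 100048 td 0xfffff8000365d000
key_addref() at key_addref+0x4/frame 0xfffffe004cf96640
ipsec_getpcbpolicy() at ipsec_getpcbpolicy+0x51/frame 0xfffffe004cf96680
""",
    "crash 2": """\
Tracing pid 12 tid 100008 td 0xfffff800032eb5c0
key_timehandler() at key_timehandler+0x732/frame 0xfffffe004ce82a40
softclock_call_cc() at softclock_call_cc+0x13b/frame 0xfffffe004ce82af0
""",
    "crash 3": """\
Tracing pid 35907 tid 100123 td 0xfffff8000e19a5c0
in_pcbfree() at in_pcbfree+0x143/frame 0xfffffe0000208940
udp_detach() at udp_detach+0xa2/frame 0xfffffe0000208970
""",
    "crash 4": """\
Tracing pid 7 tid 100044 td 0xfffff8000342a5c0
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe004cf361e0
vpanic() at vpanic+0x1a3/frame 0xfffffe004cf36260
panic() at panic+0x43/frame 0xfffffe004cf362c0
complete_jsegs() at complete_jsegs+0x854/frame 0xfffffe004cf36310
""",
}

for label, trace in TRACES.items():
    print(label, faulting_function(trace))
```

For the traces above this reports `key_addref`, `key_timehandler`, `in_pcbfree`, and `complete_jsegs`. A different innermost function each time is the kind of spread that tends to point away from a single software bug.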

                      • B
                        breakaway last edited by

                        That most recent crash happened with a vmxnet adapter (VMXNET 3 VMware Paravirtual Adapter).

                        Do you still think it is worthwhile reducing the queues to one? The page you linked does not reference vmxnet network adapters at all. What is the line to put into /boot/loader.conf to reduce queues on the vmxnet adapter to 1?

                        • jimp
                          jimp Rebel Alliance Developer Netgate last edited by

                          I'm not sure there is actually a tunable for that, but you can try.

                          We only saw those particular crashes with igb, not vmxnet* or em.

                          • B
                            breakaway last edited by

                            This is still happening. It was fine for 4 days then restarted twice in an hour. Each time the stack trace shows some different faulting component.

                            I have set up a brand new VMware ESXi host running ESXi 6.0 U3 (released Feb 2018). I have moved our router VM to this host and upgraded the virtual machine to the latest hardware version supported on ESXi 6.0 U3 (VM version 11). Let's see how this pans out.

                            One odd thing I've noticed is a message in the Web Client which says

                            The configured guest OS (FreeBSD (64-bit)) for this virtual machine does not match the guest that is currently running (FreeBSD 11.1-RELEASE-p7). You should specify the correct guest OS to allow for guest-specific optimizations.

                            Not sure what that's all about…

                            • jimp
                              jimp Rebel Alliance Developer Netgate last edited by

                              That last bit is fixed in ESX 6.7. It's harmless, though.

                              6.0 U3 still isn't technically compatible with FreeBSD 11.x. It may work but ESX only claims support for 6.5 and later.

                              • B
                                breakaway last edited by

                                Ok, I am going to consider this "fixed" now. The system has been up for over 5 days. This is the longest the system has stayed up since the rebooting issue started (very unusual to make it past 3-4 days).

                                Must be some sort of interop issue between FreeBSD 11.x and VMware ESXi 5.5 U3.

                                Note that I neglected to check the VM version. It was actually running VM version 8, the default hardware version for ESXi 5.0, which is what this deployment was originally built on. Simply updating it to VMX version 10, the latest supported on ESXi 5.5, could potentially have solved the problem. I totally missed that this could be the cause of the issue.

                                • B
                                  breakaway last edited by

                                  Spoke too soon, it just crashed and rebooted.

                                  db:0:kdb.enter.default>  bt
                                  Tracing pid 7 tid 100044 td 0xfffff8000342a5c0
                                  kdb_enter() at kdb_enter+0x3b/frame 0xfffffe004cf361e0
                                  vpanic() at vpanic+0x1a3/frame 0xfffffe004cf36260
                                  panic() at panic+0x43/frame 0xfffffe004cf362c0
                                  complete_jsegs() at complete_jsegs+0x854/frame 0xfffffe004cf36310
                                  softdep_disk_write_complete() at softdep_disk_write_complete+0x42c/frame 0xfffffe004cf36370
                                  bufdone_finish() at bufdone_finish+0x34/frame 0xfffffe004cf363e0
                                  bufdone() at bufdone+0x87/frame 0xfffffe004cf36400
                                  g_io_deliver() at g_io_deliver+0x205/frame 0xfffffe004cf36460
                                  g_io_deliver() at g_io_deliver+0x205/frame 0xfffffe004cf364c0
                                  g_io_deliver() at g_io_deliver+0x205/frame 0xfffffe004cf36520
                                  g_disk_done() at g_disk_done+0x129/frame 0xfffffe004cf36570
                                  dadone() at dadone+0x1826/frame 0xfffffe004cf36b20
                                  xpt_done_process() at xpt_done_process+0x677/frame 0xfffffe004cf36b60
                                  xpt_done_td() at xpt_done_td+0x196/frame 0xfffffe004cf36bb0
                                  fork_exit() at fork_exit+0x85/frame 0xfffffe004cf36bf0
                                  fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe004cf36bf0
                                  --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
                                  

                                  Tonight, I am going to create a brand new VM with brand new config file/disks, install pfSense and restore config to see if that stops this.

                                  If that doesn't work I am out of ideas - I have other deployments that run this exact same setup (ESXi 5.5 U3, pfSense, IPsec, pfBlockerNG) and don't have this problem.

                                  • B
                                    breakaway last edited by

                                    I didn't end up rebuilding the router. That's way too much work. I instead decided to disable AES-NI on the system and switch all IPSEC tunnels from AES-GCM to Blowfish to see if that will help.

                                    It has now been over 7 days, no reboots. I have never made it this long without a reboot before.

                                    • B
                                      breakaway last edited by

                                      Ok, I am calling this fixed. I've got an uptime of 14 days after disabling AES-NI on this machine. Previously I couldn't make it past 4-5 days.
