Netgate Discussion Forum

    Random crash on latest 23.09.1

    General pfSense Questions
    21 Posts 3 Posters 1.8k Views
      Yathus

      Hello,

      We have two pfSense instances in CARP mode running on top of our "private cloud" (VMware ESXi 7.0).

      Everything ran fine for some months, but now we are seeing random crashes like this one:

      Fatal trap 12: page fault while in kernel mode
      cpuid = 0; apic id = 00
      fault virtual address	= 0xfffffe00639ff008
      fault code		= supervisor read data, page not present
      instruction pointer	= 0x20:0xffffffff80af8550
      stack pointer	        = 0x0:0xfffffe00085ae4a0
      frame pointer	        = 0x0:0xfffffe00085ae4a0
      code segment		= base 0x0, limit 0xfffff, type 0x1b
      			= DPL 0, pres 1, long 1, def32 0, gran 1
      processor eflags	= resume, IOPL = 0
      current process		= 0 (if_io_tqg_0)
      rdi: fffff80005ca5000 rsi: 0000000000000000 rdx: 0000000000000000
      rcx: 0000000000000000  r8: 0000000000002000  r9: 0000000000000001
      rax: fffffe00639fd000 rbx: fffff8000595d000 rbp: fffffe00085ae4a0
      r10: 0000000000000000 r11: 0001000000000000 r12: 0000000000000000
      r13: 0000000000000000 r14: fffffe00639f9000 r15: 0000000000000000
      trap number		= 12
      panic: page fault
      cpuid = 0
      time = 1706798799
      KDB: enter: panic
      

      What could cause this type of problem?

      Thanks,

      Yathus

        stephenw10 Netgate Administrator

        Do you have the full crash report?

        What VM version are the pfSense VMs using?

        Steve

          Yathus @stephenw10

          @stephenw10 Yes, I have multiple "dumps", I guess. Can I share them on the forum? Is there no private data in them?

          23.09.1-RELEASE
          Running on VMware ESXi 7.0.3, 22348816, with the latest VM version: ESXi 7.0 U2 and later (VM version 19)
          The VMs have 4 vCPUs and 8 GB RAM
          VMXNET3 driver for networking

            stephenw10 Netgate Administrator

            You can upload it here: https://nc.netgate.com/nextcloud/s/wpxLk4dAJBJBrsR

              Yathus @stephenw10

              @stephenw10 Done, I uploaded 3 files.

                stephenw10 Netgate Administrator

                Ok that's 3 identical backtraces:

                db:1:pfs> bt
                Tracing pid 0 tid 100011 td 0xfffffe00093aee40
                kdb_enter() at kdb_enter+0x32/frame 0xfffffe00085ae180
                vpanic() at vpanic+0x163/frame 0xfffffe00085ae2b0
                panic() at panic+0x43/frame 0xfffffe00085ae310
                trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00085ae370
                trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00085ae3d0
                calltrap() at calltrap+0x8/frame 0xfffffe00085ae3d0
                --- trap 0xc, rip = 0xffffffff80af8550, rsp = 0xfffffe00085ae4a0, rbp = 0xfffffe00085ae4a0 ---
                vmxnet3_isc_txd_credits_update() at vmxnet3_isc_txd_credits_update+0x20/frame 0xfffffe00085ae4a0
                iflib_fast_intr_rxtx() at iflib_fast_intr_rxtx+0xf7/frame 0xfffffe00085ae500
                intr_event_handle() at intr_event_handle+0x126/frame 0xfffffe00085ae570
                intr_execute_handlers() at intr_execute_handlers+0x49/frame 0xfffffe00085ae5a0
                Xapic_isr2() at Xapic_isr2+0xdc/frame 0xfffffe00085ae5a0
                --- interrupt, rip = 0xffffffff80af85d2, rsp = 0xfffffe00085ae670, rbp = 0xfffffe00085ae670 ---
                vmxnet3_isc_txd_credits_update() at vmxnet3_isc_txd_credits_update+0xa2/frame 0xfffffe00085ae670
                iflib_completed_tx_reclaim() at iflib_completed_tx_reclaim+0x55/frame 0xfffffe00085ae6e0
                iflib_txq_drain() at iflib_txq_drain+0x6b/frame 0xfffffe00085ae760
                drain_ring_lockless() at drain_ring_lockless+0x5e/frame 0xfffffe00085ae7b0
                ifmp_ring_enqueue() at ifmp_ring_enqueue+0x265/frame 0xfffffe00085ae7f0
                iflib_if_transmit() at iflib_if_transmit+0x243/frame 0xfffffe00085ae860
                ether_output_frame() at ether_output_frame+0xa3/frame 0xfffffe00085ae890
                ether_output() at ether_output+0x673/frame 0xfffffe00085ae920
                ip_output_send() at ip_output_send+0xdc/frame 0xfffffe00085ae960
                ip_output() at ip_output+0x1284/frame 0xfffffe00085aea60
                ip_forward() at ip_forward+0x3c2/frame 0xfffffe00085aeb10
                ip_input() at ip_input+0x6e9/frame 0xfffffe00085aeb70
                netisr_dispatch_src() at netisr_dispatch_src+0x22c/frame 0xfffffe00085aebc0
                ether_demux() at ether_demux+0x149/frame 0xfffffe00085aebf0
                ether_nh_input() at ether_nh_input+0x36e/frame 0xfffffe00085aec50
                netisr_dispatch_src() at netisr_dispatch_src+0xaf/frame 0xfffffe00085aeca0
                ether_input() at ether_input+0x69/frame 0xfffffe00085aed00
                iflib_rxeof() at iflib_rxeof+0xc46/frame 0xfffffe00085aee00
                _task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe00085aee40
                gtaskqueue_run_locked() at gtaskqueue_run_locked+0x14e/frame 0xfffffe00085aeec0
                gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc2/frame 0xfffffe00085aeef0
                fork_exit() at fork_exit+0x7f/frame 0xfffffe00085aef30
                fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00085aef30
                --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
                

                And it's the same crash as this thread:
                https://forum.netgate.com/topic/184597/pfsense-reboot-randomly-on-vmware/
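A quick way to compare backtraces like these is to pull out the function in the first frame after the trap marker, since that is where the page fault was taken. The following is a minimal sketch, not a Netgate tool: the helper name `faulting_function` and the regular expressions are assumptions based only on the `bt` output format shown in this thread.

```python
import re

# Matches ddb "bt" frame lines such as:
#   vpanic() at vpanic+0x163/frame 0xfffffe00085ae2b0
FRAME_RE = re.compile(r"^\s*(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+\s*$")
# Matches the trap marker line, e.g. "--- trap 0xc, rip = ..."
TRAP_RE = re.compile(r"^\s*--- trap 0x[0-9a-f]+,")


def faulting_function(backtrace):
    """Return the function named in the first frame after the first trap marker."""
    seen_trap = False
    for line in backtrace.splitlines():
        if TRAP_RE.match(line):
            seen_trap = True
            continue
        if seen_trap:
            match = FRAME_RE.match(line)
            if match:
                return match.group(1)
    return None  # no trap marker or no frame after it


# Excerpt copied from the backtrace above.
SAMPLE = """\
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00085ae3d0
calltrap() at calltrap+0x8/frame 0xfffffe00085ae3d0
--- trap 0xc, rip = 0xffffffff80af8550, rsp = 0xfffffe00085ae4a0, rbp = 0xfffffe00085ae4a0 ---
vmxnet3_isc_txd_credits_update() at vmxnet3_isc_txd_credits_update+0x20/frame 0xfffffe00085ae4a0
iflib_fast_intr_rxtx() at iflib_fast_intr_rxtx+0xf7/frame 0xfffffe00085ae500
"""

print(faulting_function(SAMPLE))
```

Two crash reports that return the same function here are worth comparing frame by frame; a different function usually means a different issue.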

                It looks similar to this FreeBSD bug but that is already fixed: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239118

                I don't see it in the logs but are you running WireGuard?

                You might try setting the tunables shown here for the descriptor values:
                https://docs.netgate.com/pfsense/en/latest/hardware/tune.html#vmware-vmx-4-interfaces

                I'll check if there have been any updates there....

                Steve

                  stephenw10 Netgate Administrator

                  Ok several devs are looking at this and it looks like there is a suspect.

                  How much SWAP space do you have on there? We may need to enable a full crash dump to confirm this.

                    Yathus

                    Hello @stephenw10

                    First of all I must thank you for your help :)

                    We are not using WireGuard, only IPsec site-to-site (9 tunnels).
                    We have OpenVPN too, but the servers are disabled.

                    SWAP is 1024MB

                    For the tunables, do I just have to add this:

                    hw.pci.honor_msi_blacklist="0"
                    dev.vmx.<id>.iflib.override_ntxds="0,4096"
                    dev.vmx.<id>.iflib.override_nrxds="0,2048,0"
                    

                    for all my interfaces?

                      stephenw10 Netgate Administrator

                      Yes, for each vmx NIC.
                      The issue we are looking at appears to occur when descriptors are exhausted, so setting those values should at least make it take much longer to hit.
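Since the descriptor overrides have to be repeated for each vmx NIC, the lines can be generated rather than typed by hand. This is a minimal sketch: the function name `vmx_tunables` and the example NIC count of three are hypothetical, and the values are the ones quoted earlier in this thread. Where the lines are stored should follow the linked tuning documentation.

```python
def vmx_tunables(nic_ids, ntxds="0,4096", nrxds="0,2048,0"):
    """Build the tunable lines for the given vmx unit numbers (e.g. 0 for vmx0)."""
    # The MSI blacklist tunable is global, so it is emitted only once.
    lines = ['hw.pci.honor_msi_blacklist="0"']
    for nic in nic_ids:
        lines.append(f'dev.vmx.{nic}.iflib.override_ntxds="{ntxds}"')
        lines.append(f'dev.vmx.{nic}.iflib.override_nrxds="{nrxds}"')
    return lines


# Hypothetical firewall with three vmx NICs: vmx0, vmx1 and vmx2.
for line in vmx_tunables(range(3)):
    print(line)
```

The unit numbers should match the vmx interfaces actually present on the VM.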

                        Yathus @stephenw10

                        Thanks @stephenw10, I made the changes. I just have to reboot my "primary" now (I'm waiting for the end of working hours, since the CARP switchover always causes a small interruption on the IPsec tunnels).

                          Yathus @Yathus

                          Reboot done, we'll see ;-)

                            Yathus @Yathus

                            I had my first crash during a vMotion of the secondary pfSense:

                            Fatal trap 12: page fault while in kernel mode
                            cpuid = 3; apic id = 03
                            fault virtual address	= 0x0
                            fault code		= supervisor read data, page not present
                            instruction pointer	= 0x20:0xffffffff80fb1c0a
                            stack pointer	        = 0x0:0xfffffe000859f7d0
                            frame pointer	        = 0x0:0xfffffe000859f920
                            code segment		= base 0x0, limit 0xfffff, type 0x1b
                            			= DPL 0, pres 1, long 1, def32 0, gran 1
                            processor eflags	= interrupt enabled, resume, IOPL = 0
                            current process		= 0 (if_io_tqg_3)
                            rdi: 0000000000000000 rsi: fffff800b4b9c07a rdx: 0000000000000000
                            rcx: 0000000005966257  r8: 00000000a1990c31  r9: 0000000023e34fa7
                            rax: 0000000000000002 rbx: fffff800b4b9c000 rbp: fffffe000859f920
                            r10: 0000000000003354 r11: fffff800b4b9c000 r12: fffffe000859f980
                            r13: 0000000000000000 r14: 0000000000000000 r15: fffff8000cce2608
                            trap number		= 12
                            panic: page fault
                            cpuid = 3
                            time = 1707492630
                            KDB: enter: panic
                            

                            I vMotioned the primary and there was no crash...
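To line a panic up with a vMotion event on the VMware side, the `time =` value in the panic message can be converted to a readable timestamp. A minimal sketch, assuming the value is Unix epoch seconds (which is what the FreeBSD panic message prints); the helper name `panic_time_utc` is hypothetical.

```python
from datetime import datetime, timezone


def panic_time_utc(epoch_seconds):
    """Format the 'time =' value from a panic message as a UTC timestamp."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


print(panic_time_utc(1706798799))  # first crash in this thread: 2024-02-01 14:46:39 UTC
print(panic_time_utc(1707492630))  # crash during vMotion: 2024-02-09 15:30:30 UTC
```

The result is in UTC, so it needs shifting to the local timezone before comparing against vCenter task times.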

                              stephenw10 Netgate Administrator

                              We need to see the backtrace to know more there.

                                Yathus @stephenw10

                                @stephenw10 I uploaded the files to your Nextcloud link.

                                  stephenw10 Netgate Administrator

                                  Backtrace:

                                  db:1:pfs> bt
                                  Tracing pid 0 tid 100014 td 0xfffffe000932a740
                                  kdb_enter() at kdb_enter+0x32/frame 0xfffffe000859f4b0
                                  vpanic() at vpanic+0x163/frame 0xfffffe000859f5e0
                                  panic() at panic+0x43/frame 0xfffffe000859f640
                                  trap_fatal() at trap_fatal+0x40c/frame 0xfffffe000859f6a0
                                  trap_pfault() at trap_pfault+0x4f/frame 0xfffffe000859f700
                                  calltrap() at calltrap+0x8/frame 0xfffffe000859f700
                                  --- trap 0xc, rip = 0xffffffff80fb1c0a, rsp = 0xfffffe000859f7d0, rbp = 0xfffffe000859f920 ---
                                  pf_test_state_tcp() at pf_test_state_tcp+0x125a/frame 0xfffffe000859f920
                                  pf_test() at pf_test+0x1353/frame 0xfffffe000859fac0
                                  pf_check_in() at pf_check_in+0x27/frame 0xfffffe000859fae0
                                  pfil_mbuf_in() at pfil_mbuf_in+0x38/frame 0xfffffe000859fb10
                                  ip_input() at ip_input+0x3ae/frame 0xfffffe000859fb70
                                  netisr_dispatch_src() at netisr_dispatch_src+0x22c/frame 0xfffffe000859fbc0
                                  ether_demux() at ether_demux+0x149/frame 0xfffffe000859fbf0
                                  ether_nh_input() at ether_nh_input+0x36e/frame 0xfffffe000859fc50
                                  netisr_dispatch_src() at netisr_dispatch_src+0xaf/frame 0xfffffe000859fca0
                                  ether_input() at ether_input+0x69/frame 0xfffffe000859fd00
                                  iflib_rxeof() at iflib_rxeof+0xc46/frame 0xfffffe000859fe00
                                  _task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe000859fe40
                                  gtaskqueue_run_locked() at gtaskqueue_run_locked+0x14e/frame 0xfffffe000859fec0
                                  gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc2/frame 0xfffffe000859fef0
                                  fork_exit() at fork_exit+0x7f/frame 0xfffffe000859ff30
                                  fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000859ff30
                                  --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
                                  

                                  So not the same issue.

                                  Seems similar to a few other bugs but not identical.
                                  The message buffer shows it failing back and forth between the nodes a few times. Was that expected?

                                    kprovost @stephenw10

                                    @stephenw10 said in Random crash on latest 23.09.1:

                                    That last backtrace decodes to /var/jenkins/workspace/pfSense-Plus-snapshots-23_09_1-main/sources/FreeBSD-src-plus-RELENG_23_09_1/sys/netpfil/pf/pf.c:5743, which is in pf_test_state_tcp(), where it applies NAT. It likely means that the state has a NULL key (pf_kstate->key[]).
                                    It's not clear to me how that'd happen. Speculatively, perhaps there's a race on state insertion, or there's something wrong in the pfsync state transfer. A full core dump might be helpful here, if this can be reproduced.

                                      Yathus @kprovost

                                      @kprovost How can I get a full core dump?

                                        stephenw10 Netgate Administrator

                                        You can just set the ddb file to dump rather than textdump, but you need enough SWAP space to dump to, and 1GB probably isn't enough.

                                        So you can reinstall with more swap space or add SWAP somehow. For example: https://forum.netgate.com/post/1127502

                                          Yathus @stephenw10

                                          @stephenw10 I added a second disk to the VM and now have 12 GB of swap.

                                          My config was:

                                          #script kdb.enter.default=textdump set; capture on; run pfs ; capture off; textdump dump; reset
                                          

                                          Replaced by:

                                          script kdb.enter.default=bt ; show registers ; dump ; reset
                                          

                                          I rebooted too.
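To double-check which kind of dump the active script line requests, the ddb file contents can be classified with a few lines of code. This is only a sketch under stated assumptions: the helper name `dump_mode` is hypothetical, it only looks at the script line format quoted in this thread, and it treats a line starting with `#` as disabled.

```python
def dump_mode(ddb_conf_text, script_name="kdb.enter.default"):
    """Return 'textdump', 'full' or None for the first active script line."""
    prefix = f"script {script_name}="
    for raw in ddb_conf_text.splitlines():
        line = raw.strip()
        # Skip commented-out lines and lines for other scripts.
        if line.startswith("#") or not line.startswith(prefix):
            continue
        commands = [cmd.strip() for cmd in line[len(prefix):].split(";")]
        if "textdump dump" in commands:
            return "textdump"
        if "dump" in commands:
            return "full"
    return None


OLD = "script kdb.enter.default=textdump set; capture on; run pfs ; capture off; textdump dump; reset"
NEW = "script kdb.enter.default=bt ; show registers ; dump ; reset"

print(dump_mode(OLD))                # textdump
print(dump_mode("#" + OLD + "\n" + NEW))  # full
```

This only inspects the text of the file; it does not confirm that the running kernel has loaded the new script.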

                                            stephenw10 Netgate Administrator

                                            Great. You can check that it's working as expected by forcing a panic and seeing if the kernel core dump is created.

                                            Running: sysctl debug.kdb.panic=1 will panic the system immediately and should create the core dump.
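After the test panic and reboot, the presence of a kernel core can be checked by listing the crash directory. A minimal sketch, assuming the stock FreeBSD savecore(8) layout where full dumps are saved as `vmcore.*` files under `/var/crash`; the helper name `find_kernel_dumps` is hypothetical and the path may differ on a given install.

```python
from pathlib import Path


def find_kernel_dumps(crash_dir="/var/crash"):
    """Return the sorted names of vmcore.* files found in crash_dir."""
    directory = Path(crash_dir)
    if not directory.is_dir():
        return []  # nothing saved, or a different crash directory is in use
    return sorted(p.name for p in directory.glob("vmcore.*"))


dumps = find_kernel_dumps()
print(dumps if dumps else "no vmcore files found")
```

An empty result after a forced panic suggests the dump was not written or not recovered at boot, which is worth sorting out before waiting for the real crash.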

                                            Steve

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.