Netgate Discussion Forum

    Crashdump Master FW rebooting

    General pfSense Questions

    • returntrip

      Since yesterday our master FW has been rebooting randomly. In the crash dump I see a kernel panic. Is anyone able to help me understand what is going on, please? (Attachment: cd.zip)

      • stephenw10 (Netgate Administrator)

        The backtrace shows:

        db:0:kdb.enter.default>  bt
        Tracing pid 0 tid 100031 td 0xfffff800055ee000
        kdb_enter() at kdb_enter+0x37/frame 0xfffffe0000442f30
        vpanic() at vpanic+0x197/frame 0xfffffe0000442f80
        panic() at panic+0x43/frame 0xfffffe0000442fe0
        trap_fatal() at trap_fatal+0x391/frame 0xfffffe0000443040
        trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0000443090
        trap() at trap+0x286/frame 0xfffffe00004431a0
        calltrap() at calltrap+0x8/frame 0xfffffe00004431a0
        --- trap 0xc, rip = 0xffffffff8109908f, rsp = 0xfffffe0000443270, rbp = 0xfffffe0000443380 ---
        pf_test_state_tcp() at pf_test_state_tcp+0x156f/frame 0xfffffe0000443380
        pf_test() at pf_test+0x1fc3/frame 0xfffffe00004435d0
        pf_check_in() at pf_check_in+0x1d/frame 0xfffffe00004435f0
        pfil_run_hooks() at pfil_run_hooks+0xa1/frame 0xfffffe0000443690
        ip_input() at ip_input+0x475/frame 0xfffffe0000443740
        netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe0000443790
        ether_demux() at ether_demux+0x16a/frame 0xfffffe00004437c0
        ether_nh_input() at ether_nh_input+0x330/frame 0xfffffe0000443820
        netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe0000443870
        ether_input() at ether_input+0x4b/frame 0xfffffe00004438a0
        vlan_input() at vlan_input+0x1f3/frame 0xfffffe00004438f0
        ether_demux() at ether_demux+0x153/frame 0xfffffe0000443920
        ether_nh_input() at ether_nh_input+0x330/frame 0xfffffe0000443980
        netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe00004439d0
        ether_input() at ether_input+0x4b/frame 0xfffffe0000443a00
        iflib_rxeof() at iflib_rxeof+0xae6/frame 0xfffffe0000443ae0
        _task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe0000443b20
        gtaskqueue_run_locked() at gtaskqueue_run_locked+0x121/frame 0xfffffe0000443b80
        gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xb6/frame 0xfffffe0000443bb0
        fork_exit() at fork_exit+0x7e/frame 0xfffffe0000443bf0
        fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000443bf0
        --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
        

        Panic is:

        Fatal trap 12: page fault while in kernel mode
        cpuid = 1; apic id = 02
        fault virtual address	= 0x0
        fault code		= supervisor read data, page not present
        instruction pointer	= 0x20:0xffffffff8109908f
        stack pointer	        = 0x28:0xfffffe0000443270
        frame pointer	        = 0x28:0xfffffe0000443380
        code segment		= base 0x0, limit 0xfffff, type 0x1b
        			= DPL 0, pres 1, long 1, def32 0, gran 1
        processor eflags	= interrupt enabled, resume, IOPL = 0
        current process		= 0 (if_io_tqg_1)
        trap number		= 12
        panic: page fault
        cpuid = 1
        time = 1639576714
        KDB: enter: panic
        

        Unfortunately neither is very specific.
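
        What can be read from them is where the fault happened and when. A minimal Python sketch of that, assuming the ddb text has been saved as a string: the helper names faulting_frame and panic_time_utc and the trimmed CRASH_TEXT sample are made up for illustration and are not part of any pfSense tooling.

```python
import re
from datetime import datetime, timezone

# Trimmed excerpt of the backtrace and panic text quoted above.
CRASH_TEXT = """\
calltrap() at calltrap+0x8/frame 0xfffffe00004431a0
--- trap 0xc, rip = 0xffffffff8109908f, rsp = 0xfffffe0000443270, rbp = 0xfffffe0000443380 ---
pf_test_state_tcp() at pf_test_state_tcp+0x156f/frame 0xfffffe0000443380
pf_test() at pf_test+0x1fc3/frame 0xfffffe00004435d0
fault virtual address = 0x0
time = 1639576714
"""

# Matches lines such as "pf_test() at pf_test+0x1fc3/frame 0x...".
FRAME_RE = re.compile(r"\s*(\w+)\(\) at \1\+(0x[0-9a-f]+)/frame")


def faulting_frame(text):
    """Return (function, offset) for the first frame after the first '--- trap' marker."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.lstrip().startswith("--- trap"):
            for candidate in lines[i + 1:]:
                match = FRAME_RE.match(candidate)
                if match:
                    return match.group(1), int(match.group(2), 16)
            break
    return None


def panic_time_utc(text):
    """Convert the 'time = <epoch seconds>' line to an aware UTC datetime."""
    match = re.search(r"^\s*time = (\d+)\s*$", text, re.MULTILINE)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


name, offset = faulting_frame(CRASH_TEXT)
print(f"{name}+{offset:#x}")                   # pf_test_state_tcp+0x156f
print(panic_time_utc(CRASH_TEXT).isoformat())  # 2021-12-15T13:58:34+00:00
```

        For this dump that gives pf_test_state_tcp+0x156f and a panic time of 2021-12-15 13:58:34 UTC. A read fault at virtual address 0x0 is consistent with a NULL pointer dereference in that function, though the dump alone does not say which pointer.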

        Did this just start happening? Seemingly spontaneously?

        Is the other node in this pair the same hardware? Is it also panicking?

        Steve

        • returntrip

          @stephenw10 This issue started happening after the sync started complaining and we started getting lots of messages like this:

          14:52:52 A communications error occurred while attempting to call XMLRPC method restore_config_section:

          We then compared the primary and secondary firewalls and noticed several config bits being out of sync (especially the users being duplicated on the backup FW and the OpenVPN settings on the same FW not being identical).

          My colleague then disabled users/group sync, deleted the items from the secondary FW and re-enabled the sync. All then seemed to sync correctly. But after a while the master FW started rebooting by itself.

          I did compare the configurations and found that some OpenVPN interfaces were slightly different between the two FWs. The master FW was showing a MAC address for the ovpns2/ovpns3 interfaces, whilst the secondary FW did not show one:

          Master FW ovpns3 interface:
          a2677ffb-4bed-468f-9e59-0ac02425c068-image.png

          Secondary FW ovpns2 interface:
          343951e6-7d7a-4060-b9f5-84a5a1eb1e9c-image.png

          Maybe the above is unrelated, but it was a strange difference, also because those interfaces on the master FW were showing a "Speed and Duplex" drop-down menu (not shown on the backup FW).

          What we have done now is this:

          1. Deleted those ovpns2/3 interfaces (TAP)
          2. Upgraded the FW network interfaces connecting to our core distribution switches to 10Gbps SFP interfaces

          We are monitoring to see if we get random reboots.

          • stephenw10 (Netgate Administrator)

            Mmm, the config should be identical, otherwise the state sync will be incorrect. That's unlikely to really be an issue on OpenVPN servers though, as only one can ever be active and clients have to reconnect at failover anyway.

            The presence of a MAC address indicates the server is running in TAP mode. They should both have one if both are TAP mode servers and both are running.

            Steve

            • returntrip

              @stephenw10 The backup server never had the MAC for the TAP interfaces even when it was MASTER. I dunno why...

              The other funny thing is this:

              We took another Dell R210 server and used the SSD with pfSense from the primary FW; the primary FW would still reboot. So it was not a HW issue, because it was a completely "new" server running the exact same pfSense install.

              The secondary FW has been running for 114 days straight (since the update to 2.5.2, I think), on the same server HW model.

              I think we are planning to buy Netgate HW next year (new budget). But it would be really great to get to the bottom of this.

              • stephenw10 (Netgate Administrator)

                Hmm, it looks almost identical to this: https://redmine.pfsense.org/issues/5473
                But that was fixed years ago.

                I assume you had been running 2.5.2 for a while before this started?

                Did you make any other sort of change that coincided with it starting?

                Steve

                • returntrip

                  @stephenw10 Yeah, we had been running 2.5.2 for months before this issue started.

                  Looking at the auto backup service on the primary FW, I cannot spot any substantial changes (I think I only see a change to the ovpns interface).

                  On the secondary FW, instead, I had added the ovpns interfaces and bridges for the respective LAN interfaces.

                  • stephenw10 (Netgate Administrator)

                    How often are you seeing the crashes?

                    You might try changing isrdispatch (the net.isr.dispatch tunable) from direct to deferred, since that's the code path that seems to be triggering it.
                    See: https://docs.netgate.com/pfsense/en/latest/hardware/tune.html#pppoe-with-multi-queue-nics

                    Steve

                    • returntrip

                      @stephenw10 The crashes were random: sometimes they happened after hours, sometimes after minutes. It crashed about 8 times within, say, 24h.

                      The crashes stopped after removing the TAP interfaces and upgrading the network cards to SFP, hence I did not follow your suggestion re isrdispatch.

                      • stephenw10 (Netgate Administrator)

                        Hmm, interesting.
                        Good to know for future reference. Thanks for following up.

                        Steve

                        • returntrip

                          @stephenw10 I might re-add the TAP interfaces after the xmas break and report back any issues.

                          • returntrip

                            @stephenw10 One question, perhaps a bit unrelated to this issue: is it normal for the firewall web UI to stall when adding VLANs and/or interfaces in an HA setup?

                            • stephenw10 (Netgate Administrator)

                              It depends what you mean by stall. Adding a new interface can trigger quite a few things, especially in an HA pair, but I wouldn't expect it to take very much longer than any other change.

                              Steve

                              • returntrip

                                @stephenw10 Next time I add an interface/VLAN I will time it and let you know.

                                Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.