Netgate Discussion Forum

    page fault kernel panics after 2.5.2 upgrade

    General pfSense Questions
    Tags: crash, kernel panic, 2.5.2
    25 Posts 4 Posters 4.4k Views
    • stephenw10 (Netgate Administrator)

      The backtrace on those is nearly identical and not very helpful:

      db:0:kdb.enter.default>  bt
      Tracing pid 31667 tid 100251 td 0xfffff8015da47000
      kdb_enter() at kdb_enter+0x37/frame 0xfffffe009c9644f0
      vpanic() at vpanic+0x197/frame 0xfffffe009c964540
      panic() at panic+0x43/frame 0xfffffe009c9645a0
      trap_fatal() at trap_fatal+0x391/frame 0xfffffe009c964600
      trap_pfault() at trap_pfault+0x4f/frame 0xfffffe009c964650
      trap() at trap+0x286/frame 0xfffffe009c964760
      calltrap() at calltrap+0x8/frame 0xfffffe009c964760
      --- trap 0xc, rip = 0xffffffff8137093e, rsp = 0xfffffe009c964830, rbp = 0xfffffe009c964900 ---
      pmap_enter() at pmap_enter+0x96e/frame 0xfffffe009c964900
      vm_fault() at vm_fault+0x1aa5/frame 0xfffffe009c964a50
      vm_fault_trap() at vm_fault_trap+0x60/frame 0xfffffe009c964a90
      trap_pfault() at trap_pfault+0x19c/frame 0xfffffe009c964ae0
      trap() at trap+0x410/frame 0xfffffe009c964bf0
      calltrap() at calltrap+0x8/frame 0xfffffe009c964bf0
      --- trap 0xc, rip = 0x80028681b, rsp = 0x7fffffffe9a0, rbp = 0x7fffffffe9a0 ---
      db:0:kdb.enter.default>  ps
      

      The first crash appears to be in 2.5.1. The second one is in 2.5.2 and nearly identical, so I don't think it's anything to do with the upgrade.

      The first thing I would do here is disable hardware features you don't need, like the sound card and FireWire. And the Atheros NIC? Looks like that is unused (down).

      Steve

      • doubledgedboard @stephenw10

        @stephenw10

        So what would be more useful for debugging the source?

        Based on the documentation, page faults are usually a kernel issue (because the system isn't going completely unresponsive and is still successfully saving a dump, etc.), and thus not a hardware issue.

        Are you saying it could still be a hardware issue?

        • stephenw10 (Netgate Administrator)

          It could be. But comparing a number of back-traces would easily confirm if it's not.

          • doubledgedboard @stephenw10

            @stephenw10

            Okay, thanks. I'll start digging into disabling the hardware you mentioned. The Atheros NIC is a management port and is usually disconnected.

            I attached my latest dump file for posterity

            info.2.tar
            textdump.2.tar

            • stephenw10 (Netgate Administrator) @doubledgedboard

              Mmm, that's even less helpful unfortunately:

              db:0:kdb.enter.default>  bt
              Tracing pid 98065 tid 100299 td 0xfffff800b9c90000
              kdb_enter() at kdb_enter+0x37/frame 0xfffffe009e840980
              vpanic() at vpanic+0x197/frame 0xfffffe009e8409d0
              panic() at panic+0x43/frame 0xfffffe009e840a30
              trap_fatal() at trap_fatal+0x391/frame 0xfffffe009e840a90
              trap_pfault() at trap_pfault+0x4f/frame 0xfffffe009e840ae0
              trap() at trap+0x410/frame 0xfffffe009e840bf0
              calltrap() at calltrap+0x8/frame 0xfffffe009e840bf0
              --- trap 0xc, rip = 0x8002867c0, rsp = 0x7fffffffe9a0, rbp = 0x7fffffffe9a0 ---
              db:0:kdb.enter.default>  ps
              

              And nothing significant in the msg buffer either.
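
Comparing backtraces by eye gets tedious once several dumps pile up. The following is a minimal sketch of doing it mechanically, assuming the `bt` output has been copied out of each textdump as plain text. The helper names `frames` and `common_prefix` are hypothetical, and the sample input is the top of the two traces quoted in this thread.

```python
import re

# Matches DDB frame lines such as
# "vm_fault() at vm_fault+0x1aa5/frame 0xfffffe009c964a50".
FRAME_RE = re.compile(r"^\s*(\w+)\(\) at (\w+)\+(0x[0-9a-f]+)/frame")


def frames(bt_text):
    """Return the (function, offset) pairs of a DDB 'bt' listing, in order."""
    out = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            out.append((m.group(2), m.group(3)))
    return out


def common_prefix(a, b):
    """Frames shared by two backtraces, counted from the panic downwards."""
    shared = []
    for fa, fb in zip(a, b):
        if fa != fb:
            break
        shared.append(fa)
    return shared


first = """\
kdb_enter() at kdb_enter+0x37/frame 0xfffffe009c9644f0
vpanic() at vpanic+0x197/frame 0xfffffe009c964540
panic() at panic+0x43/frame 0xfffffe009c9645a0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe009c964600
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe009c964650
trap() at trap+0x286/frame 0xfffffe009c964760
"""
second = """\
kdb_enter() at kdb_enter+0x37/frame 0xfffffe009e840980
vpanic() at vpanic+0x197/frame 0xfffffe009e8409d0
panic() at panic+0x43/frame 0xfffffe009e840a30
trap_fatal() at trap_fatal+0x391/frame 0xfffffe009e840a90
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe009e840ae0
trap() at trap+0x410/frame 0xfffffe009e840bf0
"""

shared = common_prefix(frames(first), frames(second))
print([name for name, _ in shared])
```

For these two samples the first five frames match and the traces diverge at `trap()` (offset 0x286 versus 0x410). The shared part is only the generic panic path, which fits the observation that these traces say little about the cause.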

              • doubledgedboard

                adding another crash dump for posterity -- still investigating

                textdump.3.tar

                • doubledgedboard

                  I did some general research into how to debug this, and one of the things I encountered suggested having the kernel symbols present (in /boot/kernel/) as well as the sources (in /usr/src), neither of which seems to be present in my pfSense install.

                  So I'm trying to figure out how to address that to see if I can improve the ability to debug or at least provide a useful backtrace from these dumps.

                  • stephenw10 (Netgate Administrator)

                    Well they do all look very similar at least. That implies it's probably not a hardware issue.

                    One thing you could try here is loading the debug kernel:
                    https://files00.netgate.com/packages/pfSense_v2_5_2_amd64-core/All/pfSense-kernel-debug-pfSense-2.5.2.r.20210613.1712.txz

                    But be aware almost no-one is running that. You may well see other issues. I would not recommend running that on a production firewall.

                    Steve

                    • doubledgedboard

                      Well I haven't had any more crashes for the past few days...

                      What did I do?

                      Unplugged the keyboard/mouse and monitor cable. I suspect one of those peripherals was leading to an occasional hiccup. I don't have a KVM, so I have a single KB/mouse for two server machines and manually swap it between them on occasion.

                      My theory is that I was usually leaving it connected to the other server, but changed and left it connected to the router server, and perhaps this was leading to crashes over time due to instability with the peripherals or the video driver.

                      Anyway, hopefully I don't post in this again, which would mean that was the problem and I solved it. Otherwise I'll post back if it wasn't 😆

                      • stephenw10 (Netgate Administrator)

                        Hmm, well that would be odd, but it could be one of those troubleshooting cases where the cause comes from some seemingly unrelated thing. Leaky microwave, vacuum cleaner in the UPS etc. 😉

                        • doubledgedboard @stephenw10

                          @stephenw10

                          Whelp it crashed again, I guess it was wishful thinking after all. I got lucky with a few days without crashes 😢

                          • stephenw10 (Netgate Administrator)

                            Same backtrace?

                            • doubledgedboard @stephenw10

                              @stephenw10 Yeah probably, attached

                              db:0:kdb.enter.default>  bt
                              Tracing pid 63867 tid 100290 td 0xfffff8009646f740
                              kdb_enter() at kdb_enter+0x37/frame 0xfffffe009e822980
                              vpanic() at vpanic+0x197/frame 0xfffffe009e8229d0
                              panic() at panic+0x43/frame 0xfffffe009e822a30
                              trap_fatal() at trap_fatal+0x391/frame 0xfffffe009e822a90
                              trap_pfault() at trap_pfault+0x4f/frame 0xfffffe009e822ae0
                              trap() at trap+0x410/frame 0xfffffe009e822bf0
                              calltrap() at calltrap+0x8/frame 0xfffffe009e822bf0
                              --- trap 0xc, rip = 0x8002867c0, rsp = 0x7fffffffe9a0, rbp = 0x7fffffffe9a0 ---
                              

                              textdump.4.tar

                              • stephenw10 (Netgate Administrator)

                                Mmm, still nothing leading up to the trap and nothing shown on the console.
                                Hard to say what that might be with nothing to go on really. 😕
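
One detail that can still be read from these short traces is where the faulting instruction lived. The sketch below classifies the `rip` of each trap frame as a kernel or user address. It assumes the usual amd64 layout, with kernel addresses in the upper canonical half, and the helper name `classify_traps` is hypothetical. Trap 0xc (12) is the page fault trap number.

```python
import re

# Assumption: the usual amd64 split, where kernel virtual addresses sit in
# the upper canonical half and user addresses in the lower half.
KERNEL_HALF_START = 0xFFFF800000000000

# Matches DDB trap frame lines such as
# "--- trap 0xc, rip = 0x8002867c0, rsp = ..., rbp = ... ---".
TRAP_RE = re.compile(r"--- trap (0x[0-9a-f]+), rip = (0x[0-9a-f]+)")


def classify_traps(bt_text):
    """Return (trap_number, rip, 'kernel' or 'user') for each trap frame."""
    result = []
    for m in TRAP_RE.finditer(bt_text):
        trap_no = int(m.group(1), 16)
        rip = int(m.group(2), 16)
        where = "kernel" if rip >= KERNEL_HALF_START else "user"
        result.append((trap_no, rip, where))
    return result


# The two trap frames from the first backtrace in this thread.
sample = """\
--- trap 0xc, rip = 0xffffffff8137093e, rsp = 0xfffffe009c964830, rbp = 0xfffffe009c964900 ---
--- trap 0xc, rip = 0x80028681b, rsp = 0x7fffffffe9a0, rbp = 0x7fffffffe9a0 ---
"""

for trap_no, rip, where in classify_traps(sample):
    print(f"trap {trap_no:#x} at rip {rip:#x}: {where}")
```

Under that assumption, the later traces in this thread contain only a user-half `rip`, so the page fault was raised by a userland instruction and the backtrace has no kernel frames below the trap to inspect.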

                                • doubledgedboard

                                  I may have solved the issue, although I'm probably tempting fate by claiming it so soon.

                                  The issue persisted for some time. At first it was very periodic, approximately three days between panics, which is why I wasn't completely sold on a hardware problem yet.

                                  I tried seeing if restarting "ahead of schedule" would give me three extra days (from the last normal restart), but it still panicked only a day later.

                                  Eventually it naturally restarted sooner than three days.

                                  Last night it started restarting every few minutes, and then suddenly it was restarting before it could even finish booting.

                                  Aha!

                                  Classic symptoms of a power supply issue...

                                  I replaced the PSU (circa 2004) and it's been online ever since. I'll check back in a week and if it still hasn't panicked then I'll call that the issue.

                                  • doubledgedboard @doubledgedboard

                                    I just can't win...

                                    It rebooted last night. It wasn't the power supply.

                                    • doubledgedboard @doubledgedboard

                                      I'm about to hit 7 days uptime so I think I finally found the issue.

                                      I started pulling memory sticks out one by one and waiting for it to restart.

                                      I suspect I have at least one bad stick of ram.

                                      Posting this for posterity for anyone else who runs into this type of issue.

                                      • MrPete @doubledgedboard

                                        @doubledgedboard For future browsers: it's always a good idea to do an intense RAM test.

                                        FWIW, the folks at memtest86 dot com have recently done massive updates / upgrades to the (free) RAM tester.

                                        I recently had a situation where RAM passed a few-years-old version of memtest... but with the latest version, it immediately was detected as bad.

                                        I strongly encourage everybody to grab a current version :)

                                        • doubledgedboard @MrPete

                                          @mrpete Oh for sure, I've been using memtest and variants for years.

                                          The issue here is that the system required near 24/7 uptime and I didn't have the time to take it down to run 8+ hour long memory tests, so I had to do what I could while maintaining uptime.

                                          (and for posterity, I'm back up to 88 days of uptime now 😄 )

                                          • Schoolofhardknocks

                                            I had something similar happen to me, and it happened during the boot sequence, so I had to reinstall pfSense altogether because I couldn't finish booting or restore a recent configuration. But the dump was more...

                                            Tracing pid 431 tid 100111 td 0xfffff800055f2000
                                            kdb_enter() at kdb_enter+0x37/frame 0xfffffe00005a4620
                                            vpanic() at vpanic+0x197/frame 0xfffffe00005a4670
                                            panic() at panic+0x43/frame 0xfffffe00005a46d0
                                            ffs_valloc() at ffs_valloc+0x8f3/frame 0xfffffe00005a4760
                                            ufs_makeinode() at ufs_makeinode+0xa3/frame 0xfffffe00005a48f0
                                            ufs_create() at ufs_create+0x34/frame 0xfffffe00005a4910
                                            VOP_CREATE_APV() at VOP_CREATE_APV+0x75/frame 0xfffffe00005a4940
                                            vn_open_cred() at vn_open_cred+0x2d9/frame 0xfffffe00005a4a90
                                            kern_openat() at kern_openat+0x213/frame 0xfffffe00005a4c00
                                            amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe00005a4d30
                                            fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00005a4d30
                                            --- syscall (5, FreeBSD ELF64, sys_open), rip = 0x800b34e0a, rsp = 0x7fffffffd168, rbp = 0x7fffffffd1a0 ---

                                            Then it proceeded with...

                                            Tracing command sleep pid 96166 tid 100128 td 0xfffff800056c7740
                                            sched_switch() at sched_switch+0x630/frame 0xfffffe00005f9a00
                                            mi_switch() at mi_switch+0xd4/frame 0xfffffe00005f9a30
                                            sleepq_catch_signals() at sleepq_catch_signals+0x403/frame 0xfffffe00005f9a80
                                            sleepq_timedwait_sig() at sleepq_timedwait_sig+0x14/frame 0xfffffe00005f9ac0
                                            _sleep() at _sleep+0x1b3/frame 0xfffffe00005f9b40
                                            kern_clock_nanosleep() at kern_clock_nanosleep+0x1d2/frame 0xfffffe00005f9bc0
                                            sys_nanosleep() at sys_nanosleep+0x3b/frame 0xfffffe00005f9c00
                                            amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe00005f9d30
                                            fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00005f9d30
                                            --- syscall (240, FreeBSD ELF64, sys_nanosleep), rip = 0x80038b6aa, rsp = 0x7fffffffec18, rbp = 0x7fffffffec60 ---

                                            It repeated these same sleep system calls for a long time and then forced a reboot.
                                            After reinstalling everything, the crash was occurring within the pfSense webConfigurator. It's still happening, not sure why, but it's not rebooting my box anymore... at least.
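
A dump with many "Tracing ..." sections can be split so that the panicking thread stands out from the rest. The sketch below is one way to do that, under the assumption that the repeated sleep sections are per-process traces printed for every process at dump time and not repeated events. The helper name `split_traces` is hypothetical, and the sample is an excerpt of the two traces quoted above.

```python
import re

# Matches section headers such as
# "Tracing pid 431 tid 100111 td 0x..." and
# "Tracing command sleep pid 96166 tid 100128 td 0x...".
TRACING_RE = re.compile(
    r"^\s*Tracing (?:command (\S+) )?pid (\d+) tid (\d+)", re.M
)


def split_traces(ddb_text):
    """Split DDB output into one record per 'Tracing ...' header."""
    headers = list(TRACING_RE.finditer(ddb_text))
    records = []
    for i, m in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(ddb_text)
        body = ddb_text[m.end():end]
        records.append({
            "command": m.group(1),  # None when the header names no command
            "pid": int(m.group(2)),
            "tid": int(m.group(3)),
            # A thread that went through panic() has it on its stack.
            "panicked": "panic() at" in body,
        })
    return records


sample = """\
Tracing pid 431 tid 100111 td 0xfffff800055f2000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00005a4620
vpanic() at vpanic+0x197/frame 0xfffffe00005a4670
panic() at panic+0x43/frame 0xfffffe00005a46d0
ffs_valloc() at ffs_valloc+0x8f3/frame 0xfffffe00005a4760
Tracing command sleep pid 96166 tid 100128 td 0xfffff800056c7740
sched_switch() at sched_switch+0x630/frame 0xfffffe00005f9a00
sys_nanosleep() at sys_nanosleep+0x3b/frame 0xfffffe00005f9c00
"""

for rec in split_traces(sample):
    print(rec)
```

In this sample only pid 431 is flagged, and its stack reaches `panic()` from `ffs_valloc()`, which is filesystem code. The sleep section has no panic frame, so it is probably just an idle process captured in the dump.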

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.