Netgate Discussion Forum
pfSense+ VM encountered page fault -- submit crash dump to Netgate?

Virtualization | 12 Posts | 3 Posters | 641 Views
  • xpxp2002 | Mar 29, 2022, 12:03 PM

    I was just watching TV (streaming) when the internet appeared to go offline. After about 2 minutes I got an alert email that pfSense had just finished booting. There is a crash dump indicating "Fatal trap 12: page fault while in kernel mode".

    Is Netgate interested in collecting the crash dump data? I did not see anywhere specific to submit or upload this. But it is extremely unusual, in my experience, for pfSense to crash this way.

    • stephenw10 (Netgate Administrator) | Mar 29, 2022, 6:53 PM

      The backtrace from the start of the panic dump in the ddb file is usually enough to recognize whether it's something known. For example:

      db:0:kdb.enter.default>  bt
      Tracing pid 11 tid 100006 td 0xfffff80004341740
      kdb_enter() at kdb_enter+0x37/frame 0xfffffe000249fe50
      vpanic() at vpanic+0x197/frame 0xfffffe000249fea0
      panic() at panic+0x43/frame 0xfffffe000249ff00
      mca_intr() at mca_intr+0x9b/frame 0xfffffe000249ff20
      mchk_calltrap() at mchk_calltrap+0x8/frame 0xfffffe000249ff20
      --- trap 0x1c, rip = 0xffffffff815285f8, rsp = 0xfffffe003fb59970, rbp = 0xfffffe003fb59970 ---
      acpi_cpu_idle_mwait() at acpi_cpu_idle_mwait+0x68/frame 0xfffffe003fb59970
      acpi_cpu_idle() at acpi_cpu_idle+0x186/frame 0xfffffe003fb599b0
      cpu_idle_acpi() at cpu_idle_acpi+0x3e/frame 0xfffffe003fb599d0
      cpu_idle() at cpu_idle+0x9f/frame 0xfffffe003fb599f0
      sched_idletd() at sched_idletd+0x326/frame 0xfffffe003fb59ab0
      fork_exit() at fork_exit+0x7e/frame 0xfffffe003fb59af0
      fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe003fb59af0
      --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
      db:0:kdb.enter.default>  ps
      

      The last output from the message buffer can also help.
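      For reference, a rough sketch of pulling both of those out of the crash dump from a shell on the firewall (assuming textdumps are enabled and savecore wrote textdump.tar.0 under /var/crash; the ddb.txt and msgbuf.txt member names come from textdump(4)):

      # List whatever savecore captured after the panic (the index may differ):
      ls /var/crash
      # Backtrace from the ddb session, as quoted above:
      tar -xOf /var/crash/textdump.tar.0 ddb.txt | head -n 40
      # Tail of the kernel message buffer at the time of the panic:
      tar -xOf /var/crash/textdump.tar.0 msgbuf.txt | tail -n 40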

      Steve

      • xpxp2002 @stephenw10 | Mar 29, 2022, 6:56 PM (last edited by stephenw10, Mar 29, 2022, 8:11 PM)

        db:0:kdb.enter.default>  bt
        Tracing pid 0 tid 100117 td 0xfffff80005a00000
        kdb_enter() at kdb_enter+0x37/frame 0xfffffe002ba27600
        vpanic() at vpanic+0x197/frame 0xfffffe002ba27650
        panic() at panic+0x43/frame 0xfffffe002ba276b0
        trap_fatal() at trap_fatal+0x391/frame 0xfffffe002ba27710
        trap_pfault() at trap_pfault+0x4f/frame 0xfffffe002ba27760
        trap() at trap+0x286/frame 0xfffffe002ba27870
        calltrap() at calltrap+0x8/frame 0xfffffe002ba27870
        --- trap 0xc, rip = 0xffffffff81494527, rsp = 0xfffffe002ba27940, rbp = 0xfffffe002ba27980 ---
        wait_for_response() at wait_for_response+0x67/frame 0xfffffe002ba27980
        vmbus_pcib_attach() at vmbus_pcib_attach+0x38f/frame 0xfffffe002ba27a70
        device_attach() at device_attach+0x3dd/frame 0xfffffe002ba27ac0
        device_probe_and_attach() at device_probe_and_attach+0x41/frame 0xfffffe002ba27af0
        vmbus_add_child() at vmbus_add_child+0x79/frame 0xfffffe002ba27b20
        taskqueue_run_locked() at taskqueue_run_locked+0x144/frame 0xfffffe002ba27b80
        taskqueue_thread_loop() at taskqueue_thread_loop+0xb6/frame 0xfffffe002ba27bb0
        fork_exit() at fork_exit+0x7e/frame 0xfffffe002ba27bf0
        fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe002ba27bf0
        
        Fatal trap 12: page fault while in kernel mode
        cpuid = 2; apic id = 02
        fault virtual address	= 0x20
        fault code		= supervisor write data, page not present
        instruction pointer	= 0x20:0xffffffff81494527
        stack pointer	        = 0x28:0xfffffe002ba27940
        frame pointer	        = 0x28:0xfffffe002ba27980
        code segment		= base 0x0, limit 0xfffff, type 0x1b
        			= DPL 0, pres 1, long 1, def32 0, gran 1
        processor eflags	= interrupt enabled, resume, IOPL = 0
        current process		= 0 (vmbusdev)
        trap number		= 12
        panic: page fault
        cpuid = 2
        time = 1648555018
        KDB: enter: panic
        
        • stephenw10 (Netgate Administrator) | Mar 29, 2022, 7:02 PM

          Hmm, nope, not familiar with that.
          What hypervisor is it?
          Anything logged before it crashed?

          The backtrace makes it look like it's seeing new hardware, which seems odd.

          Steve

          • xpxp2002 @stephenw10 | Mar 29, 2022, 7:18 PM (last edited by stephenw10, Mar 29, 2022, 8:12 PM)

            @stephenw10 Hyper-V on Windows Server 2019. There was no change in hardware.

            However, the first relevant event I see in the log is that the vNICs associated with the physical NIC detached. What's odd is that I don't see other VMs attached to the same VM switch reflecting a loss of link at that time, so it doesn't suggest to me that the physical NIC itself went through a reset or loss of connectivity. Here's the last system log entry prior to the string of events showing the loss of NICs, followed by those events:

            2022-03-29 07:50:00.046839-04:00	sshguard	65023	Now monitoring attacks.
            2022-03-29 07:56:48.362702-04:00	kernel	-	mlx4_core0: detached
            2022-03-29 07:56:48.362792-04:00	kernel	-	hn0: got notify, nvs type 128
            2022-03-29 07:56:49.403129-04:00	kernel	-	mlx4_core1: detached
            2022-03-29 07:56:49.403265-04:00	kernel	-	hn1: got notify, nvs type 128
            2022-03-29 07:56:50.873181-04:00	kernel	-	mlx4_core2: detached
            2022-03-29 07:56:50.873305-04:00	kernel	-	hn2: got notify, nvs type 128
            2022-03-29 07:56:52.563102-04:00	kernel	-	mlx4_core3: detached
            2022-03-29 07:56:52.563222-04:00	kernel	-	hn3: got notify, nvs type 128
            2022-03-29 07:56:54.055682-04:00	kernel	-	mlx4_core4: detached
            2022-03-29 07:56:54.055853-04:00	kernel	-	hn4: got notify, nvs type 128
            2022-03-29 07:56:55.353642-04:00	kernel	-	mlx4_core5: detached
            2022-03-29 07:56:55.563273-04:00	kernel	-	hn5: got notify, nvs type 128
            2022-03-29 07:56:57.048901-04:00	kernel	-	mlx4_core6: detached
            2022-03-29 07:56:57.049002-04:00	kernel	-	pci0: detached
            2022-03-29 07:56:57.049058-04:00	kernel	-	hn6: got notify, nvs type 128
            2022-03-29 07:56:57.049097-04:00	kernel	-	pcib0: detached
            2022-03-29 07:56:57.049125-04:00	kernel	-	pci1: detached
            2022-03-29 07:56:57.049229-04:00	kernel	-	pcib1: detached
            2022-03-29 07:56:57.049297-04:00	kernel	-	pci2: detached
            2022-03-29 07:56:57.049354-04:00	kernel	-	pcib2: detached
            2022-03-29 07:56:57.049413-04:00	kernel	-	pci3: detached
            2022-03-29 07:56:57.049487-04:00	kernel	-	pcib3: detached
            2022-03-29 07:56:57.049570-04:00	kernel	-	pci4: detached
            2022-03-29 07:56:57.049631-04:00	kernel	-	pcib4: detached
            2022-03-29 07:56:57.049693-04:00	kernel	-	pci5: detached
            2022-03-29 07:56:57.049762-04:00	kernel	-	pcib5: detached
            2022-03-29 07:56:57.049803-04:00	kernel	-	pci6: detached
            2022-03-29 07:56:57.049890-04:00	kernel	-	pcib6: detached
            2022-03-29 07:56:58.504205-04:00	kernel	-	hn0: got notify, nvs type 128
            2022-03-29 07:56:58.504319-04:00	kernel	-	hn1: got notify, nvs type 128
            2022-03-29 07:56:58.504385-04:00	kernel	-	hn2: got notify, nvs type 128
            2022-03-29 07:56:58.504457-04:00	kernel	-	hn3: got notify, nvs type 128
            2022-03-29 07:56:58.504517-04:00	kernel	-	hn4: got notify, nvs type 128
            2022-03-29 07:56:58.504575-04:00	kernel	-	hn5: got notify, nvs type 128
            2022-03-29 07:56:58.504644-04:00	kernel	-	hn6: got notify, nvs type 128
            2022-03-29 07:56:58.504899-04:00	kernel	-	pcib0: <Hyper-V PCI Express Pass Through> on vmbus0
            2022-03-29 07:56:58.504985-04:00	kernel	-	pci0: <PCI bus> on pcib0
            
            • xpxp2002 | Mar 29, 2022, 7:33 PM

              @stephenw10 As I'm looking through logs, it does look like there was at least a brief moment of loss of connectivity between the hypervisor and the switch. Still not clear why, though I'm leaning toward NIC or perhaps SFP module.

              Doesn't explain why pfSense ended up panicking, but making progress toward an initial cause.

              • jimp (Rebel Alliance Developer, Netgate) @xpxp2002 | Mar 29, 2022, 7:36 PM

                @xpxp2002 said in pfSense+ VM encountered page fault -- submit crash dump to Netgate?:

                @stephenw10 As I'm looking through logs, it does look like there was at least a brief moment of loss of connectivity between the hypervisor and the switch. Still not clear why, though I'm leaning toward NIC or perhaps SFP module.

                Doesn't explain why pfSense ended up panicking, but making progress toward an initial cause.

                It looks like the hypervisor tried to hotplug the actual PCI device, which isn't supported and has been known to flake out like that in the past for other similar events. It's more common on virtual hardware than real, but either way it's unsupported. The only safe way to add or remove hardware is while the VM is powered off. I wouldn't expect a hypervisor to remove the device from the VM like that, though; that may be something you can configure in Hyper-V.
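                As a rough illustration only (run from a pfSense shell, and assuming the Mellanox functions really are presented as ordinary PCI devices the way the log above suggests), you can see how the passed-through hardware hangs off the Hyper-V pcibN buses while the hnX interfaces are synthetic vmbus NICs:

                # Passed-through PCI functions as the guest sees them:
                pciconf -lv | grep -A3 mlx4
                # Rough view of the device tree: vmbus0 -> pcibN -> pciN -> mlx4_coreN, vs. vmbus0 -> hnN:
                devinfo | grep -E 'vmbus|pcib|mlx4|hn[0-9]'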


                • xpxp2002 @jimp | Mar 29, 2022, 7:39 PM

                  @jimp Thanks. That definitely explains the panic, then. Now I just need to figure out why the hypervisor would attempt to hotplug the NIC during normal operation. It had been up for at least 2 weeks, so I'm not sure what event triggered that today.

                  • jimp (Rebel Alliance Developer, Netgate) | Mar 29, 2022, 7:41 PM

                    What you said earlier about a potential problem with the card or SFP module is entirely possible, or maybe it's something Hyper-V is doing unexpectedly on link loss. It wouldn't be too hard to test those theories: snapshot the VM first, then see what happens when you unplug/replug the cable, then try the SFP module. See what Hyper-V does and whether the VM panics again.
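                    On the pfSense side, one way to watch for the same detach sequence live while running those tests (a sketch; this assumes the plain-text system log at /var/log/system.log used by current versions):

                    # Follow the system log and flag the pattern that preceded the panic:
                    tail -f /var/log/system.log | grep -E 'detached|pcib[0-9]|mlx4_core|nvs type'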


                    • xpxp2002 @jimp | Mar 29, 2022, 8:08 PM

                      @jimp I will need to run through those steps after hours when I can take the outage.

                      I did find an interesting chain of events that may be related. The pfSense VM is replicated to another hypervisor, effectively in a cold standby configuration. That hypervisor with the replica had been undergoing a backup job, during which Hyper-V will not merge in new replica differences. A series of log entries on that hypervisor at the same time as the start of the NIC events on the primary/active VM suggests that the queue of replica updates reached a max threshold and required a resync of the VM from the primary.

                      'pfSense' requires resynchronization because it has reached the threshold for the size of accumulated log files.
                      

                      I'm removing this VM from the backup job on the replica hypervisor. I'm not sure why the resync would force the detach of the NIC, but that's also something I can try to research on the Hyper-V side or try to reproduce after hours.

                      I'm thinking this resync may be what triggered the hotplug event on the hypervisor that had the running VM, given the timing of the logs. I maintain backups from within pfSense, so I don't really need the VM-level backup except to make it easy to lay out the virtual hardware config (vNIC MAC addresses and whatnot). I'm not concerned about excluding this VM from backup elsewhere.

                      • stephenw10 (Netgate Administrator) | Mar 29, 2022, 8:15 PM

                        Mmm, I note that it's not just the NICs that were detached; the actual PCI buses were detached too. That should never happen, and it could never happen on real hardware.

                        Steve

                        • stephenw10 moved this topic from General pfSense Questions on Mar 29, 2022, 8:16 PM
                        • xpxp2002 @stephenw10 | Mar 29, 2022, 8:19 PM

                          @stephenw10 Oh wow. I misunderstood; I assumed this was just the NIC detaching, because detaching the whole bus shouldn't be possible, as you said.

                          I need to dig into the Hyper-V side of this to understand that. That's absolutely baffling behavior. But hey, Microsoft...
