Netgate Discussion Forum

pfSense+ VM encountered page fault -- submit crash dump to Netgate?

Virtualization
12 Posts 3 Posters 655 Views
  • X
    xpxp2002 @stephenw10
    last edited by stephenw10 Mar 29, 2022, 8:11 PM Mar 29, 2022, 6:56 PM

    db:0:kdb.enter.default>  bt
    Tracing pid 0 tid 100117 td 0xfffff80005a00000
    kdb_enter() at kdb_enter+0x37/frame 0xfffffe002ba27600
    vpanic() at vpanic+0x197/frame 0xfffffe002ba27650
    panic() at panic+0x43/frame 0xfffffe002ba276b0
    trap_fatal() at trap_fatal+0x391/frame 0xfffffe002ba27710
    trap_pfault() at trap_pfault+0x4f/frame 0xfffffe002ba27760
    trap() at trap+0x286/frame 0xfffffe002ba27870
    calltrap() at calltrap+0x8/frame 0xfffffe002ba27870
    --- trap 0xc, rip = 0xffffffff81494527, rsp = 0xfffffe002ba27940, rbp = 0xfffffe002ba27980 ---
    wait_for_response() at wait_for_response+0x67/frame 0xfffffe002ba27980
    vmbus_pcib_attach() at vmbus_pcib_attach+0x38f/frame 0xfffffe002ba27a70
    device_attach() at device_attach+0x3dd/frame 0xfffffe002ba27ac0
    device_probe_and_attach() at device_probe_and_attach+0x41/frame 0xfffffe002ba27af0
    vmbus_add_child() at vmbus_add_child+0x79/frame 0xfffffe002ba27b20
    taskqueue_run_locked() at taskqueue_run_locked+0x144/frame 0xfffffe002ba27b80
    taskqueue_thread_loop() at taskqueue_thread_loop+0xb6/frame 0xfffffe002ba27bb0
    fork_exit() at fork_exit+0x7e/frame 0xfffffe002ba27bf0
    fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe002ba27bf0
    
    Fatal trap 12: page fault while in kernel mode
    cpuid = 2; apic id = 02
    fault virtual address	= 0x20
    fault code		= supervisor write data, page not present
    instruction pointer	= 0x20:0xffffffff81494527
    stack pointer	        = 0x28:0xfffffe002ba27940
    frame pointer	        = 0x28:0xfffffe002ba27980
    code segment		= base 0x0, limit 0xfffff, type 0x1b
    			= DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags	= interrupt enabled, resume, IOPL = 0
    current process		= 0 (vmbusdev)
    trap number		= 12
    panic: page fault
    cpuid = 2
    time = 1648555018
    KDB: enter: panic
    
    • S
      stephenw10 Netgate Administrator
      last edited by Mar 29, 2022, 7:02 PM

      Hmm, nope, not familiar with that.
      What hypervisor is it?
      Anything logged before it crashed?

      The backtrace makes it look like it's seeing new hardware, which seems odd.

      Steve

      • X
        xpxp2002 @stephenw10
        last edited by stephenw10 Mar 29, 2022, 8:12 PM Mar 29, 2022, 7:18 PM

        @stephenw10 Hyper-V on Windows Server 2019. There was no change in hardware.

        However, the first relevant event I see in the log is that the vNICs associated with the physical NIC detached. What's odd is that I don't see other VMs attached to the same VM switch reflecting a loss of link at that time, so it doesn't suggest to me that the physical NIC itself went through a reset or loss of connectivity. Here's the last system log entry prior to the string of events showing the loss of the NICs, followed by those events.

        2022-03-29 07:50:00.046839-04:00	sshguard	65023	Now monitoring attacks.
        2022-03-29 07:56:48.362702-04:00	kernel	-	mlx4_core0: detached
        2022-03-29 07:56:48.362792-04:00	kernel	-	hn0: got notify, nvs type 128
        2022-03-29 07:56:49.403129-04:00	kernel	-	mlx4_core1: detached
        2022-03-29 07:56:49.403265-04:00	kernel	-	hn1: got notify, nvs type 128
        2022-03-29 07:56:50.873181-04:00	kernel	-	mlx4_core2: detached
        2022-03-29 07:56:50.873305-04:00	kernel	-	hn2: got notify, nvs type 128
        2022-03-29 07:56:52.563102-04:00	kernel	-	mlx4_core3: detached
        2022-03-29 07:56:52.563222-04:00	kernel	-	hn3: got notify, nvs type 128
        2022-03-29 07:56:54.055682-04:00	kernel	-	mlx4_core4: detached
        2022-03-29 07:56:54.055853-04:00	kernel	-	hn4: got notify, nvs type 128
        2022-03-29 07:56:55.353642-04:00	kernel	-	mlx4_core5: detached
        2022-03-29 07:56:55.563273-04:00	kernel	-	hn5: got notify, nvs type 128
        2022-03-29 07:56:57.048901-04:00	kernel	-	mlx4_core6: detached
        2022-03-29 07:56:57.049002-04:00	kernel	-	pci0: detached
        2022-03-29 07:56:57.049058-04:00	kernel	-	hn6: got notify, nvs type 128
        2022-03-29 07:56:57.049097-04:00	kernel	-	pcib0: detached
        2022-03-29 07:56:57.049125-04:00	kernel	-	pci1: detached
        2022-03-29 07:56:57.049229-04:00	kernel	-	pcib1: detached
        2022-03-29 07:56:57.049297-04:00	kernel	-	pci2: detached
        2022-03-29 07:56:57.049354-04:00	kernel	-	pcib2: detached
        2022-03-29 07:56:57.049413-04:00	kernel	-	pci3: detached
        2022-03-29 07:56:57.049487-04:00	kernel	-	pcib3: detached
        2022-03-29 07:56:57.049570-04:00	kernel	-	pci4: detached
        2022-03-29 07:56:57.049631-04:00	kernel	-	pcib4: detached
        2022-03-29 07:56:57.049693-04:00	kernel	-	pci5: detached
        2022-03-29 07:56:57.049762-04:00	kernel	-	pcib5: detached
        2022-03-29 07:56:57.049803-04:00	kernel	-	pci6: detached
        2022-03-29 07:56:57.049890-04:00	kernel	-	pcib6: detached
        2022-03-29 07:56:58.504205-04:00	kernel	-	hn0: got notify, nvs type 128
        2022-03-29 07:56:58.504319-04:00	kernel	-	hn1: got notify, nvs type 128
        2022-03-29 07:56:58.504385-04:00	kernel	-	hn2: got notify, nvs type 128
        2022-03-29 07:56:58.504457-04:00	kernel	-	hn3: got notify, nvs type 128
        2022-03-29 07:56:58.504517-04:00	kernel	-	hn4: got notify, nvs type 128
        2022-03-29 07:56:58.504575-04:00	kernel	-	hn5: got notify, nvs type 128
        2022-03-29 07:56:58.504644-04:00	kernel	-	hn6: got notify, nvs type 128
        2022-03-29 07:56:58.504899-04:00	kernel	-	pcib0: <Hyper-V PCI Express Pass Through> on vmbus0
        2022-03-29 07:56:58.504985-04:00	kernel	-	pci0: <PCI bus> on pcib0
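
        For reference, a rough way to look at the host side of this is to pull the Hyper-V admin event logs for the same window. A minimal PowerShell sketch, run on the Hyper-V host; the log names and time window here are assumptions, not something taken from this host:

            # Hyper-V VMMS/worker admin events around the time the guest saw the detach
            $start = Get-Date '2022-03-29 07:50:00'
            $end   = Get-Date '2022-03-29 08:00:00'
            Get-WinEvent -FilterHashtable @{
                LogName   = 'Microsoft-Windows-Hyper-V-VMMS-Admin', 'Microsoft-Windows-Hyper-V-Worker-Admin'
                StartTime = $start
                EndTime   = $end
            } | Select-Object TimeCreated, Id, LevelDisplayName, Message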
        
        • X
          xpxp2002
          last edited by Mar 29, 2022, 7:33 PM

          @stephenw10 As I'm looking through logs, it does look like there was at least a brief moment of loss of connectivity between the hypervisor and the switch. Still not clear why, though I'm leaning toward NIC or perhaps SFP module.

          Doesn't explain why pfSense ended up panicking, but making progress toward an initial cause.

          • J
            jimp Rebel Alliance Developer Netgate @xpxp2002
            last edited by Mar 29, 2022, 7:36 PM

            @xpxp2002 said in pfSense+ VM encountered page fault -- submit crash dump to Netgate?:

            @stephenw10 As I'm looking through logs, it does look like there was at least a brief moment of loss of connectivity between the hypervisor and the switch. Still not clear why, though I'm leaning toward NIC or perhaps SFP module.

            Doesn't explain why pfSense ended up panicking, but making progress toward an initial cause.

            It looks like the hypervisor tried to hotplug the actual PCI device, which isn't supported and has been known to flake out like that in the past for other similar events. It's more common on virtual hardware than real, but either way it's unsupported. The only safe way to add or remove hardware is while the VM is powered off. I wouldn't expect a hypervisor to remove the device from the VM like that, though; that may be something you can configure in Hyper-V.
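
            The mlx4_core devices in the guest log above appear to be Mellanox SR-IOV virtual functions handed to the VM over the Hyper-V PCI pass-through bus (the vmbus_pcib driver in the panic backtrace), which is why a host-side NIC event can look like PCI hot-plug to the guest. A minimal sketch for checking whether SR-IOV is actually in play, assuming PowerShell on the Hyper-V host and a VM named 'pfSense':

                # Is SR-IOV enabled on the virtual switch, and do the VM's vNICs request it (IovWeight > 0)?
                Get-VMSwitch | Select-Object Name, IovEnabled, IovSupport, IovSupportReasons
                Get-VMNetworkAdapter -VMName 'pfSense' | Select-Object Name, SwitchName, IovWeight, Status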


            • X
              xpxp2002 @jimp
              last edited by Mar 29, 2022, 7:39 PM

              @jimp Thanks. That definitely explains the panic, then. Now I just need to figure out why the hypervisor would attempt to hotplug the NIC during normal operation. It was up for at least 2 weeks, so I'm not sure what event triggered that today.

              • J
                jimp Rebel Alliance Developer Netgate
                last edited by Mar 29, 2022, 7:41 PM

                What you said earlier is entirely possible about a potential problem with the card or SFP module, or maybe it's something Hyper-V is doing unexpectedly on link loss. It wouldn't be too hard to test those theories. Snapshot the VM first and then see what happens when you unplug/replug the cable, then try the SFP module. See what Hyper-V does and if it has another panic.
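
                The snapshot-first part of that is quick to do from the host. A minimal sketch, assuming PowerShell on the Hyper-V host, a VM named 'pfSense', and a made-up checkpoint name:

                    # Take a checkpoint before the link-loss test so the VM can be rolled back if it panics again
                    Checkpoint-VM -Name 'pfSense' -SnapshotName 'pre-link-loss-test'
                    # Roll back afterwards if needed:
                    # Restore-VMSnapshot -VMName 'pfSense' -Name 'pre-link-loss-test' -Confirm:$false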


                • X
                  xpxp2002 @jimp
                  last edited by Mar 29, 2022, 8:08 PM

                  @jimp I will need to run through those steps after hours when I can take the outage.

                  I did find an interesting chain of events that may be related. The pfSense VM is replicated to another hypervisor, effectively in a cold standby configuration. That hypervisor with the replica had been undergoing a backup job, during which Hyper-V will not merge in new replica differences. A series of log entries on that hypervisor at the same time as the start of the NIC events on the primary/active VM suggests that the queue of replica updates reached a max threshold and required a resync of the VM from the primary.

                  'pfSense' requires resynchronization because it has reached the threshold for the size of accumulated log files.
                  

                  I'm removing this VM from the backup job on the replica hypervisor. I'm not sure why the resync would force the detach of the NIC, but that's also something I can try to research on the Hyper-V side or try to reproduce after hours.

                  I'm thinking that this resync may be the event that caused that hotplug event to occur on the hypervisor that had the running VM, given the timing of the logs. I maintain backups from within pfSense, so I don't really need the VM-level backup except to make it easy to lay out the virtual hardware config (vNIC MAC addresses and whatnot); I'm not concerned about excluding this VM from backup elsewhere.
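
                  The replication state and health can be confirmed from either host with the standard Hyper-V replication cmdlets. A minimal sketch, assuming the VM name 'pfSense' from the log entry above; the explicit resynchronize line is only needed if a resync is still reported as required:

                      # Replication mode/state/health for this VM on the local host
                      Get-VMReplication -VMName 'pfSense' | Select-Object Name, Mode, State, Health
                      Measure-VMReplication -VMName 'pfSense'
                      # If resynchronization is still reported as required:
                      # Resume-VMReplication -VMName 'pfSense' -Resynchronize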

                  • S
                    stephenw10 Netgate Administrator
                    last edited by Mar 29, 2022, 8:15 PM

                    Mmm, I note that it's not just the NICs that were detached; the actual PCI buses were detached too. That should never happen. It could never happen on real hardware.

                    Steve

                    • S stephenw10 moved this topic from General pfSense Questions on Mar 29, 2022, 8:16 PM
                    • X
                      xpxp2002 @stephenw10
                      last edited by Mar 29, 2022, 8:19 PM

                      @stephenw10 Oh wow. I misunderstood; I assumed this was just the NICs detaching, because detaching the whole bus shouldn't be possible, as you said.

                      I need to dig into the Hyper-V side of this to understand that. That's absolutely baffling behavior. But hey, Microsoft...
