Netgate Discussion Forum

    pfSense+ VM encountered page fault -- submit crash dump to Netgate?

    • xpxp2002 @stephenw10

      db:0:kdb.enter.default>  bt
      Tracing pid 0 tid 100117 td 0xfffff80005a00000
      kdb_enter() at kdb_enter+0x37/frame 0xfffffe002ba27600
      vpanic() at vpanic+0x197/frame 0xfffffe002ba27650
      panic() at panic+0x43/frame 0xfffffe002ba276b0
      trap_fatal() at trap_fatal+0x391/frame 0xfffffe002ba27710
      trap_pfault() at trap_pfault+0x4f/frame 0xfffffe002ba27760
      trap() at trap+0x286/frame 0xfffffe002ba27870
      calltrap() at calltrap+0x8/frame 0xfffffe002ba27870
      --- trap 0xc, rip = 0xffffffff81494527, rsp = 0xfffffe002ba27940, rbp = 0xfffffe002ba27980 ---
      wait_for_response() at wait_for_response+0x67/frame 0xfffffe002ba27980
      vmbus_pcib_attach() at vmbus_pcib_attach+0x38f/frame 0xfffffe002ba27a70
      device_attach() at device_attach+0x3dd/frame 0xfffffe002ba27ac0
      device_probe_and_attach() at device_probe_and_attach+0x41/frame 0xfffffe002ba27af0
      vmbus_add_child() at vmbus_add_child+0x79/frame 0xfffffe002ba27b20
      taskqueue_run_locked() at taskqueue_run_locked+0x144/frame 0xfffffe002ba27b80
      taskqueue_thread_loop() at taskqueue_thread_loop+0xb6/frame 0xfffffe002ba27bb0
      fork_exit() at fork_exit+0x7e/frame 0xfffffe002ba27bf0
      fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe002ba27bf0
      
      Fatal trap 12: page fault while in kernel mode
      cpuid = 2; apic id = 02
      fault virtual address	= 0x20
      fault code		= supervisor write data, page not present
      instruction pointer	= 0x20:0xffffffff81494527
      stack pointer	        = 0x28:0xfffffe002ba27940
      frame pointer	        = 0x28:0xfffffe002ba27980
      code segment		= base 0x0, limit 0xfffff, type 0x1b
      			= DPL 0, pres 1, long 1, def32 0, gran 1
      processor eflags	= interrupt enabled, resume, IOPL = 0
      current process		= 0 (vmbusdev)
      trap number		= 12
      panic: page fault
      cpuid = 2
      time = 1648555018
      KDB: enter: panic
      
      • stephenw10 Netgate Administrator

        Hmm, nope, not familiar with that.
        What hypervisor is it?
        Anything logged before it crashed?

        The backtrace makes it look like it's seeing new hardware, which seems odd.

        Steve

        • xpxp2002 @stephenw10

          @stephenw10 Hyper-V on Windows Server 2019. There was no change in hardware.

          However, the first relevant event I see in the log is that the vNICs associated with the physical NIC detached. What's odd is that I don't see other VMs attached to the same VM switch reflecting a loss of link at that time, so it doesn't suggest to me that the physical NIC itself went through a reset or loss of connectivity. Here's the system log leading into and through the string of events showing the loss of the NICs.

          2022-03-29 07:50:00.046839-04:00	sshguard	65023	Now monitoring attacks.
          2022-03-29 07:56:48.362702-04:00	kernel	-	mlx4_core0: detached
          2022-03-29 07:56:48.362792-04:00	kernel	-	hn0: got notify, nvs type 128
          2022-03-29 07:56:49.403129-04:00	kernel	-	mlx4_core1: detached
          2022-03-29 07:56:49.403265-04:00	kernel	-	hn1: got notify, nvs type 128
          2022-03-29 07:56:50.873181-04:00	kernel	-	mlx4_core2: detached
          2022-03-29 07:56:50.873305-04:00	kernel	-	hn2: got notify, nvs type 128
          2022-03-29 07:56:52.563102-04:00	kernel	-	mlx4_core3: detached
          2022-03-29 07:56:52.563222-04:00	kernel	-	hn3: got notify, nvs type 128
          2022-03-29 07:56:54.055682-04:00	kernel	-	mlx4_core4: detached
          2022-03-29 07:56:54.055853-04:00	kernel	-	hn4: got notify, nvs type 128
          2022-03-29 07:56:55.353642-04:00	kernel	-	mlx4_core5: detached
          2022-03-29 07:56:55.563273-04:00	kernel	-	hn5: got notify, nvs type 128
          2022-03-29 07:56:57.048901-04:00	kernel	-	mlx4_core6: detached
          2022-03-29 07:56:57.049002-04:00	kernel	-	pci0: detached
          2022-03-29 07:56:57.049058-04:00	kernel	-	hn6: got notify, nvs type 128
          2022-03-29 07:56:57.049097-04:00	kernel	-	pcib0: detached
          2022-03-29 07:56:57.049125-04:00	kernel	-	pci1: detached
          2022-03-29 07:56:57.049229-04:00	kernel	-	pcib1: detached
          2022-03-29 07:56:57.049297-04:00	kernel	-	pci2: detached
          2022-03-29 07:56:57.049354-04:00	kernel	-	pcib2: detached
          2022-03-29 07:56:57.049413-04:00	kernel	-	pci3: detached
          2022-03-29 07:56:57.049487-04:00	kernel	-	pcib3: detached
          2022-03-29 07:56:57.049570-04:00	kernel	-	pci4: detached
          2022-03-29 07:56:57.049631-04:00	kernel	-	pcib4: detached
          2022-03-29 07:56:57.049693-04:00	kernel	-	pci5: detached
          2022-03-29 07:56:57.049762-04:00	kernel	-	pcib5: detached
          2022-03-29 07:56:57.049803-04:00	kernel	-	pci6: detached
          2022-03-29 07:56:57.049890-04:00	kernel	-	pcib6: detached
          2022-03-29 07:56:58.504205-04:00	kernel	-	hn0: got notify, nvs type 128
          2022-03-29 07:56:58.504319-04:00	kernel	-	hn1: got notify, nvs type 128
          2022-03-29 07:56:58.504385-04:00	kernel	-	hn2: got notify, nvs type 128
          2022-03-29 07:56:58.504457-04:00	kernel	-	hn3: got notify, nvs type 128
          2022-03-29 07:56:58.504517-04:00	kernel	-	hn4: got notify, nvs type 128
          2022-03-29 07:56:58.504575-04:00	kernel	-	hn5: got notify, nvs type 128
          2022-03-29 07:56:58.504644-04:00	kernel	-	hn6: got notify, nvs type 128
          2022-03-29 07:56:58.504899-04:00	kernel	-	pcib0: <Hyper-V PCI Express Pass Through> on vmbus0
          2022-03-29 07:56:58.504985-04:00	kernel	-	pci0: <PCI bus> on pcib0
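
          For reference, this is roughly how I'm checking the other vNICs on that switch from the Hyper-V host (just a sketch; 'External' is a placeholder for the actual vSwitch name):

              # On the Hyper-V host: list every vNIC attached to the same vSwitch
              # and its current status ('External' is a placeholder vSwitch name)
              Get-VM | Get-VMNetworkAdapter |
                  Where-Object { $_.SwitchName -eq 'External' } |
                  Select-Object VMName, Name, SwitchName, Status, MacAddress |
                  Format-Table -AutoSize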
          
          • xpxp2002

            @stephenw10 As I'm looking through logs, it does look like there was at least a brief moment of loss of connectivity between the hypervisor and the switch. Still not clear why, though I'm leaning toward NIC or perhaps SFP module.

            Doesn't explain why pfSense ended up panicking, but making progress toward an initial cause.

            • jimp Rebel Alliance Developer Netgate @xpxp2002

              @xpxp2002 said in pfSense+ VM encountered page fault -- submit crash dump to Netgate?:

              @stephenw10 As I'm looking through logs, it does look like there was at least a brief moment of loss of connectivity between the hypervisor and the switch. Still not clear why, though I'm leaning toward NIC or perhaps SFP module.

              Doesn't explain why pfSense ended up panicking, but making progress toward an initial cause.

              It looks like the hypervisor tried to hotplug the actual PCI device, which isn't supported and has been known to flake out like that in the past for similar events. It's more common on virtual hardware than real, but either way it's unsupported. The only safe way to add or remove hardware is while the VM is powered off. I wouldn't expect a hypervisor to remove the device from the VM like that, though; that may be something you can configure in Hyper-V.
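
              One assumption worth checking: if those mlx4 devices are SR-IOV virtual functions being passed through (which the detach/attach pattern in your log suggests, though that's a guess), you may be able to stop Hyper-V from hotplugging the VF at all by disabling IOV on the vNIC, at the cost of falling back to the synthetic hn path. A rough sketch from the host, assuming the VM is named 'pfSense':

                  # Check whether the VM's vNICs currently have SR-IOV enabled:
                  Get-VMNetworkAdapter -VMName 'pfSense' |
                      Select-Object Name, SwitchName, IovWeight

                  # IovWeight 0 disables SR-IOV for the adapter, so no VF gets
                  # hotplugged into the guest (traffic uses the synthetic path only):
                  Get-VMNetworkAdapter -VMName 'pfSense' |
                      Set-VMNetworkAdapter -IovWeight 0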


              • xpxp2002 @jimp

                @jimp Thanks. That definitely explains the panic, then. Now I just need to figure out why the hypervisor would attempt to hotplug the NIC during normal operation. It had been up for at least two weeks, so I'm not sure what event triggered that today.
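
                My plan is to start with the host-side Hyper-V event logs around that window. Something like this sketch (using what I believe are the standard Hyper-V admin channels):

                    # On the Hyper-V host: pull VMMS and worker events around the
                    # time the devices detached (2022-03-29 ~07:56 local)
                    Get-WinEvent -FilterHashtable @{
                        LogName   = 'Microsoft-Windows-Hyper-V-VMMS-Admin',
                                    'Microsoft-Windows-Hyper-V-Worker-Admin'
                        StartTime = (Get-Date '2022-03-29 07:50:00')
                        EndTime   = (Get-Date '2022-03-29 08:00:00')
                    } | Sort-Object TimeCreated |
                        Format-Table TimeCreated, ProviderName, Id, Message -Wrap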

                • jimp Rebel Alliance Developer Netgate

                  What you said earlier about a potential problem with the card or SFP module is entirely possible, or maybe it's something Hyper-V is doing unexpectedly on link loss. It wouldn't be too hard to test those theories. Snapshot the VM first, then see what happens when you unplug/replug the cable, then try the SFP module. See what Hyper-V does and whether it panics again.
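
                  From the host that can be as simple as the following sketch (the checkpoint name is arbitrary):

                      # Checkpoint the VM so it can be rolled back if the test
                      # ends in another panic:
                      Checkpoint-VM -Name 'pfSense' -SnapshotName 'before-link-test'

                      # ...run the cable/SFP tests, then roll back if needed:
                      Restore-VMCheckpoint -VMName 'pfSense' -Name 'before-link-test' -Confirm:$false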


                  • xpxp2002 @jimp

                    @jimp I will need to run through those steps after hours when I can take the outage.

                    I did find an interesting chain of events that may be related. The pfSense VM is replicated to another hypervisor, effectively in a cold-standby configuration. The hypervisor holding the replica had been undergoing a backup job, during which Hyper-V will not merge in new replica differences. A series of log entries on that hypervisor, at the same time as the start of the NIC events on the primary/active VM, suggests that the queue of replica updates reached its maximum threshold and required a resync of the VM from the primary.

                    'pfSense' requires resynchronization because it has reached the threshold for the size of accumulated log files.
                    

                    I'm removing this VM from the backup job on the replica hypervisor. I'm not sure why the resync would force the detach of the NIC, but that's also something I can try to research on the Hyper-V side or try to reproduce after hours.
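
                    For anyone else chasing this, the replication state and backlog are visible from PowerShell on either host. A quick sketch:

                        # Replication mode, state and health for the VM:
                        Get-VMReplication -VMName 'pfSense' |
                            Select-Object VMName, Mode, State, Health

                        # More detail, e.g. pending size and last replication times:
                        Measure-VMReplication -VMName 'pfSense' | Format-List *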

                    Given the timing of the logs, I'm thinking this resync may be what triggered the hotplug event on the hypervisor running the VM. I maintain backups from within pfSense, so I don't really need the VM-level backup except to make it easy to lay out the virtual hardware config (vNIC MAC addresses and whatnot); I'm not concerned about excluding this VM from backup elsewhere.

                    • stephenw10 Netgate Administrator

                      Mmm, I note it's not just the NICs that were detached; the actual PCI buses were detached too. That should never happen. It could never happen on real hardware.

                      Steve

                      • xpxp2002 @stephenw10

                        @stephenw10 Oh wow. I misunderstood; I assumed this was just the NIC detaching, because detaching the whole bus shouldn't be possible, as you said.

                        I need to dig into the Hyper-V side of this to understand that. That's absolutely baffling behavior. But hey, Microsoft...
