pfSense+ VM encountered page fault -- submit crash dump to Netgate?
-
db:0:kdb.enter.default> bt
Tracing pid 0 tid 100117 td 0xfffff80005a00000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe002ba27600
vpanic() at vpanic+0x197/frame 0xfffffe002ba27650
panic() at panic+0x43/frame 0xfffffe002ba276b0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe002ba27710
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe002ba27760
trap() at trap+0x286/frame 0xfffffe002ba27870
calltrap() at calltrap+0x8/frame 0xfffffe002ba27870
--- trap 0xc, rip = 0xffffffff81494527, rsp = 0xfffffe002ba27940, rbp = 0xfffffe002ba27980 ---
wait_for_response() at wait_for_response+0x67/frame 0xfffffe002ba27980
vmbus_pcib_attach() at vmbus_pcib_attach+0x38f/frame 0xfffffe002ba27a70
device_attach() at device_attach+0x3dd/frame 0xfffffe002ba27ac0
device_probe_and_attach() at device_probe_and_attach+0x41/frame 0xfffffe002ba27af0
vmbus_add_child() at vmbus_add_child+0x79/frame 0xfffffe002ba27b20
taskqueue_run_locked() at taskqueue_run_locked+0x144/frame 0xfffffe002ba27b80
taskqueue_thread_loop() at taskqueue_thread_loop+0xb6/frame 0xfffffe002ba27bb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe002ba27bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe002ba27bf0
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address   = 0x20
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff81494527
stack pointer           = 0x28:0xfffffe002ba27940
frame pointer           = 0x28:0xfffffe002ba27980
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (vmbusdev)
trap number             = 12
panic: page fault
cpuid = 2
time = 1648555018
KDB: enter: panic
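In case it helps with the report, here's a quick throwaway Python sketch to pull the headline fields out of the panic text for a bug ticket (rough sketch only; "panic.txt" is an arbitrary filename I'm assuming the text above was saved to):

import re

# Throwaway sketch: extract the key fields from a saved FreeBSD
# "Fatal trap" panic message. "panic.txt" is an assumed filename;
# the field names match the panic text above verbatim.
panic_text = open("panic.txt").read()

for field in ("fault virtual address", "instruction pointer",
              "current process", "trap number"):
    match = re.search(rf"{re.escape(field)}\s*=\s*(.+)", panic_text)
    if match:
        print(f"{field}: {match.group(1).strip()}")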
-
Hmm, nope, not familiar with that.
What hypervisor is it?
Anything logged before it crashed?
The backtrace makes it look like it's seeing new hardware, which seems odd.
Steve
-
@stephenw10 Hyper-V on Windows Server 2019. There was no change in hardware.
However, the first relevant event I see in the log is that the vNICs associated with the physical NIC detached. What's odd is that I don't see other VMs attached to the same VM switch reflecting a loss of link at that time, so it doesn't suggest to me that the physical NIC itself went through a reset or lost connectivity. Here's the last normal system log entry, followed by the string of events showing the loss of the NICs.
2022-03-29 07:50:00.046839-04:00 sshguard 65023 Now monitoring attacks.
2022-03-29 07:56:48.362702-04:00 kernel - mlx4_core0: detached
2022-03-29 07:56:48.362792-04:00 kernel - hn0: got notify, nvs type 128
2022-03-29 07:56:49.403129-04:00 kernel - mlx4_core1: detached
2022-03-29 07:56:49.403265-04:00 kernel - hn1: got notify, nvs type 128
2022-03-29 07:56:50.873181-04:00 kernel - mlx4_core2: detached
2022-03-29 07:56:50.873305-04:00 kernel - hn2: got notify, nvs type 128
2022-03-29 07:56:52.563102-04:00 kernel - mlx4_core3: detached
2022-03-29 07:56:52.563222-04:00 kernel - hn3: got notify, nvs type 128
2022-03-29 07:56:54.055682-04:00 kernel - mlx4_core4: detached
2022-03-29 07:56:54.055853-04:00 kernel - hn4: got notify, nvs type 128
2022-03-29 07:56:55.353642-04:00 kernel - mlx4_core5: detached
2022-03-29 07:56:55.563273-04:00 kernel - hn5: got notify, nvs type 128
2022-03-29 07:56:57.048901-04:00 kernel - mlx4_core6: detached
2022-03-29 07:56:57.049002-04:00 kernel - pci0: detached
2022-03-29 07:56:57.049058-04:00 kernel - hn6: got notify, nvs type 128
2022-03-29 07:56:57.049097-04:00 kernel - pcib0: detached
2022-03-29 07:56:57.049125-04:00 kernel - pci1: detached
2022-03-29 07:56:57.049229-04:00 kernel - pcib1: detached
2022-03-29 07:56:57.049297-04:00 kernel - pci2: detached
2022-03-29 07:56:57.049354-04:00 kernel - pcib2: detached
2022-03-29 07:56:57.049413-04:00 kernel - pci3: detached
2022-03-29 07:56:57.049487-04:00 kernel - pcib3: detached
2022-03-29 07:56:57.049570-04:00 kernel - pci4: detached
2022-03-29 07:56:57.049631-04:00 kernel - pcib4: detached
2022-03-29 07:56:57.049693-04:00 kernel - pci5: detached
2022-03-29 07:56:57.049762-04:00 kernel - pcib5: detached
2022-03-29 07:56:57.049803-04:00 kernel - pci6: detached
2022-03-29 07:56:57.049890-04:00 kernel - pcib6: detached
2022-03-29 07:56:58.504205-04:00 kernel - hn0: got notify, nvs type 128
2022-03-29 07:56:58.504319-04:00 kernel - hn1: got notify, nvs type 128
2022-03-29 07:56:58.504385-04:00 kernel - hn2: got notify, nvs type 128
2022-03-29 07:56:58.504457-04:00 kernel - hn3: got notify, nvs type 128
2022-03-29 07:56:58.504517-04:00 kernel - hn4: got notify, nvs type 128
2022-03-29 07:56:58.504575-04:00 kernel - hn5: got notify, nvs type 128
2022-03-29 07:56:58.504644-04:00 kernel - hn6: got notify, nvs type 128
2022-03-29 07:56:58.504899-04:00 kernel - pcib0: <Hyper-V PCI Express Pass Through> on vmbus0
2022-03-29 07:56:58.504985-04:00 kernel - pci0: <PCI bus> on pcib0
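For anyone reading along later, a rough Python sketch to list the detach events from a saved copy of that log in timestamp order, which makes it easy to see that the whole PCI tree (pci*/pcib*) went away, not just the mlx4/hn NICs ("system.log" is an assumed filename; the pattern follows the kernel lines above):

import re

# Rough sketch: print "<device>: detached" kernel events in order.
# "system.log" is an assumed filename for the log excerpt above.
pattern = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \S+) kernel - (?P<dev>\w+): detached"
)

with open("system.log") as log:
    for line in log:
        match = pattern.match(line)
        if match:
            print(match.group("ts"), match.group("dev"))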
-
@stephenw10 As I'm looking through logs, it does look like there was at least a brief moment of loss of connectivity between the hypervisor and the switch. Still not clear why, though I'm leaning toward NIC or perhaps SFP module.
Doesn't explain why pfSense ended up panicking, but making progress toward an initial cause.
-
@xpxp2002 said in pfSense+ VM encountered page fault -- submit crash dump to Netgate?:
@stephenw10 As I'm looking through logs, it does look like there was at least a brief moment of loss of connectivity between the hypervisor and the switch. Still not clear why, though I'm leaning toward NIC or perhaps SFP module.
Doesn't explain why pfSense ended up panicking, but making progress toward an initial cause.
It looks like the hypervisor tried to hotplug the actual PCI device, which isn't supported and has been known to flake out like that in similar situations in the past. It's more common on virtual hardware than on real hardware, but either way it's unsupported. The only safe way to add or remove hardware is while the VM is powered off. I wouldn't expect a hypervisor to remove the device from a running VM like that, though; that may be something you can configure in Hyper-V.
-
@jimp Thanks. That definitely explains the panic, then. Now I just need to figure out why the hypervisor would attempt to hotplug the NIC during normal operation. The VM had been up for at least two weeks, so I'm not sure what event triggered that today.
-
What you said earlier about a potential problem with the card or SFP module is entirely possible, or maybe it's something Hyper-V is doing unexpectedly on link loss. It wouldn't be too hard to test those theories. Snapshot the VM first, then see what happens when you unplug and replug the cable, and then try the same with the SFP module. See what Hyper-V does and whether pfSense panics again.
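If you want timestamps that line up with the hypervisor logs while you test, something as simple as this would do, run from another machine on the LAN (rough Python sketch; the address and port are assumptions, substitute the VM's LAN IP):

import socket
import time

# Rough probe: timestamp when the firewall stops answering during the
# cable/SFP pull test, so the outage window can be matched against the
# Hyper-V and pfSense logs. HOST/PORT are assumptions -- use the VM's
# LAN address and a port it listens on (e.g. the web GUI).
HOST, PORT = "192.168.1.1", 443

while True:
    try:
        with socket.create_connection((HOST, PORT), timeout=2):
            state = "up"
    except OSError:
        state = "DOWN"
    print(time.strftime("%Y-%m-%d %H:%M:%S"), state)
    time.sleep(1)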
-
@jimp I will need to run through those steps after hours when I can take the outage.
I did find an interesting chain of events that may be related. The pfSense VM is replicated to another hypervisor, effectively in a cold-standby configuration. The hypervisor holding the replica had been running a backup job, during which Hyper-V will not merge in new replica differences. A series of log entries on that hypervisor, at the same time as the start of the NIC events on the primary/active VM, suggests that the queue of replica updates reached its maximum threshold and forced a resynchronization of the VM from the primary.
'pfSense' requires resynchronization because it has reached the threshold for the size of accumulated log files.
I'm removing this VM from the backup job on the replica hypervisor. I'm not sure why the resync would force the NICs to detach, but that's also something I can research on the Hyper-V side or try to reproduce after hours.
I'm thinking this resync may be the event that caused the hotplug to occur on the hypervisor running the VM, given the timing of the logs. I maintain backups from within pfSense, so I don't really need the VM-level backup except to make it easy to lay out the virtual hardware config (vNIC MAC addresses and whatnot), and I'm therefore not concerned about excluding this VM from the backup job.
-
Mmm, I note it's not just the NICs that were detached; the actual PCI buses were detached too. That should never happen. It could never happen on real hardware.
Steve
-
@stephenw10 Oh wow. I misunderstood; I assumed it was just the NICs detaching, because detaching the whole bus shouldn't even be possible, as you said.
I need to dig into the Hyper-V side of this to understand that. That's absolutely baffling behavior. But hey, Microsoft...