pfSense+ VM encountered page fault -- submit crash dump to Netgate?
-
I was just watching TV (streaming) when the internet appeared to go offline. After about 2 minutes I got an alert email that pfSense had just finished booting. There is a crash dump indicating "Fatal trap 12: page fault while in kernel mode".
Is Netgate interested in collecting the crash dump data? I did not see anywhere specific to submit or upload this. But it is extremely unusual, in my experience, for pfSense to crash this way.
-
The backtrace from the start of the panic dump in the ddb file is usually enough to recognize whether it's something known. Something like:
db:0:kdb.enter.default> bt
Tracing pid 11 tid 100006 td 0xfffff80004341740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe000249fe50
vpanic() at vpanic+0x197/frame 0xfffffe000249fea0
panic() at panic+0x43/frame 0xfffffe000249ff00
mca_intr() at mca_intr+0x9b/frame 0xfffffe000249ff20
mchk_calltrap() at mchk_calltrap+0x8/frame 0xfffffe000249ff20
--- trap 0x1c, rip = 0xffffffff815285f8, rsp = 0xfffffe003fb59970, rbp = 0xfffffe003fb59970 ---
acpi_cpu_idle_mwait() at acpi_cpu_idle_mwait+0x68/frame 0xfffffe003fb59970
acpi_cpu_idle() at acpi_cpu_idle+0x186/frame 0xfffffe003fb599b0
cpu_idle_acpi() at cpu_idle_acpi+0x3e/frame 0xfffffe003fb599d0
cpu_idle() at cpu_idle+0x9f/frame 0xfffffe003fb599f0
sched_idletd() at sched_idletd+0x326/frame 0xfffffe003fb59ab0
fork_exit() at fork_exit+0x7e/frame 0xfffffe003fb59af0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe003fb59af0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:0:kdb.enter.default> ps
The last output from the message buffer can also help.
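If it's easier to pull it from the shell than from the GUI crash reporter, something like the following should get at both (a minimal sketch, assuming the stock savecore/textdump layout under /var/crash; the exact filename suffix, e.g. textdump.tar.0, depends on how many dumps have been saved on your install):

# list saved crash dumps
ls /var/crash
# extract the ddb session and the kernel message buffer from the latest textdump
tar -x -f /var/crash/textdump.tar.0 ddb.txt msgbuf.txt
# the backtrace quoted above lives in the ddb session output
grep -A 20 'kdb.enter.default> bt' ddb.txt
# last kernel messages logged before the panic
tail -n 50 msgbuf.txt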
Steve
-
db:0:kdb.enter.default> bt
Tracing pid 0 tid 100117 td 0xfffff80005a00000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe002ba27600
vpanic() at vpanic+0x197/frame 0xfffffe002ba27650
panic() at panic+0x43/frame 0xfffffe002ba276b0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe002ba27710
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe002ba27760
trap() at trap+0x286/frame 0xfffffe002ba27870
calltrap() at calltrap+0x8/frame 0xfffffe002ba27870
--- trap 0xc, rip = 0xffffffff81494527, rsp = 0xfffffe002ba27940, rbp = 0xfffffe002ba27980 ---
wait_for_response() at wait_for_response+0x67/frame 0xfffffe002ba27980
vmbus_pcib_attach() at vmbus_pcib_attach+0x38f/frame 0xfffffe002ba27a70
device_attach() at device_attach+0x3dd/frame 0xfffffe002ba27ac0
device_probe_and_attach() at device_probe_and_attach+0x41/frame 0xfffffe002ba27af0
vmbus_add_child() at vmbus_add_child+0x79/frame 0xfffffe002ba27b20
taskqueue_run_locked() at taskqueue_run_locked+0x144/frame 0xfffffe002ba27b80
taskqueue_thread_loop() at taskqueue_thread_loop+0xb6/frame 0xfffffe002ba27bb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe002ba27bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe002ba27bf0
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address = 0x20
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff81494527
stack pointer = 0x28:0xfffffe002ba27940
frame pointer = 0x28:0xfffffe002ba27980
code segment = base 0x0, limit 0xfffff, type 0x1b
               = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (vmbusdev)
trap number = 12
panic: page fault
cpuid = 2
time = 1648555018
KDB: enter: panic
-
Hmm, nope not familiar with that.
What hypervisor is it?
Anything logged before it crashed? The backtrace makes it look like it's seeing new hardware, which seems odd.
Steve
-
@stephenw10 Hyper-V on Windows Server 2019. There was no change in hardware.
However, the first relevant event I see in the log is that the vNICs associated with the physical NIC detached. What's odd is that I don't see other VMs attached to the same VM switch reflecting a loss of link at that time, so it doesn't suggest to me that the physical NIC itself went through a reset or loss of connectivity. Here's the last system log entry prior to the string of events showing the loss of the NICs, followed by those events:
2022-03-29 07:50:00.046839-04:00 sshguard 65023 Now monitoring attacks.
2022-03-29 07:56:48.362702-04:00 kernel - mlx4_core0: detached
2022-03-29 07:56:48.362792-04:00 kernel - hn0: got notify, nvs type 128
2022-03-29 07:56:49.403129-04:00 kernel - mlx4_core1: detached
2022-03-29 07:56:49.403265-04:00 kernel - hn1: got notify, nvs type 128
2022-03-29 07:56:50.873181-04:00 kernel - mlx4_core2: detached
2022-03-29 07:56:50.873305-04:00 kernel - hn2: got notify, nvs type 128
2022-03-29 07:56:52.563102-04:00 kernel - mlx4_core3: detached
2022-03-29 07:56:52.563222-04:00 kernel - hn3: got notify, nvs type 128
2022-03-29 07:56:54.055682-04:00 kernel - mlx4_core4: detached
2022-03-29 07:56:54.055853-04:00 kernel - hn4: got notify, nvs type 128
2022-03-29 07:56:55.353642-04:00 kernel - mlx4_core5: detached
2022-03-29 07:56:55.563273-04:00 kernel - hn5: got notify, nvs type 128
2022-03-29 07:56:57.048901-04:00 kernel - mlx4_core6: detached
2022-03-29 07:56:57.049002-04:00 kernel - pci0: detached
2022-03-29 07:56:57.049058-04:00 kernel - hn6: got notify, nvs type 128
2022-03-29 07:56:57.049097-04:00 kernel - pcib0: detached
2022-03-29 07:56:57.049125-04:00 kernel - pci1: detached
2022-03-29 07:56:57.049229-04:00 kernel - pcib1: detached
2022-03-29 07:56:57.049297-04:00 kernel - pci2: detached
2022-03-29 07:56:57.049354-04:00 kernel - pcib2: detached
2022-03-29 07:56:57.049413-04:00 kernel - pci3: detached
2022-03-29 07:56:57.049487-04:00 kernel - pcib3: detached
2022-03-29 07:56:57.049570-04:00 kernel - pci4: detached
2022-03-29 07:56:57.049631-04:00 kernel - pcib4: detached
2022-03-29 07:56:57.049693-04:00 kernel - pci5: detached
2022-03-29 07:56:57.049762-04:00 kernel - pcib5: detached
2022-03-29 07:56:57.049803-04:00 kernel - pci6: detached
2022-03-29 07:56:57.049890-04:00 kernel - pcib6: detached
2022-03-29 07:56:58.504205-04:00 kernel - hn0: got notify, nvs type 128
2022-03-29 07:56:58.504319-04:00 kernel - hn1: got notify, nvs type 128
2022-03-29 07:56:58.504385-04:00 kernel - hn2: got notify, nvs type 128
2022-03-29 07:56:58.504457-04:00 kernel - hn3: got notify, nvs type 128
2022-03-29 07:56:58.504517-04:00 kernel - hn4: got notify, nvs type 128
2022-03-29 07:56:58.504575-04:00 kernel - hn5: got notify, nvs type 128
2022-03-29 07:56:58.504644-04:00 kernel - hn6: got notify, nvs type 128
2022-03-29 07:56:58.504899-04:00 kernel - pcib0: <Hyper-V PCI Express Pass Through> on vmbus0
2022-03-29 07:56:58.504985-04:00 kernel - pci0: <PCI bus> on pcib0
-
@stephenw10 As I'm looking through logs, it does look like there was at least a brief moment of loss of connectivity between the hypervisor and the switch. Still not clear why, though I'm leaning toward NIC or perhaps SFP module.
Doesn't explain why pfSense ended up panicking, but making progress toward an initial cause.
-
@xpxp2002 said in pfSense+ VM encountered page fault -- submit crash dump to Netgate?:
@stephenw10 As I'm looking through logs, it does look like there was at least a brief moment of loss of connectivity between the hypervisor and the switch. Still not clear why, though I'm leaning toward NIC or perhaps SFP module.
Doesn't explain why pfSense ended up panicking, but making progress toward an initial cause.
It looks like the hypervisor tried to hotplug the actual PCI device, which isn't supported and has been known to flake out like that in the past for other similar events. It's more common on virtual hardware than real, but either way it's unsupported. The only safe way to add or remove hardware is while the VM is powered off. I wouldn't expect a hypervisor to remove the device from the VM like that, though that may be something you can configure in Hyper-V.
-
@jimp Thanks. That definitely explains the panic, then. Now I just need to figure out why the hypervisor would attempt to hotplug the NIC during normal operation. It was up for at least 2 weeks, so not sure what event triggered that today.
-
What you said earlier about a potential problem with the card or SFP module is entirely possible, or maybe it's something Hyper-V is doing unexpectedly on link loss. It wouldn't be too hard to test those theories. Snapshot the VM first, then see what happens when you unplug/replug the cable, and then try the SFP module. See what Hyper-V does and whether pfSense panics again.
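If you want to catch the same events live while testing, something like this on the pfSense console should show the detach/notify messages as they arrive (a rough sketch, assuming a recent pfSense that writes plain-text logs to /var/log/system.log; older releases used clog-format circular logs instead):

# watch the system log for the detach/notify messages seen in the quoted log above
tail -f /var/log/system.log | grep -E 'detached|got notify'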
-
I did find an interesting chain of events that may be related. The pfSense VM is replicated to another hypervisor, effectively in a cold standby configuration. That hypervisor with the replica had been undergoing a backup job, during which Hyper-V will not merge in new replica differences. A series of log entries on that hypervisor at the same time as the start of the NIC events on the primary/active VM suggests that the queue of replica updates reached a max threshold and required a resync of the VM from the primary.
'pfSense' requires resynchronization because it has reached the threshold for the size of accumulated log files.
I'm removing this VM from the backup job on the replica hypervisor. I'm not sure why the resync would force the detach of the NIC, but that's also something I can research on the Hyper-V side or try to reproduce after hours.
I'm thinking this resync may be what triggered the hotplug event on the hypervisor running the VM, given the timing of the logs. I maintain backups from within pfSense, so I don't really need the VM-level backup except to make it easy to lay out the virtual hardware config (vNIC MAC addresses and whatnot), so I'm not concerned about excluding this VM from backups elsewhere.
-
Mmm, I note that it's not just the NICs that were detached; the actual PCI buses were detached too. That should never happen. It couldn't happen on real hardware.
Steve
-
@stephenw10 Oh wow. I misunderstood; I assumed it was just the NICs detaching, because detaching the whole bus shouldn't be possible, as you said.
I need to dig into the Hyper-V side of this to understand that. That's absolutely baffling behavior. But hey, Microsoft...