New Kernel Panic ... on Boot. pimd and/or interface-change related?

MrPete

Here's a slightly strange one.

I've been running pfSense stably for months now in a "new" context: as a VM in Proxmox.

I'm using CARP, and my Backup box is working fine.

My primary box is suddenly crashing just about on boot. (Not quite: most packages etc. get configured, and the crash generally comes when it's about to "go live".) Or it runs for 1-2 hours, then crashes. It has never stayed up longer than that in the last 18 hours.

Of course I'd like any ideas on diagnosing.
Here's my question and thought:

What changed: I was trying to get SR-IOV virtual NICs to work, and failed, so I undid that config in Proxmox (no changes to pfSense).
The result of that: the PCI bus for my third (CARP/HA) interface got shifted from 00:02... to 00:03...

QUESTION: Is it possible that pfSense has some internal...something, that doesn't like a changed PCI bus number for an interface?

All thoughts MOST welcome.
Pete

Starting package acme...done.
<6>ng0: changing name to 'pppoe0'
<6>stf0: changing name to 'wan_stf'

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address	= 0x2000
fault code		= supervisor write data, page not present
instruction pointer	= 0x20:0xffffffff80eafeb5
stack pointer	        = 0x28:0xfffffe00020995a0
frame pointer	        = 0x28:0xfffffe00020995a0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 12 (irq263: virtio_pci3)
trap number		= 12
panic: page fault
cpuid = 2
time = 1645732301
KDB: enter: panic

And a bit of the panic dump:

Tracing pid 12 tid 100117 td 0xfffff8003004d000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe0002099260
vpanic() at vpanic+0x197/frame 0xfffffe00020992b0
panic() at panic+0x43/frame 0xfffffe0002099310
trap_fatal() at trap_fatal+0x391/frame 0xfffffe0002099370
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00020993c0
trap() at trap+0x286/frame 0xfffffe00020994d0
calltrap() at calltrap+0x8/frame 0xfffffe00020994d0
--- trap 0xc, rip = 0xffffffff80eafeb5, rsp = 0xfffffe00020995a0, rbp = 0xfffffe00020995a0 ---
if_inc_counter() at if_inc_counter+0x15/frame 0xfffffe00020995a0
if_simloop() at if_simloop+0xd1/frame 0xfffffe00020995e0
pim_input() at pim_input+0x409/frame 0xfffffe0002099640
encap_input() at encap_input+0xd1/frame 0xfffffe00020996b0
encap4_input() at encap4_input+0x28/frame 0xfffffe00020996e0
ip_input() at ip_input+0x168/frame 0xfffffe0002099790
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe00020997e0
ether_demux() at ether_demux+0x16a/frame 0xfffffe0002099810
ether_nh_input() at ether_nh_input+0x330/frame 0xfffffe0002099870
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe00020998c0
ether_input() at ether_input+0x4b/frame 0xfffffe00020998f0
vlan_input() at vlan_input+0x1f3/frame 0xfffffe0002099940
ether_demux() at ether_demux+0x153/frame 0xfffffe0002099970
ether_nh_input() at ether_nh_input+0x330/frame 0xfffffe00020999d0
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe0002099a20
ether_input() at ether_input+0x4b/frame 0xfffffe0002099a50
vtnet_rxq_eof() at vtnet_rxq_eof+0x7a5/frame 0xfffffe0002099b10
vtnet_rx_vq_process() at vtnet_rx_vq_process+0xb7/frame 0xfffffe0002099b50
ithread_loop() at ithread_loop+0x23c/frame 0xfffffe0002099bb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0002099bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0002099bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
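For anyone comparing panics across boots, here is a minimal Python sketch that pulls out just the frames between the first "--- trap" marker and the next one, i.e. the call chain that was running when the fault hit. The function name faulting_frames and the regular expression are my own illustration, and it assumes the trace is formatted like the one above.

```python
import re
from typing import List

# Matches frame lines such as "pim_input() at pim_input+0x409/frame 0x..."
FRAME_RE = re.compile(r"^(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+$")


def faulting_frames(trace: str) -> List[str]:
    """Return function names listed between the first '--- trap' marker
    and the next one (innermost frame first)."""
    frames = []
    in_fault = False
    for raw in trace.splitlines():
        line = raw.strip()
        if line.startswith("--- trap"):
            if in_fault:
                break  # second marker: end of the interrupted call chain
            in_fault = True
            continue
        if in_fault:
            match = FRAME_RE.match(line)
            if match:
                frames.append(match.group(1))
    return frames


# A few lines copied from the trace above.
sample = """\
trap() at trap+0x286/frame 0xfffffe00020994d0
calltrap() at calltrap+0x8/frame 0xfffffe00020994d0
--- trap 0xc, rip = 0xffffffff80eafeb5, rsp = 0xfffffe00020995a0, rbp = 0xfffffe00020995a0 ---
if_inc_counter() at if_inc_counter+0x15/frame 0xfffffe00020995a0
if_simloop() at if_simloop+0xd1/frame 0xfffffe00020995e0
pim_input() at pim_input+0x409/frame 0xfffffe0002099640
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
"""

print(faulting_frames(sample))
```

On the full trace this makes it easy to see that the fault is in if_inc_counter(), reached from pim_input() via if_simloop().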

jimp

pfSense doesn't care about bus locations, only the driver names. But that wouldn't cause a panic; if it were a problem, it would make the interfaces mismatch.

That panic doesn't look familiar.

Did you have a snapshot of the VM before you started making changes? Rolling back to an old snapshot would undo any changes made to the VM config as well, I believe.

A change in the VM settings could still be at fault, though I've not seen anything in my Proxmox VMs crash like that with any combination of settings I've messed with in the past.

That or re-create the VM config from scratch (including the MAC addresses) and import the config and see if the panic still happens in a fresh VM.

MrPete

Thanks, @jimp ...
I did a full VM restore from before the issues, and yes that guarantees a restoration of per-VM config. The panic above was from that restored VM.

However, the changed PCI bus is still in place, since that's in "virtual" hardware.

I just realized there's a Linux host boot-time setting I can remove (pci=assign-busses) that enables the PCI bus shift. I'll try again with that removed to see if it makes any difference.

MrPete

@jimp --

I have eliminated a few distractions:

- Intense memory test (just in case ;) )
- Restored PCI bus number

It still crashes.

I then looked at the panic info, and also the logs on my (nicely functioning) backup CARP box...

I am running pfSense 2.5.2 stable, with pimd 0.0.3_4.

Observations and Questions:

0.0.3_5 is now out. Apparently this is about ensuring there's no rc start file if PIMD is disabled. I've got it enabled...
The log says pimd is starting twice at boot. Clearly that can cause some trouble. I thought this was fixed long before 0.0.3_4?
Perhaps more telling: I have not touched my PIMD configuration in quite some time. I notice (in my running backup CARP, and in config.xml) that two VLAN interfaces I deleted a year ago are still listed in the pimd config...
...in fact, looking at a config.xml, those two nonexistent VLAN interfaces are still present in a number of places:
- active firewall rules
- several package configs
- and more
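One way to list every leftover reference is to walk a copy of config.xml and report each element whose text names a removed interface. This is only a rough sketch: find_references is my own helper, and the XML fragment, the element names in it, and the name opt7 are hypothetical stand-ins rather than a real pfSense config.

```python
import re
import xml.etree.ElementTree as ET


def find_references(root, names):
    """Yield (path, text) for every element whose text contains one of
    `names` as a whole token (tokens are split on commas and whitespace)."""
    wanted = set(names)

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        text = (elem.text or "").strip()
        if text and wanted & set(re.split(r"[,\s]+", text)):
            yield here, text
        for child in elem:
            yield from walk(child, here)

    yield from walk(root, "")


# Hypothetical fragment: opt7 stands in for a VLAN interface that was deleted
# in the GUI but is still referenced by a rule and by a package config.
sample = """<pfsense>
  <interfaces><opt3><if>vtnet1.30</if></opt3></interfaces>
  <filter><rule><interface>opt7</interface></rule></filter>
  <installedpackages><pimd><row><ifname>opt7</ifname></row></pimd></installedpackages>
</pfsense>"""

hits = list(find_references(ET.fromstring(sample), {"opt7"}))
for path, text in hits:
    print(path, "->", text)
```

For a real check, replace ET.fromstring(sample) with ET.parse(path_to_a_config_copy).getroot() and pass the names of the interfaces that were deleted.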

I notice that this panic seems to involve kernel pim_sm code... which makes me think perhaps that such invalid config info can be a bit more dangerous than one might imagine?

Happy to do some testing on this.... (Will keep the WAN/LAN port on this box disconnected from everything if possible. I don't want to kill our live connection... Hmm: can I manually from single-user-mode ensure this box doesn't become Primary CARP on boot?)

(If it would help, I can privately get a config.xml to you, with notes on the two missing interfaces. If deleting a VLAN interface should be "clean", this is a gold mine of MISbSINuG FEgATUsRES.)

MrPete

@jimp Any thoughts on:

Whether having invalid VLAN interfaces referenced in a config could cause various trouble, even a kernel panic?
Whether it is known/expected that removal of VLAN interfaces in the GUI does not remove them from the various configuration sections?

I just want to understand how to move forward from here.

potata_netgato

Hello,
I had a similar experience while prototyping pimd with IPsec and WireGuard. I didn't add any VLANs, but I had added and removed some interfaces. I'm attaching the config file (nothing confidential) and a copy-paste of the error log window in pfSense. Let me know if there's something else I can provide. pfsense_crash_pimd.7z

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff80ea0fd5
stack pointer = 0x0:0xfffffe00004d3960
frame pointer = 0x0:0xfffffe00004d3960
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (wg_tqg_0)
trap number = 12
panic: page fault
cpuid = 0
time = 1646999897
KDB: enter: panic

MrPete

@potata_netgato @jimp
I have discussed with the pimd author.

He's convinced that, since pimd is a userland app, there ought to be no way his app can cause a kernel panic.

The panic is in the "pim" area of the kernel...

What this indicates is the kernel is not fully vetting parameters for some kind of system call or setting. I'll try to come up with a bug report for BSD.

jimp

While it is true that a userland program should not be able to trigger a kernel panic, pimd should still ignore interfaces that don't exist. One doesn't absolve the other of having a bug in this case; they are both doing something they shouldn't.
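To illustrate the kind of check meant here, a minimal Python sketch follows. It is not pimd's code (pimd is written in C); split_by_presence and live_interfaces are made-up names, and the interface names in the example are hypothetical. The idea is simply to compare the configured list against what the system actually has before using any entry.

```python
import socket


def split_by_presence(configured, present):
    """Split configured interface names into (usable, missing) based on
    which ones are actually present on the system."""
    present = set(present)
    usable = [name for name in configured if name in present]
    missing = [name for name in configured if name not in present]
    return usable, missing


def live_interfaces():
    """Names of the interfaces the OS currently reports."""
    return [name for _index, name in socket.if_nameindex()]


# Hypothetical example: vtnet1.30 was deleted but is still in the config.
usable, missing = split_by_presence(
    ["vtnet0", "vtnet1.30"], ["vtnet0", "vtnet1", "lo0"]
)
print("use:", usable)
print("skip (not present):", missing)
```

In practice the second argument would come from live_interfaces(), and anything in the missing list would be logged and skipped rather than handed to the kernel.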