New Kernel Panic ... on Boot. pimd and/or interface-change related?
-
Here's a slightly strange one.
I've been running pfSense stably for months now in a "new" context: as a VM in Proxmox.
I'm using CARP, and my Backup box is working fine.
My primary box is suddenly crashing just about on boot. (Not quite: most packages etc. get configured, and the crash generally comes when it's about to "go live".) Or it runs for 1-2 hours, then crashes. It has never stayed up longer than that in the last 18 hours.
Of course I'd like any ideas on diagnosing.
Here's my question and thought:
- What changed: I was trying to get SR-IOV virtual NICs to work, and failed, so I undid that config in Proxmox (no changes to pfSense).
- The result of that: the PCI bus for my third (CARP/HA) interface got shifted from 00:02... to 00:03...
QUESTION: Is it possible that pfSense has some internal...something, that doesn't like a changed PCI bus number for an interface?
All thoughts MOST welcome.
Pete

Starting package acme...done.
<6>ng0: changing name to 'pppoe0'
<6>stf0: changing name to 'wan_stf'
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address = 0x2000
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff80eafeb5
stack pointer = 0x28:0xfffffe00020995a0
frame pointer = 0x28:0xfffffe00020995a0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq263: virtio_pci3)
trap number = 12
panic: page fault
cpuid = 2
time = 1645732301
KDB: enter: panic
And a bit of the panic dump:
Tracing pid 12 tid 100117 td 0xfffff8003004d000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe0002099260
vpanic() at vpanic+0x197/frame 0xfffffe00020992b0
panic() at panic+0x43/frame 0xfffffe0002099310
trap_fatal() at trap_fatal+0x391/frame 0xfffffe0002099370
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00020993c0
trap() at trap+0x286/frame 0xfffffe00020994d0
calltrap() at calltrap+0x8/frame 0xfffffe00020994d0
--- trap 0xc, rip = 0xffffffff80eafeb5, rsp = 0xfffffe00020995a0, rbp = 0xfffffe00020995a0 ---
if_inc_counter() at if_inc_counter+0x15/frame 0xfffffe00020995a0
if_simloop() at if_simloop+0xd1/frame 0xfffffe00020995e0
pim_input() at pim_input+0x409/frame 0xfffffe0002099640
encap_input() at encap_input+0xd1/frame 0xfffffe00020996b0
encap4_input() at encap4_input+0x28/frame 0xfffffe00020996e0
ip_input() at ip_input+0x168/frame 0xfffffe0002099790
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe00020997e0
ether_demux() at ether_demux+0x16a/frame 0xfffffe0002099810
ether_nh_input() at ether_nh_input+0x330/frame 0xfffffe0002099870
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe00020998c0
ether_input() at ether_input+0x4b/frame 0xfffffe00020998f0
vlan_input() at vlan_input+0x1f3/frame 0xfffffe0002099940
ether_demux() at ether_demux+0x153/frame 0xfffffe0002099970
ether_nh_input() at ether_nh_input+0x330/frame 0xfffffe00020999d0
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe0002099a20
ether_input() at ether_input+0x4b/frame 0xfffffe0002099a50
vtnet_rxq_eof() at vtnet_rxq_eof+0x7a5/frame 0xfffffe0002099b10
vtnet_rx_vq_process() at vtnet_rx_vq_process+0xb7/frame 0xfffffe0002099b50
ithread_loop() at ithread_loop+0x23c/frame 0xfffffe0002099bb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0002099bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0002099bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
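For anyone comparing dumps like this, here is a minimal Python sketch that pulls out just the function names that were executing when the fault hit, i.e. the frames after the first `--- trap 0x... ---` marker. The names `faulting_frames` and `SAMPLE` are made up for illustration, and the regular expressions only assume the `name() at name+0xNN/frame 0x...` shape seen in the trace above; other backtrace formats would need adjusting.

```python
import re

# Matches frames such as "pim_input() at pim_input+0x409/frame 0xfffffe0002099640".
FRAME_RE = re.compile(r"(\w+)\(\) at \1\+0x[0-9a-f]+/frame 0x[0-9a-f]+")
# Matches the marker separating the panic-handling frames from the faulting code.
TRAP_RE = re.compile(r"--- trap 0x[0-9a-f]+,[^-]*---")


def faulting_frames(trace):
    """Return function names after the first trap marker, innermost first.

    If no marker is found, every frame in the text is returned.
    """
    marker = TRAP_RE.search(trace)
    tail = trace[marker.end():] if marker else trace
    return [m.group(1) for m in FRAME_RE.finditer(tail)]


# A short excerpt of the trace above (works whether or not frames are on separate lines).
SAMPLE = (
    "trap() at trap+0x286/frame 0xfffffe00020994d0 "
    "calltrap() at calltrap+0x8/frame 0xfffffe00020994d0 "
    "--- trap 0xc, rip = 0xffffffff80eafeb5, rsp = 0xfffffe00020995a0, "
    "rbp = 0xfffffe00020995a0 --- "
    "if_inc_counter() at if_inc_counter+0x15/frame 0xfffffe00020995a0 "
    "if_simloop() at if_simloop+0xd1/frame 0xfffffe00020995e0 "
    "pim_input() at pim_input+0x409/frame 0xfffffe0002099640 "
    "encap_input() at encap_input+0xd1/frame 0xfffffe00020996b0"
)

if __name__ == "__main__":
    print(" <- ".join(faulting_frames(SAMPLE)))
```

Read innermost first, the excerpt shows the fault inside `if_inc_counter`, called from `if_simloop`, called from `pim_input`.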
-
pfSense doesn't care about bus locations, only the driver names. Even if that were a problem, it wouldn't cause a panic; it would make the interfaces mismatch.
That panic doesn't look familiar.
Did you have a snapshot of the VM before you started making changes? Rolling back to an old snapshot would undo any changes made to the VM config as well I believe.
A change in the VM settings could still be at fault, though I've not seen any of my Proxmox VMs crash like that with any combination of settings I've messed with in the past.
That or re-create the VM config from scratch (including the MAC addresses) and import the config and see if the panic still happens in a fresh VM.
-
-
Thanks, @jimp ...
I did a full VM restore from before the issues, and yes, that guarantees a restoration of the per-VM config. The panic above was from that restored VM. However, the changed PCI bus is still in place, since that's in "virtual" hardware.
I just realized there's a Linux host boot-time setting I can remove (pci=assign-busses) that enables the PCI bus shift. I'll try again with that removed to see if it makes any difference.
-
@jimp --
I have eliminated a few distractions:
- Intense memory test (just in case ;) )
- Restored PCI bus number

It still crashes.

I then looked at the panic info, and also the logs on my (nicely functioning) backup CARP box...
I am running pfSense 2.5.2 stable, with pimd 0.0.3_4.
Observations and Questions:
- 0.0.3_5 is now out. Apparently this is about ensuring there's no rc start file if PIMD is disabled. I've got it enabled...
- The log says pimd is starting twice at boot. Clearly that can cause some trouble. I thought this was fixed long before 0.0.3_4?
- Perhaps more telling: I have not touched my PIMD configuration in quite some time. I notice (in my running backup CARP, and in config.xml) that two VLAN interfaces I deleted a year ago are still listed in the pimd config...
- ...in fact, looking at config.xml, those two nonexistent VLAN interfaces are still present in a number of places:
  - active firewall rules
  - several package configs
  - and more
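For anyone wanting to check their own config for leftovers like these, here is a rough Python sketch that walks an XML document and reports every element whose text mentions a given interface name. It makes no assumptions about the pfSense schema: the sample XML, the tag names in it, and the name `opt5` are all invented for illustration, and `find_stale_refs` is a hypothetical helper, not a pfSense tool.

```python
import re
import xml.etree.ElementTree as ET


def find_stale_refs(xml_text, stale_names):
    """Return (element path, text) pairs whose text contains a stale name.

    Text is split on commas and whitespace so that "opt5" does not match "opt50".
    """
    stale = set(stale_names)
    hits = []

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        text = (elem.text or "").strip()
        if text and stale.intersection(re.split(r"[,\s]+", text)):
            hits.append((here, text))
        for child in elem:
            walk(child, here)

    walk(ET.fromstring(xml_text), "")
    return hits


# Made-up stand-in for a config file; NOT the real pfSense layout.
SAMPLE = """<config>
  <interfaces><lan><if>vtnet1</if></lan></interfaces>
  <filter><rule><interface>opt5</interface></rule></filter>
  <pkg><item><ifname>lan,opt5</ifname></item></pkg>
</config>"""

if __name__ == "__main__":
    for path, text in find_stale_refs(SAMPLE, {"opt5"}):
        print(f"{path}: {text}")
```

To run it against a real file, read the file into a string first and pass in the names of the interfaces you believe were deleted. It only finds references in element text, not in attributes or tag names.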
I notice that this panic seems to involve kernel pim_sm code... which makes me think perhaps that such invalid config info can be a bit more dangerous than one might imagine?
Happy to do some testing on this.... (Will keep the WAN/LAN port on this box disconnected from everything if possible. I don't want to kill our live connection... Hmm: can I manually from single-user-mode ensure this box doesn't become Primary CARP on boot?)
(If it would help, I can privately get a config.xml to you, with notes on the two missing interfaces. If deleting a VLAN interface should be "clean", this is a gold mine of MISSING FEATURES (bugs).)
-
@jimp Any thoughts on:
- Whether having invalid VLAN interfaces referenced in a config could cause various trouble, even a kernel panic?
- Whether it is known/expected that removal of VLAN interfaces in the GUI does not remove them from the various configuration sections?
I just want to understand how to move forward from here.
-
Hello,
Similar experience while prototyping pimd with IPsec and WireGuard. I didn't add any VLANs but had added and removed some interfaces. Attaching config file (nothing confidential) and a copy-paste of the error log window in pfSense. Let me know if there's something else I can provide.
pfsense_crash_pimd.7z
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff80ea0fd5
stack pointer = 0x0:0xfffffe00004d3960
frame pointer = 0x0:0xfffffe00004d3960
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (wg_tqg_0)
trap number = 12
panic: page fault
cpuid = 0
time = 1646999897
KDB: enter: panic
-
@potata_netgato @jimp
I have discussed this with the pimd author. He's convinced that, since pimd is a userland app, there ought to be no way his app can cause a kernel panic.
The panic is in the "pim" area of the kernel...
What this indicates is the kernel is not fully vetting parameters for some kind of system call or setting. I'll try to come up with a bug report for BSD.
-
While it is true that a userland program should not be able to trigger a kernel panic, pimd should still ignore interfaces that don't exist. One doesn't absolve the other of having a bug in this case; they are both doing something they shouldn't.
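To illustrate what "ignore interfaces that don't exist" could look like in general terms, here is a minimal Python sketch. It is not pimd's code and says nothing about how pimd or its pfSense package actually read their configuration; `usable_interfaces` and the interface names in the example are hypothetical. The only real API used is `socket.if_nameindex()`, which lists the interfaces the running system currently has.

```python
import socket


def usable_interfaces(configured, existing=None):
    """Split configured interface names into (kept, skipped).

    `existing` defaults to the interfaces present on this machine; pass a
    set explicitly to test the logic without touching the real system.
    """
    if existing is None:
        existing = {name for _index, name in socket.if_nameindex()}
    kept = [name for name in configured if name in existing]
    skipped = [name for name in configured if name not in existing]
    return kept, skipped


if __name__ == "__main__":
    # Hypothetical names: two live interfaces plus one VLAN that was deleted.
    kept, skipped = usable_interfaces(
        ["vtnet0", "vtnet1.20", "vtnet1.30"],
        existing={"vtnet0", "vtnet1", "vtnet1.20"},
    )
    print("enable on:", kept)
    print("ignoring (not present):", skipped)
```

The point of returning the skipped names rather than silently dropping them is that a daemon could log them, which would have made the stale entries visible much earlier.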