Crash report - Fatal trap 12: page fault while in kernel mode (on VMWARE)

fresnoboy

Hi. Since doing an upgrade to 2.4.5, I have had 2 crashes over the last month.

This is running as a guest vm on a current patched ESXi 6.7U3 host. Any ideas?

stephenw10

Nothing obvious there unfortunately.

Could be something in pfatt.....

Panics seem to immediately follow Avahi crashing out. Could be cause or symptom. I would try running with that disabled though if you can.

Steve

fresnoboy

@stephenw10

Thanks for the reply. I can't run without avahi, because the house's music and AV systems all use Chromecasts...

When I looked at the logs, it looked like some devices changed MAC addresses. That could be a chromecast that dropped off an ethernet adapter and then came back on Wifi. But that appeared to happen after the reboot, or am I not looking at the log entry properly?

Was avahi crashing in both cases? Did avahi get upgraded when going from 2.4.4 to 2.4.5?

stephenw10

Yes Avahi will have been updated.
You are seeing some arp movement logs like:

arp: 10.1.30.150 moved from 54:60:09:c0:d1:4e to 00:e0:4c:36:86:d2 on vmx0.30

Since that's not Apple it could be an actual IP conflict.

Check what those MACs (and the others shown) belong to.
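
For anyone wanting to do that check across a whole log, here is a minimal Python sketch. It assumes the log lines follow the same "arp: ... moved from ... to ... on ..." shape as the example above; the function names (oui, summarize_arp_moves) are made up for illustration and are not part of pfSense. It groups the MACs seen per IP and interface and prints their vendor prefixes (the first three octets), which can then be looked up in an OUI database.

```python
import re
from collections import defaultdict

# Matches lines shaped like the "arp: <ip> moved from <mac> to <mac> on <iface>"
# example quoted in this thread (assumed format).
ARP_MOVED = re.compile(
    r"arp: (?P<ip>\d+\.\d+\.\d+\.\d+) moved from "
    r"(?P<old>[0-9a-f:]{17}) to (?P<new>[0-9a-f:]{17}) on (?P<iface>\S+)",
    re.IGNORECASE,
)


def oui(mac):
    """Return the first three octets (the vendor prefix) of a MAC address."""
    return mac.lower()[:8]


def summarize_arp_moves(lines):
    """Map each (ip, iface) pair to the set of MAC addresses seen for it."""
    seen = defaultdict(set)
    for line in lines:
        m = ARP_MOVED.search(line)
        if m:
            key = (m.group("ip"), m.group("iface"))
            seen[key].update({m.group("old").lower(), m.group("new").lower()})
    return dict(seen)


# The one log line quoted in this thread.
sample = [
    "arp: 10.1.30.150 moved from 54:60:09:c0:d1:4e to 00:e0:4c:36:86:d2 on vmx0.30"
]

for (ip, iface), macs in summarize_arp_moves(sample).items():
    print(ip, iface, sorted(oui(mac) for mac in macs))
```

An IP that only ever alternates between the same two MACs is more likely one device with two adapters than a real conflict, though that is a heuristic and not proof.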

Steve

fresnoboy

@stephenw10

Stephen, they are all Chromecast Audio devices. The 00:e0 MAC addresses are for the USB-connected Ethernet adapters. The 54:60 addresses are the built-in Wi-Fi adapters. All of them end up on VLAN 30, which is the vmx0.30 VLAN interface.

If the switch they are plugged into reboots, they can flip over to Wi-Fi, and then back to Ethernet when the switch comes back online. The MAC addresses are different, but the Chromecast will want the same IP address via DHCP since it's the same device and network.

But I update switches all the time (they are UniFi US-48s), and they don't crash pfSense when I do that. The last time it happened was at 1 AM local time, and no switch upgrade happened then.

I guess if the chromecast did a software update and rebooted, that could cause such a transition as well. Not sure why that should cause avahi trouble, and even if avahi crashed, why would it cause a kernel panic?

stephenw10

It shouldn't, I agree. And that looks like legitimate use of the same IP.

You may want to just stop logging those:
https://docs.netgate.com/pfsense/en/latest/troubleshooting/logs-arp-moved.html

The two crashes shown have different backtraces:

db:0:kdb.enter.default>  bt
Tracing pid 0 tid 100079 td 0xfffff80006654620
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe02387ef640
vpanic() at vpanic+0x19b/frame 0xfffffe02387ef6a0
panic() at panic+0x43/frame 0xfffffe02387ef700
bpf_buffer_append_mbuf() at bpf_buffer_append_mbuf+0x64/frame 0xfffffe02387ef730
catchpacket() at catchpacket+0x4b9/frame 0xfffffe02387ef7e0
bpf_mtap() at bpf_mtap+0x200/frame 0xfffffe02387ef850
ether_nh_input() at ether_nh_input+0xe9/frame 0xfffffe02387ef8b0
netisr_dispatch_src() at netisr_dispatch_src+0xa2/frame 0xfffffe02387ef900
ether_input() at ether_input+0x26/frame 0xfffffe02387ef920
if_input() at if_input+0xa/frame 0xfffffe02387ef930
em_rxeof() at em_rxeof+0x2e1/frame 0xfffffe02387ef9a0
em_handle_que() at em_handle_que+0x40/frame 0xfffffe02387ef9e0
taskqueue_run_locked() at taskqueue_run_locked+0x185/frame 0xfffffe02387efa40
taskqueue_thread_loop() at taskqueue_thread_loop+0xb8/frame 0xfffffe02387efa70
fork_exit() at fork_exit+0x83/frame 0xfffffe02387efab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe02387efab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

db:0:kdb.enter.default>  bt
Tracing pid 80764 tid 100760 td 0xfffff801e3d52620
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe0238ba3200
vpanic() at vpanic+0x19b/frame 0xfffffe0238ba3260
panic() at panic+0x43/frame 0xfffffe0238ba32c0
trap_pfault() at trap_pfault/frame 0xfffffe0238ba3310
trap_pfault() at trap_pfault+0x49/frame 0xfffffe0238ba3370
trap() at trap+0x29d/frame 0xfffffe0238ba3480
calltrap() at calltrap+0x8/frame 0xfffffe0238ba3480
--- trap 0xc, rip = 0xffffffff80d579c3, rsp = 0xfffffe0238ba3550, rbp = 0xfffffe0238ba3560 ---
m_tag_delete_chain() at m_tag_delete_chain+0x83/frame 0xfffffe0238ba3560
mb_dtor_pack() at mb_dtor_pack+0x11/frame 0xfffffe0238ba3570
uma_zfree_arg() at uma_zfree_arg+0x41/frame 0xfffffe0238ba35d0
mb_free_ext() at mb_free_ext+0x101/frame 0xfffffe0238ba3600
m_freem() at m_freem+0x48/frame 0xfffffe0238ba3620
vmxnet3_stop() at vmxnet3_stop+0x283/frame 0xfffffe0238ba3670
vmxnet3_init_locked() at vmxnet3_init_locked+0x27/frame 0xfffffe0238ba3700
vmxnet3_ioctl() at vmxnet3_ioctl+0x39c/frame 0xfffffe0238ba3740
ifhwioctl() at ifhwioctl+0x5f3/frame 0xfffffe0238ba37a0
ifioctl() at ifioctl+0x475/frame 0xfffffe0238ba3840
kern_ioctl() at kern_ioctl+0x267/frame 0xfffffe0238ba38b0
sys_ioctl() at sys_ioctl+0x15b/frame 0xfffffe0238ba3980
amd64_syscall() at amd64_syscall+0xa86/frame 0xfffffe0238ba3ab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe0238ba3ab0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x804e69fca, rsp = 0x7fffdfffbd58, rbp = 0x7fffdfffc5b0 ---

Seems to be in mbufs for the first one. There are no mbuf exhaustion messages but make sure you have that set to 1M and shown as such on the dashboard.
Looks almost exactly like this: https://forum.netgate.com/topic/147078/pfsense-reboot-kernel-panic-bpf_mcopy-v2-4-4-p3 No specific cause there though.
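
To compare backtraces like these quickly, a small Python sketch can pull out the function names and report the first frame that is not part of the generic panic/trap path. It assumes the DDB frame format shown above ("func() at func+0x.../frame 0x..."); the helper names and the PANIC_PLUMBING set are illustrative choices, not anything shipped with pfSense or FreeBSD.

```python
import re

# Matches frames such as "vpanic() at vpanic+0x19b/frame 0xfffffe02387ef6a0".
FRAME = re.compile(r"^(?P<func>\w+)\(\) at ")

# Frames that belong to the generic panic/trap path and say little about
# where the fault originated (an illustrative list, not exhaustive).
PANIC_PLUMBING = {"kdb_enter", "vpanic", "panic", "trap_pfault", "trap", "calltrap"}


def frame_functions(backtrace):
    """Return the function names in a DDB backtrace, innermost frame first."""
    funcs = []
    for line in backtrace.splitlines():
        m = FRAME.match(line.strip())
        if m:
            funcs.append(m.group("func"))
    return funcs


def first_interesting_frame(backtrace):
    """Return the first frame outside the panic/trap plumbing, or None."""
    for func in frame_functions(backtrace):
        if func not in PANIC_PLUMBING:
            return func
    return None


# Excerpts copied from the two backtraces in this thread.
first_crash = """\
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe02387ef640
vpanic() at vpanic+0x19b/frame 0xfffffe02387ef6a0
panic() at panic+0x43/frame 0xfffffe02387ef700
bpf_buffer_append_mbuf() at bpf_buffer_append_mbuf+0x64/frame 0xfffffe02387ef730
catchpacket() at catchpacket+0x4b9/frame 0xfffffe02387ef7e0
"""

second_crash = """\
panic() at panic+0x43/frame 0xfffffe0238ba32c0
trap_pfault() at trap_pfault+0x49/frame 0xfffffe0238ba3370
trap() at trap+0x29d/frame 0xfffffe0238ba3480
calltrap() at calltrap+0x8/frame 0xfffffe0238ba3480
--- trap 0xc, rip = 0xffffffff80d579c3, rsp = 0xfffffe0238ba3550, rbp = 0xfffffe0238ba3560 ---
m_tag_delete_chain() at m_tag_delete_chain+0x83/frame 0xfffffe0238ba3560
mb_dtor_pack() at mb_dtor_pack+0x11/frame 0xfffffe0238ba3570
"""

print(first_interesting_frame(first_crash))
print(first_interesting_frame(second_crash))
```

On these excerpts it points at bpf_buffer_append_mbuf for the first crash and m_tag_delete_chain for the second, so both faults land in mbuf handling even though the call paths differ.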

Steve

fresnoboy

@stephenw10 Thanks for looking into this. I do have 1M MBUFs set as reflected in the dashboard.

As per the thread, I have increased the frags limit to 10000 (it was set at 5000 which is the default), and will see if that helps anything.

I do have a gigabit fiber connection. It's not clear that should require changes to the defaults, but if there are things I need to change, I'm happy to try them.

The system has been running fine for 2 days now. I'll keep an eye on it and see if it stays stable.

I can't ever remember a crash in this configuration under 2.4.4. Were there any changes in 2.4.5 that could have caused a problem?

Also, I did install the latest set of critical VMware patches to 6.7U3 about a week before the first crash. Any chance that could have affected something? The system is using ECC memory, and I am not seeing errors, so I don't think the hardware is the cause.

stephenw10

Mmm, there are not any frags limit log entries in the message buffer so you probably don't need to increase that. It won't hurt though.

There are no specific issues I'm aware of with VMWare and 2.4.5/p1. Nor with VMWare updates.

Steve

fresnoboy

textdump.tar.2.zip @stephenw10

Well, I just had another outage. Same mbuf panic, and this with double the frags I had allocated before.

Textdump attached. Would love some ideas, or maybe I should revert back to 2.4.4? The config is backward compatible with 2.4.4, right?

thx!

stephenw10

No, current pfSense versions can import and update older config file versions but not the other way around. It might work OK.
But the other thread showing this was running 2.4.4-p3 anyway, so I would suggest going to a 2.5 snapshot if you're going to do anything.

Steve

fresnoboy

@stephenw10

Ok. It's easy enough to take a snapshot that I can revert to since it's a VMware guest. I will go try a 2.5 version and see if it is better. Do you think there are relevant changes in the 2.5 train that could address this, or is it just trying something newer?

Does this crash have any more helpful data than the other two?

stephenw10

No that crash looks pretty much identical.

There are a lot of changes in pfSense 2.5 due to the FreeBSD 12 base. There are a whole raft of NIC changes that could affect this.

Steve

fresnoboy

@stephenw10

That makes a ton of sense. Will try it out today.