Recurring crash 2.4.5-RELEASE-p1

hp_inkjet

Hi,
I'm having trouble with one of my pfSense installations: it crashes roughly every 20 hours.

The instance giving me problems is the primary of my homelab HA cluster (the secondary shows the same problem if I power off the primary). The hypervisor is Proxmox, and I'm using virtio as the paravirtualized NIC, with checksum and offload disabled as best practice describes.

Looking at the crash report, I can see that the current thread at the time of the crash is the virtio irq thread of one of the NICs, but unfortunately my knowledge of reading crash logs ends there.

I've deployed four other HA clusters on Proxmox in the past, and none of them showed this kind of behavior.

Can someone more skilled than me suggest what the next troubleshooting step should be? I've attached the crash report: crash-report.zip

stephenw10

Ok you have numerous identical crashes there that all look like this:

db:0:kdb.enter.default>  show pcpu
cpuid        = 1
dynamic pcpu = 0xfffffe01967ae580
curthread    = 0xfffff80004df9620: pid 12 "irq264: virtio_pci2"
curpcb       = 0xfffffe00f48efcc0
fpcurthread  = none
idlethread   = 0xfffff80004975620: tid 100004 "idle: cpu1"
curpmap      = 0xffffffff834f1c40
tssp         = 0xffffffff835a3338
commontssp   = 0xffffffff835a3338
rsp0         = 0xfffffe00f48efcc0
gs32p        = 0xffffffff835a9f90
ldt          = 0xffffffff835a9fd0
tss          = 0xffffffff835a9fc0
tlb gen      = 15068852
db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100078 td 0xfffff80004df9620
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe00f48eef30
vpanic() at vpanic+0x19b/frame 0xfffffe00f48eef90
panic() at panic+0x43/frame 0xfffffe00f48eeff0
trap_pfault() at trap_pfault/frame 0xfffffe00f48ef040
trap_pfault() at trap_pfault+0x49/frame 0xfffffe00f48ef0a0
trap() at trap+0x29d/frame 0xfffffe00f48ef1b0
calltrap() at calltrap+0x8/frame 0xfffffe00f48ef1b0
--- trap 0xc, rip = 0xffffffff80f9214e, rsp = 0xfffffe00f48ef280, rbp = 0xfffffe00f48ef3c0 ---
pf_test_state_tcp() at pf_test_state_tcp+0x19ae/frame 0xfffffe00f48ef3c0
pf_test() at pf_test+0x2112/frame 0xfffffe00f48ef5e0
pf_check_in() at pf_check_in+0x1d/frame 0xfffffe00f48ef600
pfil_run_hooks() at pfil_run_hooks+0x90/frame 0xfffffe00f48ef690
ip_input() at ip_input+0x412/frame 0xfffffe00f48ef720
netisr_dispatch_src() at netisr_dispatch_src+0xa2/frame 0xfffffe00f48ef770
ether_demux() at ether_demux+0x15b/frame 0xfffffe00f48ef7a0
ether_nh_input() at ether_nh_input+0x32c/frame 0xfffffe00f48ef800
netisr_dispatch_src() at netisr_dispatch_src+0xa2/frame 0xfffffe00f48ef850
ether_input() at ether_input+0x26/frame 0xfffffe00f48ef870
vlan_input() at vlan_input+0x215/frame 0xfffffe00f48ef920
ether_demux() at ether_demux+0x144/frame 0xfffffe00f48ef950
ether_nh_input() at ether_nh_input+0x32c/frame 0xfffffe00f48ef9b0
netisr_dispatch_src() at netisr_dispatch_src+0xa2/frame 0xfffffe00f48efa00
ether_input() at ether_input+0x26/frame 0xfffffe00f48efa20
vtnet_rxq_eof() at vtnet_rxq_eof+0x7ae/frame 0xfffffe00f48efaf0
vtnet_rx_vq_intr() at vtnet_rx_vq_intr+0x71/frame 0xfffffe00f48efb20
intr_event_execute_handlers() at intr_event_execute_handlers+0xe9/frame 0xfffffe00f48efb60
ithread_loop() at ithread_loop+0xe7/frame 0xfffffe00f48efbb0
fork_exit() at fork_exit+0x83/frame 0xfffffe00f48efbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00f48efbf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:0:kdb.enter.default>  ps
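
For anyone reading a trace like this: the frames above the `--- trap 0xc` marker are the kernel's own fault handling. Trap 0xc (12) is a page fault on amd64. The frame directly below the marker, here pf_test_state_tcp(), is the code that was running when the fault hit. Below is a minimal Python sketch that pulls that frame out of a saved `bt` listing. The names `faulting_frame`, `FRAME_RE` and `TRAP_RE` are made up for illustration, and the regexes assume the line format shown in this report, not every possible ddb output.

```python
import re

# Frame lines in this report look like:
#   pf_test() at pf_test+0x2112/frame 0xfffffe00f48ef5e0
# (the "+0x..." offset is optional, e.g. "trap_pfault() at trap_pfault/frame ...")
FRAME_RE = re.compile(
    r"^(?P<func>\w+)\(\) at (?P=func)(?:\+0x[0-9a-f]+)?/frame 0x[0-9a-f]+$"
)

# Trap marker lines in this report look like:
#   --- trap 0xc, rip = 0x..., rsp = 0x..., rbp = 0x... ---
TRAP_RE = re.compile(r"^--- trap (?P<num>0x[0-9a-f]+|\d+),")


def faulting_frame(backtrace):
    """Return (trap_number, function_name) for the first trap marker.

    The function name is taken from the first frame line that follows
    the marker. Returns None if no marker or no following frame is found.
    """
    trap = None
    for raw in backtrace.splitlines():
        line = raw.strip()
        if trap is None:
            # Still in the panic/trap handling frames: look for the marker.
            match = TRAP_RE.match(line)
            if match:
                trap = int(match.group("num"), 0)
            continue
        # Past the marker: the next frame line is the interrupted code.
        match = FRAME_RE.match(line)
        if match:
            return trap, match.group("func")
    return None


# A few lines copied from the backtrace above.
SAMPLE = """\
trap() at trap+0x29d/frame 0xfffffe00f48ef1b0
calltrap() at calltrap+0x8/frame 0xfffffe00f48ef1b0
--- trap 0xc, rip = 0xffffffff80f9214e, rsp = 0xfffffe00f48ef280, rbp = 0xfffffe00f48ef3c0 ---
pf_test_state_tcp() at pf_test_state_tcp+0x19ae/frame 0xfffffe00f48ef3c0
pf_test() at pf_test+0x2112/frame 0xfffffe00f48ef5e0
"""

if __name__ == "__main__":
    print(faulting_frame(SAMPLE))
```

Running it over each crash in the report is a quick way to confirm that they all fault in the same function.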

This is not running a 2.5 development snap so I'm moving it to General for more exposure.

Steve

Gertjan

@hp_inkjet said in Recurring crash 2.4.5-RELEASE-p1:

virtio irq

Hi,

First things first: I'm not an expert.
Still, I think this advice will be useful: exclude whatever isn't really needed.
You have two choices: go bare metal or change the hypervisor.
Then you'll know whether or not it's the VM environment, and where to focus.

stephenw10

What's different about this config compared to the other installs?

You look to have bridges and TAP interfaces here. Are they common to all sites?
I assume the bridges are connecting the TAP interfaces to local subnets? Other bridge setups in HA can easily go horribly wrong!

Do you have any traffic shaping enabled here? It looks very similar to a previous bug that was AltQ related.

Steve

hp_inkjet

@stephenw10 First of all, thank you for your feedback. Nothing is really special about this install except for the presence of limiters (up/down on 2 guest networks).
The bridges connect 2 OVPN S2S to 2 local networks.

Could I be affected by the AltQ bug?

Matteo

stephenw10

It would be a new bug if so because this other one was fixed a long time ago: https://redmine.pfsense.org/issues/5473

Limiters are not AltQ either, but the similarity of the backtrace makes me think something must be the same there.

Can you remove/disable the Limiters long enough to test?

Steve

hp_inkjet

Sure, thank you.

I'll report back in a couple of days with the result,
Matteo

hp_inkjet

After one day the problem presented itself again: pfcrash.zip

stephenw10

So still the same identical crash.

And that was with limiters disabled? And no AltQ shaping?

hp_inkjet

Yes, no limiters or AltQ