Kernel panic when enabling CODELQ on multiple VLANs, and freeze on reboot while setting up routes
-
Hi all,

I am encountering a reproducible kernel panic when I enable a CODELQ traffic limiter on a third VLAN interface. The panic propagates to the secondary pfSense (latest version, 2.6) in the CARP HA cluster, so it is double trouble.

I have two CODELQ limiters enabled without issues: one on a physical interface (bge2) and one on a VLAN through another interface (bge0, VLAN 102). If I enable a third CODELQ limiter on a VLAN through the same interface (bge2) as the first, where a limiter is already enabled, the box crashes and burns, taking down the partner box because the config is synced immediately. The second box then behaves exactly like the first, which confirms the reproducibility.

The primary pfSense reboots but never completes the boot: it freezes at "applying routes" (if I recall the exact wording). The secondary (slave) reboots correctly because its config.xml is not found, so it reloads the previous one without issues and becomes CARP master. To recover from this mayhem I have to manually boot the first node in single-user mode, run an fsck, and copy the previously saved config.xml over the current one. It then reboots with the prior config, without the third limiter, and all is well.
Here is an excerpt of textdump.0. What catches the eye is the "bge2 taskq" thread, which makes me think the panic is related to, and caused by, the cascading limiter on the VLAN on the same interface:
db:1:lockinfo> show lockedvnods
Locked vnodes
db:0:kdb.enter.default> show pcpu
cpuid = 1
dynamic pcpu = 0xfffffe0080d98200
curthread = 0xfffff80004adf740: pid 0 tid 100065 "bge2 taskq"
curpcb = 0xfffff80004adfce0
fpcurthread = none
idlethread = 0xfffff80004619740: tid 100004 "idle: cpu1"
curpmap = 0xffffffff8368f6e8
tssp = 0xffffffff83719808
commontssp = 0xffffffff83719808
rsp0 = 0xfffffe00004edcc0
kcr3 = 0x8000000003d06002
ucr3 = 0xffffffffffffffff
scr3 = 0x368dfaeca
gs32p = 0xffffffff83720020
ldt = 0xffffffff83720060
tss = 0xffffffff83720050
tlb gen = 46299116
curvnet = 0
db:0:kdb.enter.default> bt
Tracing pid 0 tid 100065 td 0xfffff80004adf740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00004ed6d0
vpanic() at vpanic+0x197/frame 0xfffffe00004ed720
panic() at panic+0x43/frame 0xfffffe00004ed780
trap_fatal() at trap_fatal+0x391/frame 0xfffffe00004ed7e0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00004ed830
trap() at trap+0x286/frame 0xfffffe00004ed940
calltrap() at calltrap+0x8/frame 0xfffffe00004ed940
--- trap 0xc, rip = 0xffffffff80e2105b, rsp = 0xfffffe00004eda10, rbp = 0xfffffe00004eda20 ---
m_tag_delete_chain() at m_tag_delete_chain+0x5b/frame 0xfffffe00004eda20
uma_zfree_arg() at uma_zfree_arg+0x3a/frame 0xfffffe00004eda80
m_freem() at m_freem+0x9b/frame 0xfffffe00004edaa0
bge_txeof() at bge_txeof+0x5d/frame 0xfffffe00004edad0
bge_intr_task() at bge_intr_task+0x1e4/frame 0xfffffe00004edb20
taskqueue_run_locked() at taskqueue_run_locked+0x144/frame 0xfffffe00004edb80
taskqueue_thread_loop() at taskqueue_thread_loop+0xb6/frame 0xfffffe00004edbb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00004edbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00004edbf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

I can't really do any further testing, as this is at a customer site. Since both boxes crash, it causes connectivity problems until the second one has rebooted, which takes a good few minutes (DL380 G8s are not blazing fast at booting). I don't know whether there is already a Redmine issue for this, but I have seen this one:
https://redmine.pfsense.org/issues/5383
However, I don't think it is directly related, though it could be indirectly related. Hopefully someone can reproduce and confirm it.
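
In case it helps anyone comparing their own textdump against this one, here is a minimal Python sketch that pulls the trap number and the frames that were running at the time of the fault out of a ddb `bt` listing. It is not part of any pfSense or FreeBSD tooling: the function name `frames_at_fault` and the two regexes are my own assumptions, based only on the line format in the excerpt above. Trap 0xc on amd64 is a page fault.

```python
import re

# Backtrace lines copied from the textdump excerpt above (trimmed).
BACKTRACE = """\
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00004ed830
trap() at trap+0x286/frame 0xfffffe00004ed940
calltrap() at calltrap+0x8/frame 0xfffffe00004ed940
--- trap 0xc, rip = 0xffffffff80e2105b, rsp = 0xfffffe00004eda10, rbp = 0xfffffe00004eda20 ---
m_tag_delete_chain() at m_tag_delete_chain+0x5b/frame 0xfffffe00004eda20
uma_zfree_arg() at uma_zfree_arg+0x3a/frame 0xfffffe00004eda80
m_freem() at m_freem+0x9b/frame 0xfffffe00004edaa0
bge_txeof() at bge_txeof+0x5d/frame 0xfffffe00004edad0
bge_intr_task() at bge_intr_task+0x1e4/frame 0xfffffe00004edb20
"""

# Assumed line formats, inferred from the excerpt rather than from a spec.
FRAME_RE = re.compile(r"^(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+$")
TRAP_RE = re.compile(r"^--- trap (0x[0-9a-f]+|\d+),")


def frames_at_fault(text):
    """Return (trap_number, function names) for frames after the first trap marker."""
    trap_no = None
    frames = []
    for line in text.splitlines():
        line = line.strip()
        if trap_no is None:
            # Skip the panic/trap handler frames until the first trap marker.
            m = TRAP_RE.match(line)
            if m:
                trap_no = int(m.group(1), 0)  # accepts "0xc" as well as "0"
            continue
        if TRAP_RE.match(line):
            break  # a second marker ends the interrupted call chain
        m = FRAME_RE.match(line)
        if m:
            frames.append(m.group(1))
    return trap_no, frames


trap_no, frames = frames_at_fault(BACKTRACE)
print(trap_no, frames)
```

On the excerpt above this reports trap 12 with m_tag_delete_chain as the faulting frame, reached from bge_txeof via m_freem in the bge interrupt task.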