Kernel Panic - bxe Driver - Broadcom 10Gb/s NIC

MarkFarr

Hello, I recently installed a dual port Broadcom chip based PCI-X network card in my hardware firewall. This was approximately 1 week ago.

Initially I was using only one port, connected to my WAN, and the machine ran stably for over one week. I was also running a custom-compiled kernel module for the network card. Two days ago I configured the second port to connect to my LAN, and it ran stably for one day. Today I have experienced three kernel panics so far. After the first kernel panic I removed the line that loads the module from /boot/loader.conf.local and used kldstat to confirm the module was loaded before the change and unloaded after it. The second and third kernel panics occurred with the default kernel driver for this Broadcom chipset.

This is part of the msgbuf.txt from the dump files.

Sleeping thread (tid 100120, pid 18361) owns a non-sleepable lock
KDB: stack backtrace of thread 100120:
sched_switch() at sched_switch+0x8ad/frame 0xfffffe04617932e0
mi_switch() at mi_switch+0xe6/frame 0xfffffe0461793310
sleepq_wait() at sleepq_wait+0x2c/frame 0xfffffe0461793340
_sx_xlock_hard() at _sx_xlock_hard+0x306/frame 0xfffffe04617933f0
bxe_ioctl() at bxe_ioctl+0x689/frame 0xfffffe0461793440
if_delmulti() at if_delmulti+0x125/frame 0xfffffe0461793480
vlan_setmulti() at vlan_setmulti+0x43/frame 0xfffffe04617934c0
vlan_ioctl() at vlan_ioctl+0x8c/frame 0xfffffe0461793540
inp_setmoptions() at inp_setmoptions+0x1711/frame 0xfffffe0461793710
ip_ctloutput() at ip_ctloutput+0x11d/frame 0xfffffe0461793760
rip_ctloutput() at rip_ctloutput+0x133/frame 0xfffffe0461793790
sosetopt() at sosetopt+0xb2/frame 0xfffffe04617937f0
kern_setsockopt() at kern_setsockopt+0xca/frame 0xfffffe0461793860
sys_setsockopt() at sys_setsockopt+0x24/frame 0xfffffe0461793880
amd64_syscall() at amd64_syscall+0xa38/frame 0xfffffe04617939b0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe04617939b0
--- syscall (105, FreeBSD ELF64, sys_setsockopt), rip = 0x80093195a, rsp = 0x7fffffffea28, rbp = 0x7fffffffea70 ---
panic: sleeping thread
cpuid = 2
KDB: enter: panic

From my limited understanding of the log, it seems I am experiencing the same issue described in these threads from four years ago.

https://redmine.pfsense.org/issues/4685

https://forum.netgate.com/topic/87506/pfsense-2-2-x-panics-with-sleeping-thread-owns-a-non-sleepable-lock

As far as I can tell I am not running an ARP Proxy, and the bug was resolved in the 2.2.x branch of pfSense.

Can anyone provide any insight into what may have caused this?

Attached are the two sets of dump files: one with the custom kernel module (0) and one with the default kernel driver (2).

textdump.0.tar
textdump.2.tar

Thank you in advance for any help provided.
Mark.

stephenw10

Hmm, identical backtraces, definitely looks like a software issue:

db:0:kdb.enter.default>  show pcpu
cpuid        = 2
dynamic pcpu = 0xfffffe045c2a8380
curthread    = 0xfffff80007465620: pid 12 "swi1: netisr 4"
curpcb       = 0xfffffe03db1c3a80
fpcurthread  = none
idlethread   = 0xfffff800073ac000: tid 100005 "idle: cpu2"
curpmap      = 0xffffffff82b85998
tssp         = 0xffffffff82bb68e0
commontssp   = 0xffffffff82bb68e0
rsp0         = 0xfffffe03db1c3a80
gs32p        = 0xffffffff82bbd138
ldt          = 0xffffffff82bbd178
tss          = 0xffffffff82bbd168
db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100032 td 0xfffff80007465620
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe03db1c3510
vpanic() at vpanic+0x194/frame 0xfffffe03db1c3570
panic() at panic+0x43/frame 0xfffffe03db1c35d0
propagate_priority() at propagate_priority+0x2b2/frame 0xfffffe03db1c3600
turnstile_wait() at turnstile_wait+0x319/frame 0xfffffe03db1c3650
__rw_rlock_hard() at __rw_rlock_hard+0x292/frame 0xfffffe03db1c36e0
rip_input() at rip_input+0x2bb/frame 0xfffffe03db1c3750
igmp_input() at igmp_input+0x173/frame 0xfffffe03db1c3810
ip_input() at ip_input+0x139/frame 0xfffffe03db1c3870
swi_net() at swi_net+0x143/frame 0xfffffe03db1c38e0
intr_event_execute_handlers() at intr_event_execute_handlers+0xe9/frame 0xfffffe03db1c3920
ithread_loop() at ithread_loop+0xe7/frame 0xfffffe03db1c3970
fork_exit() at fork_exit+0x83/frame 0xfffffe03db1c39b0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe03db1c39b0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:0:kdb.enter.default>  ps

And you only see that when both ports are assigned and in use?

It only links at 2.5G with the custom driver I assume? What was the second port being used for?

Steve

MarkFarr

Hi Stephen, thank you for replying.

And you only see that when both ports are assigned and in use?

Yes. Today was the first time I have ever experienced a kernel panic with pfSense, and I have run the distribution for about 5 years.

It only links at 2.5G with the custom driver I assume? What was the second port being used for?

Yes, the custom driver was made to squeeze even more speed out of our 1.5 Gbit/s fiber-to-the-home lines. I have since removed the custom kernel module.

The second port on the card was not being used initially, because I was not able to figure out why it was not connecting to VLAN 1 by default. I had to explicitly create VLAN 1 and assign it to the second port (bxe1.1).

My pfSense server is connected to the internet as follows: the Bell-provided GPON module is inserted into port 1 of a Ubiquiti ES-16-XG switch, and that module negotiates a speed of 2.5 Gbps. An SFP+ DAC then runs from port 2 on the switch to the Broadcom card in my pfSense server, which negotiates a speed of 10 Gbps. I think with my current setup I am not reaping the benefits of the custom driver.

Therefore I have Internet on VLAN 35 on Ports 1, 2 and 13 of the switch, and I have VLAN 1 on the same switch on ports 11, 12, 15 and 16 for LAN access. Both ports on the pfSense server are connected to the same switch but on explicit VLANs. These VLANs are not trunked together.

I will try posting an image here of the switch's VLANs.

[Image: switch VLAN configuration, LAN]

[Image: switch VLAN configuration, WAN]

One other thing I wanted to add is that I was running tcpdump on both bxe0 (WAN) and bxe1 (LAN) over the weekend while trying to figure out why my IPTV service was not behaving correctly.

I hope this information helps.

stephenw10

Hmm, well I would definitely not use VLAN1. Better to never use it as a tagged VLAN. It's hard to imagine the card would balk at it, but it will not have been tested. If one of the ports was using it and you still have VLAN hardware tagging off-loading enabled, I could just about imagine that as an issue.

Yes, in that setup you would not be taking advantage of the driver. Though if the switch port can negotiate at 2.5Gb you're not losing anything either. The intention though is to have the Bell module directly in the Broadcom card I believe. I have no way to test that. I can only dream of those speeds!

Steve

MarkFarr

@stephenw10

VLAN hardware tagging off-loading enabled

I am unfamiliar with this option. I don't see it in the System -> Advanced -> Networking section nor in the System Tunables. Is this a driver specific option?

I don't see anything mentioned that is similar in the man page for the driver.
https://man.openbsd.org/FreeBSD-11.1/bxe.4

Thank you again for your ongoing help.

stephenw10

Check the ifconfig output for the bxe NICs for things like VLAN_HWTAGGING,VLAN_HWCSUM,VLAN_HWFILTER.
There's no GUI knob for that but you can disable it if required. I'm not aware of any issue with it, but no one uses VLAN1, so...
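
As a rough sketch of that check, something like the following could be run from a shell on the firewall. The helper name vlan_offloads is made up for illustration, and bxe1 is assumed to be the port carrying VLAN 1; the -vlanhwtag, -vlanhwfilter and -vlanhwcsum options are the ones documented in FreeBSD's ifconfig(8), but whether turning them off has any effect on this panic is untested.

```shell
# Hypothetical helper: reads ifconfig output on stdin and prints each
# VLAN_* option found on the "options=" line, one per line.
vlan_offloads() {
    grep -o 'options=[0-9a-f]*<[^>]*>' | tr '<,>' '\n\n\n' | grep '^VLAN_'
}

# Assumed usage on the firewall (bxe1 is an assumption, adjust as needed):
#   ifconfig bxe1 | vlan_offloads
#
# If VLAN_HWTAGGING, VLAN_HWFILTER or VLAN_HWCSUM show up, they can be
# switched off at runtime with:
#   ifconfig bxe1 -vlanhwtag -vlanhwfilter -vlanhwcsum
```

A change made this way would not survive a reboot, so it is only useful for testing whether the panics stop.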

Steve