Kernel Panic - bxe Driver - Broadcom 10Gb/s NIC
-
Hello, I recently installed a dual port Broadcom chip based PCI-X network card in my hardware firewall. This was approximately 1 week ago.
Initially I was using only one port, connected to my WAN, and the machine ran stable for over one week. I was also running a custom compiled kernel module for the network card. Two days ago I configured the second port to connect to my LAN and it ran stable for one day. Today I have experienced three kernel panics so far. After the first kernel panic I removed the line that loads the module from /boot/loader.conf.local and used kldstat to confirm that the custom module was no longer loaded. The second and third kernel panics occurred with the default kernel driver for this Broadcom chipset.
This is part of the msgbuf.txt from the dump files.
Sleeping thread (tid 100120, pid 18361) owns a non-sleepable lock
KDB: stack backtrace of thread 100120:
sched_switch() at sched_switch+0x8ad/frame 0xfffffe04617932e0
mi_switch() at mi_switch+0xe6/frame 0xfffffe0461793310
sleepq_wait() at sleepq_wait+0x2c/frame 0xfffffe0461793340
_sx_xlock_hard() at _sx_xlock_hard+0x306/frame 0xfffffe04617933f0
bxe_ioctl() at bxe_ioctl+0x689/frame 0xfffffe0461793440
if_delmulti() at if_delmulti+0x125/frame 0xfffffe0461793480
vlan_setmulti() at vlan_setmulti+0x43/frame 0xfffffe04617934c0
vlan_ioctl() at vlan_ioctl+0x8c/frame 0xfffffe0461793540
inp_setmoptions() at inp_setmoptions+0x1711/frame 0xfffffe0461793710
ip_ctloutput() at ip_ctloutput+0x11d/frame 0xfffffe0461793760
rip_ctloutput() at rip_ctloutput+0x133/frame 0xfffffe0461793790
sosetopt() at sosetopt+0xb2/frame 0xfffffe04617937f0
kern_setsockopt() at kern_setsockopt+0xca/frame 0xfffffe0461793860
sys_setsockopt() at sys_setsockopt+0x24/frame 0xfffffe0461793880
amd64_syscall() at amd64_syscall+0xa38/frame 0xfffffe04617939b0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe04617939b0
--- syscall (105, FreeBSD ELF64, sys_setsockopt), rip = 0x80093195a, rsp = 0x7fffffffea28, rbp = 0x7fffffffea70 ---
panic: sleeping thread
cpuid = 2
KDB: enter: panic
From my limited understanding of the log, it seems I am experiencing the same issues in these threads from four years ago.
https://redmine.pfsense.org/issues/4685
https://forum.netgate.com/topic/87506/pfsense-2-2-x-panics-with-sleeping-thread-owns-a-non-sleepable-lock
As far as I can tell I am not running an ARP Proxy, and the bug was resolved in the 2.2.x branch of pfSense.
Can anyone provide any insight into what may have caused this?
Attached are the two sets of dump files: one with the custom kernel module (0) and one with the default kernel driver (2).
Thank you in advance for any help provided.
Mark.
-
Hmm, identical backtraces, definitely looks like a software issue:
db:0:kdb.enter.default> show pcpu
cpuid = 2
dynamic pcpu = 0xfffffe045c2a8380
curthread = 0xfffff80007465620: pid 12 "swi1: netisr 4"
curpcb = 0xfffffe03db1c3a80
fpcurthread = none
idlethread = 0xfffff800073ac000: tid 100005 "idle: cpu2"
curpmap = 0xffffffff82b85998
tssp = 0xffffffff82bb68e0
commontssp = 0xffffffff82bb68e0
rsp0 = 0xfffffe03db1c3a80
gs32p = 0xffffffff82bbd138
ldt = 0xffffffff82bbd178
tss = 0xffffffff82bbd168
db:0:kdb.enter.default> bt
Tracing pid 12 tid 100032 td 0xfffff80007465620
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe03db1c3510
vpanic() at vpanic+0x194/frame 0xfffffe03db1c3570
panic() at panic+0x43/frame 0xfffffe03db1c35d0
propagate_priority() at propagate_priority+0x2b2/frame 0xfffffe03db1c3600
turnstile_wait() at turnstile_wait+0x319/frame 0xfffffe03db1c3650
__rw_rlock_hard() at __rw_rlock_hard+0x292/frame 0xfffffe03db1c36e0
rip_input() at rip_input+0x2bb/frame 0xfffffe03db1c3750
igmp_input() at igmp_input+0x173/frame 0xfffffe03db1c3810
ip_input() at ip_input+0x139/frame 0xfffffe03db1c3870
swi_net() at swi_net+0x143/frame 0xfffffe03db1c38e0
intr_event_execute_handlers() at intr_event_execute_handlers+0xe9/frame 0xfffffe03db1c3920
ithread_loop() at ithread_loop+0xe7/frame 0xfffffe03db1c3970
fork_exit() at fork_exit+0x83/frame 0xfffffe03db1c39b0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe03db1c39b0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:0:kdb.enter.default> ps
And you only see that when both ports are assigned and in use?
It only links at 2.5G with the custom driver I assume? What was the second port being used for?
Steve
-
Hi Stephen, thank you for replying.
And you only see that when both ports are assigned and in use?
Yes, today was the first time I have ever experienced a kernel panic with pfSense, and I have run the distribution for about 5 years now.
It only links at 2.5G with the custom driver I assume? What was the second port being used for?
Yes, the custom driver was made to squeeze even more speed out of our 1.5Gbit/s fiber to the home lines. I have since removed the custom kernel module.
The second port on the card was not being used initially, because I was not able to figure out why it was not connecting to VLAN 1 by default. I had to explicitly create VLAN 1 and assign it to the second port (bxe1.1).
The way my pfSense server is connected to the internet is that the Bell provided GPON module is inserted into a Ubiquiti ES-16-XG switch on Port 1, and that module negotiates to a speed of 2.5 Gbps. I then have a SFP+ DAC going from port 2 on the switch to the Broadcom card in my pfSense server which negotiates to a speed of 10 Gbps. I think with my current setup I am not reaping the benefits of the custom driver.
Therefore I have Internet on VLAN 35 on Ports 1, 2 and 13 of the switch, and I have VLAN 1 on the same switch on ports 11, 12, 15 and 16 for LAN access. Both ports on the pfSense server are connected to the same switch but on explicit VLANs. These VLANs are not trunked together.
I will try posting an image of the switch's VLANs here.
One other thing I wanted to add is that I was running tcpdump captures on both bxe0 (WAN) and bxe1 (LAN) over the weekend, while trying to figure out why my IPTV service was not behaving correctly.
I hope this information helps.
-
Hmm, well I would definitely not use VLAN1. Better to never use it as a tagged VLAN. It's hard to imagine the card would balk at it, but it will not have been tested. If one of the ports was using it and you still have VLAN hardware tagging off-loading enabled, I could just about imagine that as an issue.
Yes, in that setup you would not be taking advantage of the driver. Though if the switch port can negotiate at 2.5Gb you're not losing anything either. The intention though is to have the Bell module directly in the Broadcom card I believe. I have no way to test that. I can only dream of those speeds!
Steve
-
VLAN hardware tagging off-loading enabled
I am unfamiliar with this option. I don't see it in the System -> Advanced -> Networking section nor in the System Tunables. Is this a driver specific option?
I don't see anything mentioned that is similar in the man page for the driver.
https://man.openbsd.org/FreeBSD-11.1/bxe.4
Thank you again for your ongoing help.
-
Check the ifconfig output for the bxe NICs for things like VLAN_HWTAGGING, VLAN_HWCSUM, VLAN_HWFILTER.
There's no GUI knob for that, but you can disable it if required. I'm not aware of any issue with it, but no one uses VLAN1, so...
Steve
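For anyone following along, here is a rough sketch of that check from a shell on the firewall. The helper name vlan_offloads is made up for illustration; it only filters whatever ifconfig prints, so it shows which VLAN offload capabilities appear in the output. The disable flags in the comments are standard FreeBSD ifconfig(8) parameters, but whether the bxe driver honours each of them, and whether this has any effect on the panic, is an assumption that has not been tested here.

```shell
# Hypothetical helper: reads `ifconfig <iface>` output on stdin and prints
# each VLAN hardware offload capability it mentions, one per line.
vlan_offloads() {
    grep -o 'VLAN_HW[A-Z]*' | sort -u
}

# Typical use on the firewall (interface names taken from this thread):
#   ifconfig bxe0 | vlan_offloads
#   ifconfig bxe1 | vlan_offloads
#
# To switch the offloads off for a test (assumes the driver accepts
# these ifconfig(8) flags; the change does not survive a reboot):
#   ifconfig bxe1 -vlanhwtag -vlanhwfilter -vlanhwcsum
```

If the helper prints nothing for an interface after the flags are cleared, the offloads are off for that port, and you can see whether the panics still occur.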