Kernel Panic - bxe Driver - Broadcom 10Gb/s NIC



  • Hello, approximately one week ago I installed a dual-port, Broadcom-based PCI-X network card in my hardware firewall.

    Initially I used only one port, connected to my WAN, and the machine ran stable for over one week. I was also running a custom-compiled kernel module for the network card. Two days ago I configured the second port to connect to my LAN, and it ran stable for one day. Today I have experienced three kernel panics so far. After the first kernel panic I removed the line that loads the module from /boot/loader.conf.local and confirmed with kldstat that the module was no longer loaded. The second and third kernel panics occurred with the default kernel driver for this Broadcom chipset.

    This is part of the msgbuf.txt from the dump files.

    Sleeping thread (tid 100120, pid 18361) owns a non-sleepable lock
    KDB: stack backtrace of thread 100120:
    sched_switch() at sched_switch+0x8ad/frame 0xfffffe04617932e0
    mi_switch() at mi_switch+0xe6/frame 0xfffffe0461793310
    sleepq_wait() at sleepq_wait+0x2c/frame 0xfffffe0461793340
    _sx_xlock_hard() at _sx_xlock_hard+0x306/frame 0xfffffe04617933f0
    bxe_ioctl() at bxe_ioctl+0x689/frame 0xfffffe0461793440
    if_delmulti() at if_delmulti+0x125/frame 0xfffffe0461793480
    vlan_setmulti() at vlan_setmulti+0x43/frame 0xfffffe04617934c0
    vlan_ioctl() at vlan_ioctl+0x8c/frame 0xfffffe0461793540
    inp_setmoptions() at inp_setmoptions+0x1711/frame 0xfffffe0461793710
    ip_ctloutput() at ip_ctloutput+0x11d/frame 0xfffffe0461793760
    rip_ctloutput() at rip_ctloutput+0x133/frame 0xfffffe0461793790
    sosetopt() at sosetopt+0xb2/frame 0xfffffe04617937f0
    kern_setsockopt() at kern_setsockopt+0xca/frame 0xfffffe0461793860
    sys_setsockopt() at sys_setsockopt+0x24/frame 0xfffffe0461793880
    amd64_syscall() at amd64_syscall+0xa38/frame 0xfffffe04617939b0
    fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe04617939b0
    --- syscall (105, FreeBSD ELF64, sys_setsockopt), rip = 0x80093195a, rsp = 0x7fffffffea28, rbp = 0x7fffffffea70 ---
    panic: sleeping thread
    cpuid = 2
    KDB: enter: panic
    

    From my limited understanding of the log, it seems I am experiencing the same issue described in these threads from four years ago:

    https://redmine.pfsense.org/issues/4685

    https://forum.netgate.com/topic/87506/pfsense-2-2-x-panics-with-sleeping-thread-owns-a-non-sleepable-lock

    As far as I can tell I am not running an ARP Proxy, and the bug was resolved in the 2.2.x branch of pfSense.

    Can anyone provide any insight into what may have caused this?

    Attached are the two sets of dump files: one from the custom kernel module (0) and one from the default kernel driver (2).

    textdump.0.tar
    textdump.2.tar

    Thank you in advance for any help provided.
    Mark.


  • Netgate Administrator

    Hmm, identical backtraces, definitely looks like a software issue:

    db:0:kdb.enter.default>  show pcpu
    cpuid        = 2
    dynamic pcpu = 0xfffffe045c2a8380
    curthread    = 0xfffff80007465620: pid 12 "swi1: netisr 4"
    curpcb       = 0xfffffe03db1c3a80
    fpcurthread  = none
    idlethread   = 0xfffff800073ac000: tid 100005 "idle: cpu2"
    curpmap      = 0xffffffff82b85998
    tssp         = 0xffffffff82bb68e0
    commontssp   = 0xffffffff82bb68e0
    rsp0         = 0xfffffe03db1c3a80
    gs32p        = 0xffffffff82bbd138
    ldt          = 0xffffffff82bbd178
    tss          = 0xffffffff82bbd168
    db:0:kdb.enter.default>  bt
    Tracing pid 12 tid 100032 td 0xfffff80007465620
    kdb_enter() at kdb_enter+0x3b/frame 0xfffffe03db1c3510
    vpanic() at vpanic+0x194/frame 0xfffffe03db1c3570
    panic() at panic+0x43/frame 0xfffffe03db1c35d0
    propagate_priority() at propagate_priority+0x2b2/frame 0xfffffe03db1c3600
    turnstile_wait() at turnstile_wait+0x319/frame 0xfffffe03db1c3650
    __rw_rlock_hard() at __rw_rlock_hard+0x292/frame 0xfffffe03db1c36e0
    rip_input() at rip_input+0x2bb/frame 0xfffffe03db1c3750
    igmp_input() at igmp_input+0x173/frame 0xfffffe03db1c3810
    ip_input() at ip_input+0x139/frame 0xfffffe03db1c3870
    swi_net() at swi_net+0x143/frame 0xfffffe03db1c38e0
    intr_event_execute_handlers() at intr_event_execute_handlers+0xe9/frame 0xfffffe03db1c3920
    ithread_loop() at ithread_loop+0xe7/frame 0xfffffe03db1c3970
    fork_exit() at fork_exit+0x83/frame 0xfffffe03db1c39b0
    fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe03db1c39b0
    --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
    db:0:kdb.enter.default>  ps
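
The "identical backtraces" comparison can also be done mechanically. The following is a minimal Python sketch, not part of the original posts, that extracts the function names from `func() at func+0x.../frame 0x...` lines so the call chains from two dumps can be compared. The helper name `frame_functions` is hypothetical, and the sample lines are copied from the backtrace above. It assumes the input holds a single backtrace.

```python
import re

# Matches DDB frame lines such as:
#   vpanic() at vpanic+0x194/frame 0xfffffe03db1c3570
FRAME_RE = re.compile(r"^\s*(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+\s*$")

def frame_functions(text):
    """Return the function name of each backtrace frame, in the order printed."""
    names = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            names.append(match.group(1))
    return names

sample = """\
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe03db1c3510
vpanic() at vpanic+0x194/frame 0xfffffe03db1c3570
panic() at panic+0x43/frame 0xfffffe03db1c35d0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
"""

print(frame_functions(sample))
```

Two dumps share a call chain when `frame_functions(a) == frame_functions(b)`; offsets and frame addresses are deliberately ignored, since those can differ between runs.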
    

    And you only see that when both ports are assigned and in use?

    It only links at 2.5G with the custom driver I assume? What was the second port being used for?

    Steve



  • Hi Stephen, thank you for replying.

    > And you only see that when both ports are assigned and in use?

    Yes. Today was the first time I have ever experienced a kernel panic with pfSense, and I have run the distribution for about 5 years.

    > It only links at 2.5G with the custom driver I assume? What was the second port being used for?

    Yes, the custom driver was made to squeeze even more speed out of our 1.5 Gbit/s fiber-to-the-home lines. I have since removed the custom kernel module.

    The second port on the card was not being used initially, because I was not able to figure out why it was not connecting to VLAN 1 by default. I had to explicitly create VLAN 1 and assign it to the second port (bxe1.1).

    My pfSense server is connected to the internet as follows: the Bell-provided GPON module is inserted into port 1 of a Ubiquiti ES-16-XG switch, where it negotiates a speed of 2.5 Gbps. An SFP+ DAC then runs from port 2 on the switch to the Broadcom card in my pfSense server, which negotiates a speed of 10 Gbps. I think that with my current setup I am not reaping the benefits of the custom driver.

    Therefore I have Internet on VLAN 35 on Ports 1, 2 and 13 of the switch, and I have VLAN 1 on the same switch on ports 11, 12, 15 and 16 for LAN access. Both ports on the pfSense server are connected to the same switch but on explicit VLANs. These VLANs are not trunked together.

    I will try posting images of the switch's VLANs here.

    [Image: switch VLAN configuration, LAN]

    [Image: switch VLAN configuration, WAN]

    One other thing I wanted to add is that I was running tcpdump captures on both bxe0 (WAN) and bxe1 (LAN) over the weekend while trying to figure out why my IPTV service was not behaving correctly.

    I hope this information helps.


  • Netgate Administrator

    Hmm, well I would definitely not use VLAN1. Better to never use it as a tagged VLAN. It's hard to imagine the card would balk at it, but it will not have been tested. If one of the ports was using it and you still have VLAN hardware tagging off-loading enabled, I could just about imagine that as an issue.

    Yes, in that setup you would not be taking advantage of the driver. Though if the switch port can negotiate at 2.5Gb you're not losing anything either. The intention though is to have the Bell module directly in the Broadcom card I believe. I have no way to test that. I can only dream of those speeds! 😉

    Steve



  • @stephenw10

    > VLAN hardware tagging off-loading enabled

    I am unfamiliar with this option. I don't see it in the System -> Advanced -> Networking section nor in the System Tunables. Is this a driver specific option?

    I don't see anything mentioned that is similar in the man page for the driver.
    https://man.openbsd.org/FreeBSD-11.1/bxe.4

    Thank you again for your ongoing help.


  • Netgate Administrator

    Check the ifconfig output for the bxe NICs for things like VLAN_HWTAGGING,VLAN_HWCSUM,VLAN_HWFILTER.
    There's no GUI knob for that but you can disable it if required. I'm not aware of any issue with it, but no one uses VLAN1, so...
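
A minimal Python sketch of that check, assuming the usual FreeBSD `ifconfig` layout in which capabilities appear on an `options=<hex><FLAG,FLAG,...>` line. The helper name `vlan_offloads` and the sample output are illustrative only and were not taken from the affected machine.

```python
import re

# The three capabilities named above.
VLAN_OFFLOADS = ("VLAN_HWTAGGING", "VLAN_HWCSUM", "VLAN_HWFILTER")

def vlan_offloads(ifconfig_output):
    """Return the VLAN hardware offload flags enabled in `ifconfig <nic>` output."""
    match = re.search(r"options=[0-9a-fA-F]+<([^>]*)>", ifconfig_output)
    if match is None:
        return []
    enabled = set(match.group(1).split(","))
    return [flag for flag in VLAN_OFFLOADS if flag in enabled]

# Illustrative output only; the real text would come from `ifconfig bxe1`.
sample = """\
bxe1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
\toptions=1bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4>
"""

print(vlan_offloads(sample))
```

If any of these flags are listed, they can be switched off from a shell with `ifconfig bxe1 -vlanhwtag -vlanhwcsum -vlanhwfilter`. That is a runtime change, so it would need to be reapplied after a reboot.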

    Steve

