Proper procedure for adding a NIC kernel module? (qlnxe)
-
I have a server with a Qlogic FastLinQ 41000. For some reason it was not detected out of the box when I initially installed CE 2.7.0, so I set up our network using a LAGG of 2 Intel I350 instead and got things running. Fast forward a couple of months to now, on CE 2.7.2, and I'm looking into why the Qlogic card isn't being detected. I checked pciconf and it's there at the bottom as a "none" device. Some googling turned up the qlnxe driver (https://man.freebsd.org/cgi/man.cgi?query=qlnxe&apropos=0&sektion=4&manpath=FreeBSD+14.0-RELEASE&arch=default&format=html), so it looks like I just need a kernel module to get this thing going. I ran "kldload if_qlnxe" to see if the card would get detected, but instead the system crashed with a page fault. More googling found this page on the forum (https://forum.netgate.com/post/1037980), which says I need to create a loader.conf.local after install, before reboot.
My question boils down to this: did I cause the crash by using kldload after boot, or is there a greater issue with my system and the qlnxe module? Is there a specific procedure for adding a module like this?
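For reference, the boot-time approach from that forum post would look roughly like the sketch below. The `if_qlnxe_load="YES"` line is the loader.conf form given in the qlnxe(4) man page. The helper name `add_loader_line` and the append-only-if-missing logic are my own illustration, not an official procedure.

```sh
#!/bin/sh
# Sketch: have the loader pull in if_qlnxe at boot instead of
# running kldload on a live system.
# add_loader_line is a hypothetical helper: it appends a line to a
# file only if that exact line is not already present.
add_loader_line() {
    file="$1"
    line="$2"
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    if ! grep -qxF "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Intended use (as root), followed by a reboot:
#   add_loader_line /boot/loader.conf.local 'if_qlnxe_load="YES"'
```

Calling the helper twice leaves a single entry in the file, so it should be safe to rerun after an upgrade.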
-
Did it fail to load the module when you ran it from the CLI?
If you run
kldstat
you should see it loaded.
What was the crash after adding the loader value? Do you have a crash report?
Any reason you're not running 2.7.2?
Steve
-
@stephenw10 said in Proper procedure for adding a NIC kernel module? (qlnxe):
Did it fail to load the module when you ran it from the CLI?
If you run kldstat you should see it loaded.
I think it started to load it and crashed. Locked up such that it wouldn't accept any keyboard input to check kldstat. We power cycled the machine which cleared it out and everything came back up as before.
@stephenw10 said in Proper procedure for adding a NIC kernel module? (qlnxe):
What was the crash after adding the loader value? Do you have a crash report?
I have the crash report, just wanted to know if "that should have worked" before I bothered people with it.
@stephenw10 said in Proper procedure for adding a NIC kernel module? (qlnxe):
Any reason you're not running 2.7.2?
We are, but we started at 2.7.0 and have upgraded a few times. Just want to include that in case the upgrade path did something weird.
-
Ah OK, well I'd check the backtrace in the crash report first. It may be a known bug in the driver.
-
@stephenw10 said in Proper procedure for adding a NIC kernel module? (qlnxe):
Do you have a crash report?
-
@stephenw10 said in Proper procedure for adding a NIC kernel module? (qlnxe):
Ah OK, well I'd check the backtrace in the crash report first. It may be a known bug in the driver.
I think the problem is at the ??(). That seems like a weird function name to me.
-
Yup so that definitely crashed trying to attach the driver:
db:0:kdb.enter.default> bt
Tracing pid 55563 tid 116689 td 0xfffffe0382ce93a0
kdb_enter() at kdb_enter+0x32/frame 0xfffffe03c0298300
vpanic() at vpanic+0x163/frame 0xfffffe03c0298430
panic() at panic+0x43/frame 0xfffffe03c0298490
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe03c02984f0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe03c0298550
calltrap() at calltrap+0x8/frame 0xfffffe03c0298550
--- trap 0xc, rip = 0, rsp = 0xfffffe03c0298628, rbp = 0xfffffe03c0298650 ---
??() at 0/frame 0xfffffe03c0298650
dump_iface() at dump_iface+0x145/frame 0xfffffe03c0298700
rtnl_handle_ifevent() at rtnl_handle_ifevent+0xa9/frame 0xfffffe03c0298780
if_attach_internal() at if_attach_internal+0x3cf/frame 0xfffffe03c02987d0
ether_ifattach() at ether_ifattach+0x2c/frame 0xfffffe03c0298810
qlnx_init_ifnet() at qlnx_init_ifnet+0x2c6/frame 0xfffffe03c0298860
qlnx_pci_attach() at qlnx_pci_attach+0x7d9/frame 0xfffffe03c0298900
device_attach() at device_attach+0x3be/frame 0xfffffe03c0298950
device_probe_and_attach() at device_probe_and_attach+0x41/frame 0xfffffe03c0298980
pci_driver_added() at pci_driver_added+0xf2/frame 0xfffffe03c02989c0
devclass_driver_added() at devclass_driver_added+0x39/frame 0xfffffe03c0298a00
devclass_add_driver() at devclass_add_driver+0x11e/frame 0xfffffe03c0298a40
module_register_init() at module_register_init+0x85/frame 0xfffffe03c0298a70
linker_load_module() at linker_load_module+0xbd5/frame 0xfffffe03c0298d70
kern_kldload() at kern_kldload+0x16a/frame 0xfffffe03c0298dd0
sys_kldload() at sys_kldload+0x5c/frame 0xfffffe03c0298e00
amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe03c0298f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe03c0298f30
--- syscall (304, FreeBSD ELF64, kldload), rip = 0x183cac2d58aa, rsp = 0x183caa53f3e8, rbp = 0x183caa53f960 ---
ql0: <Qlogic 10GbE/25GbE/40GbE PCI CNA (AH) Adapter-Ethernet Function v2.0.112> mem 0xfb820000-0xfb83ffff,0xfb000000-0xfb7fffff,0xfb850000-0xfb85ffff at device 0.0 numa-domain 1 on pci10
ql0: qlnx_set_personality: ETH_IWARP
ql0: setting parameters required by iWARP dev
Fatal trap 12: page fault while in kernel mode
cpuid = 23; apic id = 34
fault virtual address = 0x0
fault code = supervisor read instruction, page not present
instruction pointer = 0x20:0x0
stack pointer = 0x0:0xfffffe03c0298628
frame pointer = 0x0:0xfffffe03c0298650
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 55563 (kldload)
rdi: fffff815bd17b800 rsi: fffffe03c02986a0 rdx: 00000000c0306938
rcx: 00000000c0306938 r8: 0000000000000000 r9: 0000000000000010
rax: 0000000000000000 rbx: fffffe03c02986a0 rbp: fffffe03c0298650
r10: 0000000000000000 r11: fffffe00e6ce8000 r12: 0000000000008802
r13: fffff81081a15810 r14: fffffe03b48fcf90 r15: 0000000000000016
trap number = 12
panic: page fault
cpuid = 23
time = 1717620079
KDB: enter: panic
That doesn't appear to be a known bug: https://bugs.freebsd.org/bugzilla/buglist.cgi?quicksearch=qlnxe
-
@stephenw10 I swear I'm an edge case magnet.
-
What is that NIC exactly?
-
@stephenw10 Exactly? I'm not sure. It's a Qlogic FastLinQ 41000 series 2-port SFP card, which makes it a QL41132HLCU, QL41212HLCU, or QL41262HLCU going by the Qlogic datasheet. I'm betting on the QL41132HLCU, as we wanted 10G cards and the other 2 models are 10G/25G cards. I'll need to dig in the firmware or the purchase orders to figure it out exactly. I will get back to you.
Sounds like this is a FreeBSD issue and not something weird I did, at least. Any idea why this wasn't detected on the initial install?
-
@stephenw10 said in Proper procedure for adding a NIC kernel module? (qlnxe):
What is that NIC exactly?
My speculation was correct: it is a Qlogic FastLinQ QL41132HLCU exactly.
-
@stephenw10 I've not done any detailed digging, but there's been at least one bug fix in dump_iface() not too long ago to fix similar crashes:
commit 7d48224073ce14f0dd3db2d4e96876ac928b52f2
Author: Bjoern A. Zeeb <bz@FreeBSD.org>
Date: Sat Sep 30 15:11:57 2023 +0000
netlink: fix accessing freed memory
The check for if_addrlen in dump_iface() is not sufficient to determine if we still have a valid if_addr. Rather than directly accessing if_addr check the STAILQ (for the first entry). This avoids panics when destroying cloned interfaces as experienced with net80211 wlan ones.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Reviewed by: jhibbits (earlier version), kp
Differential Revision: https://reviews.freebsd.org/D42027
It's certainly worth testing a 2.8 snapshot before we dig deeper.
-
@kprovost said in Proper procedure for adding a NIC kernel module? (qlnxe):
It's certainly worth testing a 2.8 snapshot before we dig deeper.
Would that fix be in the latest PF+? This is a production machine with lots of work happening, but I'm poking my management chain about paying for support.
-
@GeorgePatches That particular patch is in 24.03, yes.
-
Hmm, I wonder if we can do something to avoid that bug as a test.
-
@stephenw10 Hmmmmm, one thought is that it blew up on the dummynet code. I can try ripping the limiters out and see if it still blows up.
-
This thought was wrong: it blew up exactly the same with the limiters removed and the dummynet modules not loaded.
-
Well one thing ruled out I guess!
-
There's no easy way to like try a 2.8 snap and then roll back to 2.7.2, right? You can do that with PF+, if I understand the bootloader thing correctly?
I ask because management has approved our initial request for a support contract. We're currently waiting on a quote and then actual approval and purchasing. I'm ok putting a pin in this until it's easier to test a snap and roll back. This card is a nice to have, we're currently "doing fine" with our LAGG'd gigabit links.
-
You can manually create ZFS snapshots at the CLI in CE, assuming you are running ZFS. However there are no public 2.8-dev snapshots yet.
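As a rough sketch of what that could look like: the helper below only builds the `zfs snapshot -r` command so it can be reviewed before anything runs. The helper name `snapshot_cmd`, the pool name `pfSense`, and the label `pre-test` are assumptions for illustration; check the real pool name with `zpool list` first. Rolling back afterwards is a separate step and is not shown here.

```sh
#!/bin/sh
# Sketch: build a recursive ZFS snapshot command for review.
# snapshot_cmd is a hypothetical helper; it prints the command
# and does not touch the pool itself.
snapshot_cmd() {
    pool="$1"    # pool name (assumption: confirm with `zpool list`)
    label="$2"   # free-form snapshot label
    printf 'zfs snapshot -r %s@%s\n' "$pool" "$label"
}

# Print the command; run the printed line as root to take the snapshot.
snapshot_cmd pfSense pre-test
```

Once a snapshot has been taken, `zfs list -t snapshot` lists it.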