Mellanox ConnectX-4 LX causing hard panic on boot intermittently
-
Hi all, first post (please move if I'm in the wrong forum)
I'm trying to evaluate the latest version, 2.7.2, with a ConnectX-4 LX MCX4121A-ACAT 25G card, but I suspect FreeBSD is causing problems. Occasionally boot works fine; other times I get a 10-15 minute cycle of timeouts, and occasionally this results in a kernel panic and reboot. Is anyone else seeing this? The only result I can find on Google is on Redmine.
2024-02-09T12:18:44 Notice kernel <6>mce0: link state changed to DOWN
2024-02-09T12:18:44 Notice kernel <6>mce0: Ethernet address: 50:6b:4b:28:7f:af
2024-02-09T12:18:44 Notice kernel mlx5_core0: Failed to initialize SR-IOV support, error 2
2024-02-09T12:18:44 Notice kernel mlx5_core: INFO: (mlx5_core0): E-Switch: Total vports 1, l2 table size(65536), per vport: max uc(128) max mc(2048)
2024-02-09T12:18:44 Notice kernel mlx5_core0: INFO: init_one:1658:(pid 0): cannot find SR-IOV PCIe cap
2024-02-09T12:18:44 Notice kernel mlx5_core0: INFO: mlx5_port_module_event:705:(pid 12): Module 1, status: plugged and enabled
2024-02-09T12:18:44 Notice kernel mlx5_core0: <mlx5_core> mem 0xf8000000-0xf9ffffff irq 17 at device 0.1 on pci2
2024-02-09T12:18:44 Notice kernel pci0:2:0:0: Device leaked 11 MSI-X vectors
2024-02-09T12:18:44 Notice kernel pci0:2:0:0: Device leaked IRQ resources
2024-02-09T12:18:44 Notice kernel device_attach: mlx5_core0 attach returned 60
2024-02-09T12:18:44 Notice kernel mlx5_core0: ERR: init_one:1644:(pid 0): mlx5_load_one failed -60
2024-02-09T12:18:44 Notice kernel mlx5_core0: ERR: mlx5_load_one:1231:(pid 0): tear_down_hca failed, skip cleanup
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: wait_func:965:(pid 0): TEARDOWN_HCA(0x103) timeout. Will cause a leak of a command resource
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: up_rel_func:87:(pid 0): failed to free uar index 16
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: wait_func:965:(pid 0): DEALLOC_UAR(0x803) timeout. Will cause a leak of a command resource
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: mlx5_destroy_unmap_eq:517:(pid 0): failed to destroy a previously created eq: eqn 6
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: wait_func:965:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
2024-02-09T12:18:44 Notice kernel mlx5_core0: ERR: mlx5_load_one:1163:(pid 0): Failed to alloc completion EQs
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: free_comp_eqs:669:(pid 0): failed to destroy EQ 0xa
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: mlx5_destroy_unmap_eq:517:(pid 0): failed to destroy a previously created eq: eqn 10
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: wait_func:965:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: free_comp_eqs:669:(pid 0): failed to destroy EQ 0x9
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: mlx5_destroy_unmap_eq:517:(pid 0): failed to destroy a previously created eq: eqn 9
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: wait_func:965:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: pages_work_handler:473:(pid 0): reclaim fail -60
2024-02-09T12:18:44 Notice kernel mlx5_core0: ERR: reclaim_pages:442:(pid 0): failed reclaiming pages
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: wait_func:965:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: free_comp_eqs:669:(pid 0): failed to destroy EQ 0x8
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: mlx5_destroy_unmap_eq:517:(pid 0): failed to destroy a previously created eq: eqn 8
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: wait_func:965:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: pages_work_handler:473:(pid 0): reclaim fail -60
2024-02-09T12:18:44 Notice kernel mlx5_core0: ERR: reclaim_pages:442:(pid 0): failed reclaiming pages
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: wait_func:965:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: free_comp_eqs:669:(pid 0): failed to destroy EQ 0x7
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: mlx5_destroy_unmap_eq:517:(pid 0): failed to destroy a previously created eq: eqn 7
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: wait_func:965:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: pages_work_handler:473:(pid 0): give fail -60
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: give_pages:373:(pid 0): page notify failed
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: wait_func:965:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: wait_func:965:(pid 0): CREATE_EQ(0x301) timeout. Will cause a leak of a command resource
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: give_pages:352:(pid 0): func_id 0x0, npages 1241, err -60
2024-02-09T12:18:44 Notice kernel mlx5_core0: WARN: wait_func:965:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
2024-02-09T12:18:44 Notice kernel mlx5: Mellanox Core driver 3.7.1 (November 2021)ugen7.1: <Intel EHCI root HUB> at usbus7
2024-02-09T12:18:44 Notice kernel mlx5_core0: <mlx5_core> mem 0xf6000000-0xf7ffffff irq 16 at device 0.0 on pci2
https://redmine.pfsense.org/issues/14180 (not my report but similar)
-
Do you have the actual panic? Crash report?
Those logs are all from the same second. Are you saying it continues those same errors for 10 mins then panics? Never boots further?
-
@stephenw10 Thanks for your reply. Sorry to ask such a beginner question, but where does pfsense put a crash report if the boot process never makes it to the login terminal?
I will post correct ordered dmesg with the time stamps after I reboot a few times. It seems pretty easy to trigger.
-
Hmm well spun up a clean VM to test this out and it crashed while trying to boot the ISO. Hmm.
Second boot & install was able to produce the long timeouts - still not sure where a crash report should be found. The relevant bits are:
Feb 10 23:23:37 pfSense kernel: mlx5_core0: <mlx5_core> mem 0xf8000000-0xf9ffffff irq 16 at device 0.0 on pci2
Feb 10 23:23:37 pfSense kernel: mlx5: Mellanox Core driver 3.7.1 (November 2021)uhub1 on usbus3
Feb 10 23:23:37 pfSense kernel: uhub1: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus3
Feb 10 23:23:37 pfSense kernel: uhub2 on usbus7
Feb 10 23:23:37 pfSense kernel: uhub2: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus7
Feb 10 23:23:37 pfSense kernel: uhub3 on usbus0
Feb 10 23:23:37 pfSense kernel: uhub3: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
Feb 10 23:23:37 pfSense kernel: uhub4 on usbus5
Feb 10 23:23:37 pfSense kernel: uhub4: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus5
Feb 10 23:23:37 pfSense kernel: uhub5 on usbus2
Feb 10 23:23:37 pfSense kernel: uhub5: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus2
Feb 10 23:23:37 pfSense kernel: uhub6 on usbus4
Feb 10 23:23:37 pfSense kernel: uhub6: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus4
Feb 10 23:23:37 pfSense kernel: uhub7 on usbus6
Feb 10 23:23:37 pfSense kernel: uhub7: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus6
Feb 10 23:23:37 pfSense kernel: uhub0: 2 ports with 2 removable, self powered
Feb 10 23:23:37 pfSense kernel: uhub6: 2 ports with 2 removable, self powered
Feb 10 23:23:37 pfSense kernel: uhub5: 2 ports with 2 removable, self powered
Feb 10 23:23:37 pfSense kernel: uhub7: 2 ports with 2 removable, self powered
Feb 10 23:23:37 pfSense kernel: uhub3: 2 ports with 2 removable, self powered
Feb 10 23:23:37 pfSense kernel: uhub4: 2 ports with 2 removable, self powered
Feb 10 23:23:37 pfSense kernel: mlx5_core0: INFO: mlx5_port_module_event:709:(pid 12): Module 0, status: plugged and enabled
Feb 10 23:23:37 pfSense kernel: uhub1: 6 ports with 6 removable, self powered
Feb 10 23:23:37 pfSense kernel: uhub2: 6 ports with 6 removable, self powered
Feb 10 23:23:37 pfSense kernel: cd0 at ahcich1 bus 0 scbus1 target 0 lun 0
Feb 10 23:23:37 pfSense kernel: cd0: <QEMU QEMU DVD-ROM 2.5+> Removable CD-ROM SCSI device
Feb 10 23:23:37 pfSense kernel: cd0: Serial Number QM00003
Feb 10 23:23:37 pfSense kernel: cd0: 150.000MB/s transfers (SATA 1.x, UDMA5, ATAPI 12bytes, PIO 8192bytes)
Feb 10 23:23:37 pfSense kernel: cd0: 834MB (427086 2048 byte sectors)
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: wait_func:965:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: give_pages:352:(pid 0): func_id 0x0, npages 1241, err -60
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: wait_func:965:(pid 0): CREATE_EQ(0x301) timeout. Will cause a leak of a command resource
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: wait_func:965:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: give_pages:373:(pid 0): page notify failed
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: pages_work_handler:473:(pid 0): give fail -60
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: wait_func:965:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: mlx5_destroy_unmap_eq:521:(pid 0): failed to destroy a previously created eq: eqn 7
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: free_comp_eqs:692:(pid 0): failed to destroy EQ 0x7
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: wait_func:965:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
Feb 10 23:23:37 pfSense kernel: mlx5_core0: ERR: reclaim_pages:442:(pid 0): failed reclaiming pages
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: pages_work_handler:473:(pid 0): reclaim fail -60
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: wait_func:965:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: mlx5_destroy_unmap_eq:521:(pid 0): failed to destroy a previously created eq: eqn 8
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: free_comp_eqs:692:(pid 0): failed to destroy EQ 0x8
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: wait_func:965:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: mlx5_destroy_unmap_eq:521:(pid 0): failed to destroy a previously created eq: eqn 9
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: free_comp_eqs:692:(pid 0): failed to destroy EQ 0x9
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: wait_func:965:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: mlx5_destroy_unmap_eq:521:(pid 0): failed to destroy a previously created eq: eqn 10
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: free_comp_eqs:692:(pid 0): failed to destroy EQ 0xa
Feb 10 23:23:37 pfSense kernel: mlx5_core0: ERR: mlx5_load_one:1184:(pid 0): Failed to alloc completion EQs
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: wait_func:965:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: mlx5_destroy_unmap_eq:521:(pid 0): failed to destroy a previously created eq: eqn 6
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: wait_func:965:(pid 0): DEALLOC_UAR(0x803) timeout. Will cause a leak of a command resource
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: up_rel_func:87:(pid 0): failed to free uar index 16
Feb 10 23:23:37 pfSense kernel: mlx5_core0: WARN: wait_func:965:(pid 0): TEARDOWN_HCA(0x103) timeout. Will cause a leak of a command resource
Feb 10 23:23:37 pfSense kernel: mlx5_core0: ERR: mlx5_load_one:1261:(pid 0): tear_down_hca failed, skip cleanup
Feb 10 23:23:37 pfSense kernel: mlx5_core0: ERR: init_one:1675:(pid 0): mlx5_load_one failed -60
Feb 10 23:23:37 pfSense kernel: device_attach: mlx5_core0 attach returned 60
Feb 10 23:23:37 pfSense kernel: pci0:2:0:0: Device leaked IRQ resources
Feb 10 23:23:37 pfSense kernel: pci0:2:0:0: Device leaked 11 MSI-X vectors
Feb 10 23:23:37 pfSense kernel: mlx5_core0: <mlx5_core> mem 0xfa000000-0xfbffffff irq 17 at device 0.1 on pci2
Feb 10 23:23:37 pfSense kernel: mlx5_core0: INFO: mlx5_port_module_event:714:(pid 12): Module 1, status: unplugged
Feb 10 23:23:37 pfSense kernel: mlx5_core0: INFO: init_one:1689:(pid 0): cannot find SR-IOV PCIe cap
Feb 10 23:23:37 pfSense kernel: mlx5_core: INFO: (mlx5_core0): E-Switch: Total vports 1, l2 table size(65536), per vport: max uc(128) max mc(2048)
Feb 10 23:23:37 pfSense kernel: mlx5_core0: Failed to initialize SR-IOV support, error 2
Feb 10 23:23:37 pfSense kernel: mce0: Ethernet address: 50:6b:4b:28:7f:af
Feb 10 23:23:37 pfSense kernel: mce0: link state changed to DOWN
Feb 10 23:23:37 pfSense kernel: mlx5_core1: <mlx5_core> mem 0xf8000000-0xf9ffffff irq 16,35,36,37,38,39,40,41,42,43,44,45 at device 0.0 on pci2
Feb 10 23:23:37 pfSense kernel: mlx5_core1: ERR: mlx5_load_one:1172:(pid 544): enable msix failed
Feb 10 23:23:37 pfSense kernel: mlx5_core1: ERR: init_one:1675:(pid 544): mlx5_load_one failed -6
Feb 10 23:23:37 pfSense kernel: device_attach: mlx5_core1 attach returned 6
Feb 10 23:23:37 pfSense kernel: pci0:2:0:0: Device leaked 11 MSI-X vectors
Feb 10 23:23:37 pfSense kernel: mce0: INFO: mlx5e_open_locked:3255:(pid 483): NOTE: There are more RSS buckets(16) than channels(8) available
No idea why the timestamps are all the same... when I watch the boot on console, each timeout sits there for ~30-60 seconds. I'll post tomorrow when I have time to sit through hours of timeouts...
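For anyone who would rather tally these messages than read them by eye, here is a rough Python sketch. It counts the `wait_func` command timeouts per firmware command and decodes the `attach returned N` codes. The function name `summarize` and the sample text are made up for illustration, and the errno names are hardcoded on the assumption that they follow FreeBSD's numbering (2 = ENOENT, 6 = ENXIO, 60 = ETIMEDOUT).

```python
import re
from collections import Counter

# Assumption: FreeBSD errno numbering. Hardcoded because Python's errno
# module reflects the host OS, which may not be FreeBSD.
FREEBSD_ERRNO = {2: "ENOENT", 6: "ENXIO", 60: "ETIMEDOUT"}

# Matches e.g. "wait_func:965:(pid 0): MANAGE_PAGES(0x108) timeout"
TIMEOUT_RE = re.compile(r"wait_func:\d+:\(pid \d+\): (\w+)\(0x[0-9a-f]+\) timeout")
# Matches e.g. "device_attach: mlx5_core0 attach returned 60"
ATTACH_RE = re.compile(r"device_attach: (\w+) attach returned (\d+)")


def summarize(log_text):
    """Return (timeouts per firmware command, [(device, errno name), ...])."""
    timeouts = Counter(TIMEOUT_RE.findall(log_text))
    attach_failures = [
        (dev, FREEBSD_ERRNO.get(int(code), f"errno {code}"))
        for dev, code in ATTACH_RE.findall(log_text)
    ]
    return timeouts, attach_failures


# A few entries copied from the log above; works on flattened text too,
# since the patterns do not depend on line breaks.
sample = (
    "mlx5_core0: WARN: wait_func:965:(pid 0): MANAGE_PAGES(0x108) timeout. "
    "Will cause a leak of a command resource\n"
    "mlx5_core0: WARN: wait_func:965:(pid 0): CREATE_EQ(0x301) timeout. "
    "Will cause a leak of a command resource\n"
    "mlx5_core0: WARN: wait_func:965:(pid 0): MANAGE_PAGES(0x108) timeout. "
    "Will cause a leak of a command resource\n"
    "device_attach: mlx5_core0 attach returned 60\n"
    "device_attach: mlx5_core1 attach returned 6\n"
)

timeouts, attach_failures = summarize(sample)
print(timeouts)
print(attach_failures)
```

Read this way, the first attach attempt fails with a timeout and the second with "device not configured", which at least matches the order of events in the log.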
-
It should create a crash dump in /var/crash if it boots far enough to recover it from swap.
Are you also running in a VM with the NIC passed through?
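If it helps, a quick way to check is sketched below in Python. It assumes the usual savecore naming (`info.N`, `vmcore.N`, `textdump.tar.N`) and treats bookkeeping files such as `minfree` and `bounds` as not being dumps; the helper name `find_crash_dumps` is just for illustration.

```python
from pathlib import Path

# Assumption: dump artifacts follow the usual savecore file naming.
DUMP_PREFIXES = ("info.", "vmcore.", "textdump.tar.")


def find_crash_dumps(crash_dir="/var/crash"):
    """Return sorted names of files in crash_dir that look like crash dumps."""
    root = Path(crash_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir() if p.name.startswith(DUMP_PREFIXES)
    )


if __name__ == "__main__":
    dumps = find_crash_dumps()
    print(dumps if dumps else "no crash dumps found")
```

An empty result with only `minfree` present would suggest the dump never made it to swap or was never recovered on the next boot.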
-
Yea, unfortunately it seems nothing is making it to /var/crash (except minfree).
This is inside proxmox 8.1.4 as per your guide (non-UEFI), one bridge virtio network device, connectx4 passed through as raw PCIe + all functions.
I don't really understand why the log times are all the same, so I made a video so you can enjoy timeouts as well :)
-
I'm not sure how many times rebooting will convince me this isn't some sort of race condition. For now it's only hard-crashing when booting with the install ISO. From disk there's a >90% chance of a slow boot with mlx5 errors.
While trying to trigger a boot panic I learned that it reliably panics on reboot as well (from CLI - Reboot- full (stop process/remount)). Crash report attached. textdump.tar
-
Hmm, the boot log is just full of issues from that mlx NIC. This looks bad:
mlx5_core1: <mlx5_core> mem 0xfa000000-0xfbffffff irq 17 at device 0.1 on pci2
mlx5_core1: INFO: mlx5_port_module_event:714:(pid 12): Module 1, status: unplugged
mlx5_core1: INFO: init_one:1689:(pid 0): cannot find SR-IOV PCIe cap
mlx5_core: INFO: (mlx5_core1): E-Switch: Total vports 1, l2 table size(65536), per vport: max uc(128) max mc(2048)
mlx5_core1: Failed to initialize SR-IOV support, error 2
If it really doesn't support SR-IOV then performance is going to be...limited! If it should support that then something is misconfigured and somehow hiding it.
mlx5_core1: <mlx5_core> mem 0xfa000000-0xfbffffff irq 17,46,47,48,49,50,51,52,53,54,55,56 at device 0.1 on pci2
mlx5_core1: ERR: mlx5_load_one:1172:(pid 551): enable msix failed
mlx5_core1: ERR: init_one:1675:(pid 551): mlx5_load_one failed -6
device_attach: mlx5_core1 attach returned 6
Without MSI-X throughput will be very limited. Also the interrupt rate from all those irqs would be interesting. Except then it failed to attach at all.
Can you test that running bare metal?
-
IOMMU and SR-IOV are enabled in bios, but I am simply passing through the whole pci device and not using it as SRIOV. Is that not advised?
Despite all the errors, the device still runs at ~25Gb/s.
-
Unfortunately I don't have one of those NICs to test to know what the expected boot output should be.
It's actually passing 25Gbps? What are the VM specs?
-
I think the SR-IOV message about being unable to find the PCIe cap is a red herring here: since I'm not passing the card as SR-IOV (indeed I am passing the entire PCI root for the device), there's no need to fuss with SR-IOV. I'm more concerned with the MSI-X failures and command timeouts, which, aside from causing long boots, are unpredictable and seem non-deterministic. Occasionally this leaves the NIC unusable... but for now a reboot (and a little luck) brings it back up.
Your comment on SR-IOV did give me the idea that perhaps this "new" board has some bios settings proxmox/pfsense doesn't like.
Before going bare-metal I tried disabling SR-IOV and forcing the PCIe port to gen3 (from gen5) in BIOS to see if there was any change... but nothing. In fact, the driver still complains about not finding the SR-IOV PCIe cap... despite it being disabled in BIOS. I guess this is something it's getting from the card firmware/flash. As a second step I reset everything in BIOS to defaults and used Mellanox's official drivers to pass the NIC as an SR-IOV device. I wasn't successful in this... I tried passing one of the "virtual devices" SR-IOV creates, but this only resulted in 100s timeouts for me while trying to initialize the firmware/device or something, so I reverted to the previous semi-working state.
On your final question, it's a little 8 core VM sitting on one of the newer "low power" ryzen chips (7745hx - a laptop chip in reality). Haven't done much benching outside iperf3 due to the instability of pfsense. The results are ...curious?
From the pfsense VM:
[SUM] 0.00-10.01 sec 8.32 GBytes 7.14 Gbits/sec
From proxmox host (via bridge to pfSense LAN device? - why would this be faster?):
[SUM] 9.00-10.00 sec 2.09 GBytes 18.0 Gbits/sec
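As a sanity check on those two lines, the small sketch below recomputes the rates from the transferred bytes. It assumes iperf3 reports GBytes as 1024-based and Gbits/sec as 1000-based; the helper name `gbits_per_sec` is mine. Note also that the first line is the 10-second summary, while the second covers only the single 9.00-10.00 interval, so the two figures are not measured over the same window.

```python
GIB = 2**30  # assumption: iperf3 "GBytes" are 1024-based


def gbits_per_sec(gbytes, seconds):
    """Convert a transfer size in GBytes over a duration to decimal Gbit/s."""
    return gbytes * GIB * 8 / seconds / 1e9


vm_rate = gbits_per_sec(8.32, 10.01)   # pfSense VM, 0.00-10.01 sec summary
host_rate = gbits_per_sec(2.09, 1.00)  # Proxmox host, 9.00-10.00 sec interval

print(f"VM:   {vm_rate:.2f} Gbit/s")
print(f"Host: {host_rate:.2f} Gbit/s")
```

Both come out close to the reported values, so the numbers are at least internally consistent.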
-
@dsouthwi said in Mellanox ConnectX-4 LX causing hard panic on boot intermittently:
From proxmox host (via bridge to pfSense LAN device? - why would this be faster?):
[SUM] 9.00-10.00 sec 2.09 GBytes 18.0 Gbits/sec
Because running iperf at that speed requires significant CPU cycles by itself. It's also single-threaded, so it can only use one of the passed cores.
-