Netgate Discussion Forum

    Mellanox ConnectX-4 LX causing hard panic on boot intermittently

    General pfSense Questions
    14 Posts 3 Posters 2.4k Views
    • stephenw10S
      stephenw10 Netgate Administrator
      last edited by

      It should create a crash dump in /var/crash if it boots far enough to recover it from swap.
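
      For reference, a quick way to check from the pfSense shell whether anything was ever written to the dump device (the swap device path below is only an example - use whatever dumpon -l reports):

      dumpon -l                                # which device is configured for kernel dumps
      ls -lh /var/crash                        # anything recovered by savecore at boot?
      savecore -v /var/crash /dev/gpt/swapfs   # manual recovery attempt (example device path)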

      Are you also running in a VM with the NIC passed through?

      • D
        dsouthwi @stephenw10
        last edited by dsouthwi

        Yea, unfortunately it seems nothing is making it to /var/crash (except minfree).

        This is inside Proxmox 8.1.4 as per your guide (non-UEFI): one bridged virtio network device, and the ConnectX-4 passed through as a raw PCIe device with all functions.
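
        For reference, raw passthrough of the whole card in Proxmox boils down to a hostpci entry like the sketch below (VM ID and PCI address are examples; leaving off the function number is what passes all functions, and pcie=1 needs the q35 machine type):

        # on the Proxmox host - IDs and addresses are examples
        qm set 101 --hostpci0 0000:02:00,pcie=1
        # resulting line in /etc/pve/qemu-server/101.conf:
        # hostpci0: 0000:02:00,pcie=1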

        I don't really understand why the log times are all the same, so I made a video so you can enjoy timeouts as well :)

        [YouTube video]

        • D
          dsouthwi @stephenw10
          last edited by

          I'm not sure how many more reboots it will take to convince me this isn't some sort of race condition. For now it only hard-crashes when booting from the install ISO. Booting from disk gives a >90% chance of a slow boot with mlx5 errors.

          While trying to trigger a boot panic I learned that it reliably panics on reboot as well (from the CLI: Reboot - full (stop process/remount)). Crash report attached: textdump.tar
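
          For anyone following along, the textdump tarball can be unpacked and read directly; the panic string and backtrace live in the ddb output:

          tar -xvf textdump.tar
          cat panic.txt      # one-line panic reason
          less ddb.txt       # backtrace / ddb capture
          less msgbuf.txt    # kernel message buffer leading up to the panic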

          • stephenw10S
            stephenw10 Netgate Administrator
            last edited by

            Hmm, the boot log is just full of issues from that mlx NIC. This looks bad:

            mlx5_core1: <mlx5_core> mem 0xfa000000-0xfbffffff irq 17 at device 0.1 on pci2
            mlx5_core1: INFO: mlx5_port_module_event:714:(pid 12): Module 1, status: unplugged
            mlx5_core1: INFO: init_one:1689:(pid 0): cannot find SR-IOV PCIe cap
            mlx5_core: INFO: (mlx5_core1): E-Switch: Total vports 1, l2 table size(65536), per vport: max uc(128) max mc(2048)
            mlx5_core1: Failed to initialize SR-IOV support, error 2
            

            If it really doesn't support SR-IOV then performance is going to be...limited! If it should support that then something is misconfigured and somehow hiding it.
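
            One way to confirm on the Proxmox host whether the card actually exposes the SR-IOV capability (PCI address is an example):

            lspci -vvv -s 02:00.0 | grep -A4 'SR-IOV'
            cat /sys/bus/pci/devices/0000:02:00.0/sriov_totalvfs   # file only exists if the kernel sees the SR-IOV cap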

            mlx5_core1: <mlx5_core> mem 0xfa000000-0xfbffffff irq 17,46,47,48,49,50,51,52,53,54,55,56 at device 0.1 on pci2
            mlx5_core1: ERR: mlx5_load_one:1172:(pid 551): enable msix failed
            mlx5_core1: ERR: init_one:1675:(pid 551): mlx5_load_one failed -6
            device_attach: mlx5_core1 attach returned 6
            

            Without MSI-X, throughput will be very limited, and the interrupt rate from all those IRQs would be interesting to see. Except it then failed to attach at all.
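
            A rough way to check from the pfSense shell whether MSI-X ever came up (device names will differ):

            pciconf -lvc | grep -B2 -A6 mlx5    # look for a 'MSI-X supports N messages' capability line
            vmstat -i | grep mlx                # multiple per-queue vectors listed here means MSI-X is in use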

            Can you test that running bare metal?

            • D
              dsouthwi @stephenw10
              last edited by

              IOMMU and SR-IOV are enabled in the BIOS, but I am simply passing through the whole PCI device and not using SR-IOV. Is that not advised?
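
              For what it's worth, whole-device passthrough can be sanity-checked on the Proxmox host with something like this (just a sketch):

              dmesg | grep -i -e iommu -e amd-vi      # confirm the IOMMU is actually active
              find /sys/kernel/iommu_groups/ -type l  # the NIC and all its functions should sit in their own group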

              Despite all the errors, the device still runs at ~25Gb/s.

              • stephenw10S
                stephenw10 Netgate Administrator
                last edited by

                Unfortunately I don't have one of those NICs to test to know what the expected boot output should be.

                It's actually passing 25Gbps? What are the VM specs?

                • D
                  dsouthwi @stephenw10
                  last edited by dsouthwi

                  I think the SR-IOV "cannot find PCIe cap" message is a red herring here - since I'm not passing the card through as SR-IOV (indeed, I am passing the entire PCI root for the device) there's no need to fuss with SR-IOV. I'm more concerned with the MSI-X failures and command timeouts, which aside from causing long boots are unpredictable and seem non-deterministic. Occasionally this leaves the NIC unusable... but for now a reboot (and a little luck) brings it back up.

                  Your comment on SR-IOV did give me the idea that perhaps this "new" board has some BIOS settings Proxmox/pfSense doesn't like.
                  Before going bare metal I've tried disabling SR-IOV and forcing the PCIe port to Gen3 (from Gen5) in the BIOS to see if there is any change... but nothing. In fact, the driver still complains about not finding the SR-IOV PCIe cap... despite it being disabled in the BIOS. I guess this is something it's getting from the card firmware/flash.
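
                  If the setting really lives on the card, the ConnectX firmware has its own SR-IOV switch that Mellanox's mlxconfig (from the MFT tools) can show; roughly like this, with the device path being an example:

                  mst start
                  mlxconfig -d /dev/mst/mt4117_pciconf0 query | grep -E 'SRIOV_EN|NUM_OF_VFS'
                  # to change it in firmware (takes effect after a cold reboot):
                  # mlxconfig -d /dev/mst/mt4117_pciconf0 set SRIOV_EN=0 NUM_OF_VFS=0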

                  As a second step I tried re-enabling everything to defaults in the BIOS and using Mellanox's official drivers to pass the NIC through as an SR-IOV device. I wasn't successful with this... I tried passing one of the "virtual devices" SR-IOV creates, but this only resulted in 100s timeouts while trying to initialize the firmware/device or something, so I reverted to the previous semi-working state.
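
                  For the record, the VF attempt amounted to roughly this on the Proxmox host (interface name and VF count are examples), after which the virtual functions appear as separate PCI devices that can be passed to the VM:

                  echo 2 > /sys/class/net/enp2s0f0/device/sriov_numvfs
                  lspci | grep -i 'virtual function'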

                  On your final question: it's a little 8-core VM sitting on one of the newer "low power" Ryzen chips (a 7745HX - really a laptop chip). I haven't done much benchmarking outside iperf3 due to the instability of pfSense. The results are... curious?

                  From the pfSense VM:

                  [SUM]   0.00-10.01  sec  8.32 GBytes  7.14 Gbits/sec
                  

                  From the Proxmox host (via a bridge to the pfSense LAN device - why would this be faster?):

                  [SUM]   9.00-10.00  sec  2.09 GBytes  18.0 Gbits/sec 
                  
                  • stephenw10S
                    stephenw10 Netgate Administrator @dsouthwi
                    last edited by

                    @dsouthwi said in Mellanox ConnectX-4 LX causing hard panic on boot intermittently:

                    From the Proxmox host (via a bridge to the pfSense LAN device - why would this be faster?):

                    [SUM] 9.00-10.00 sec 2.09 GBytes 18.0 Gbits/sec

                    Because running iperf at that speed requires significant CPU cycles by itself. It's also single-threaded, so it can only use one of the passed cores.
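
                    One way to take that single-thread limit out of the picture is to run several iperf3 instances on different ports and drive them in parallel (addresses, ports and durations are arbitrary):

                    # server side: one listener per port
                    iperf3 -s -p 5201 & iperf3 -s -p 5202 &
                    # client side: one process per port, run together
                    iperf3 -c 192.168.1.1 -p 5201 -P 4 -t 30 &
                    iperf3 -c 192.168.1.1 -p 5202 -P 4 -t 30 &
                    wait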

                    • R
                      rpm5099
                      last edited by

                      @stephenw10 where can someone find this "guide" spoken of here - I am very interested

                      • stephenw10S
                        stephenw10 Netgate Administrator
                        last edited by

                        https://docs.netgate.com/pfsense/en/latest/recipes/virtualize-proxmox-ve.html
