Bug - Mellanox MT26448
-
I bashed my skull against this problem for a few days thinking it was something I did, but it looks to be a bug in the driver for the Mellanox 'MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s]' card. It may only come up in this particular configuration; I have not yet tested with different uplink hardware (though if needed I'll build another host and test it locally).
I have one port connected to my local switch at 10G. Speeds over sftp are just over 100MB/sec, which is not as good as it could be and is probably being limited by something else. I don't have another machine set up to confirm this with iperf at the moment, but it is probably not relevant to this bug. The other port is connected to a Bell GigaHub, the ONU/router where the 10GbaseT-SR module is integrated and not swappable (or I bet this wouldn't be an issue at all).
Now this is where debugging got bizarre. Of course I was trying to configure it to use the 10G port on the ONU and their bridge mode (or the PPPoE pass-thru). It worked at first, but in bridge mode it would randomly fail to work properly. Doing PPPoE pass-thru from pfSense worked fine but seemed slow. I tried every optimization I could find. Possibly one of them has caused the issue, though none seem like they should; I'll be rolling them back to be sure, but the performance issue predated the tuning.
I got a bit of time last week to look at it. I was getting 300Mbit-500Mbit, and considering the previous link (Xplornet Wireless) was lucky to give 10Mbit on a good day, I wasn't concerned enough to put aside other high-priority tasks until last weekend. When I swapped to one of the 1G ports rather than the 10G port on the GigaHub, everything worked as expected: bridge mode and PPPoE pass-thru both gave wirespeed, which is two to three times the speed I was getting on the 10G port. I tried a wide array of fixes but nothing helped; all of them made the 10G port's reliability even worse or left it non-functional.
This morning I realized I could just plug one of my backup Linux servers directly into the 10G port to test. It has the single-port version of the card, which reports the same in pciconf -lv output. That machine was able to push 3Gbit/3Gbit across the GigaHub both with and without PPPoE.
Is this a known issue? Can it be fixed? What debugging information can I provide to help? And finally, would switching to a different NIC solve it? Obviously just building a new firewall based on Linux would solve it, but I'd prefer not to have to at this point for a variety of reasons.
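In case it helps as a starting point, here is a minimal sketch of how a first round of information could be gathered on the firewall. The command list is only an assumption (generic FreeBSD tools, nothing specific to the Mellanox driver), and the helper names `run_command` and `collect` are hypothetical; anything a developer asks for could be added to the list.

```python
import shutil
import subprocess
import sys

# Assumed set of generic FreeBSD diagnostics; adjust as requested.
COMMANDS = [
    ["uname", "-a"],     # OS and kernel version
    ["pciconf", "-lv"],  # PCI devices, including the NIC
    ["ifconfig", "-a"],  # interface state and media
    ["netstat", "-i"],   # per-interface packet and error counters
    ["dmesg"],           # kernel messages, including driver output
]


def run_command(argv, timeout=30):
    """Return the command's stdout and stderr, or a note if it cannot run."""
    if shutil.which(argv[0]) is None:
        return f"(not found: {argv[0]})"
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True,
            errors="replace", timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"(failed: {exc})"
    return result.stdout + result.stderr


def collect(commands=COMMANDS):
    """Run each command and join the outputs under '### <command>' headers."""
    sections = []
    for argv in commands:
        sections.append("### " + " ".join(argv))
        sections.append(run_command(argv).rstrip())
    return "\n".join(sections) + "\n"


if __name__ == "__main__":
    sys.stdout.write(collect())
```

Redirecting the output to a file gives a single report to attach. Commands that are missing on a given host are noted rather than treated as errors, so the same script can be run on the Linux box for comparison.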