pfSense unresponsive during and for several seconds after an iperf3 test?

Tantamount

Hello friendly people!

I've recently been upgrading bits and pieces of my network, most recently moving the backhaul between two switches to 10GbE over fiber via the SFP+ ports.

My pfSense (2.6.0) router uses 2.5GbE UTP NICs; the one connected to the switch plugs into an SFP+ port through an RJ45 adapter.

The testing computer has a 10GbE connection (fiber) at the remote switch.

If I run an iperf3 test from that machine, pfSense immediately becomes unresponsive: pings aren't returned (not just to that machine, but to any). If I abort the test, it takes several seconds for pfSense to start responding again and go back to normal.

I'm trying to determine what is going on here. Is it hardware? Unstable drivers? Something about TCP/IP not handling the speed mismatch?

The pfSense unit has 32 GB of RAM and the dashboard is reporting these CPU details:
Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
4 CPUs: 1 package(s) x 4 core(s)

The resource usage numbers are normally in single digit percentages, and the one thing that could have been suspect (Suricata) is disabled.

I believe, but need to verify, that these dmesg events occur during the iperf3 tests:
igc1: link state changed to DOWN
igc1: link state changed to UP

Driver crash? Switch reset?

TCP/IP is supposed to handle mismatches like this automatically, right? Drop packets, adjust window sizes, and otherwise tell the client to throttle?

Switch details:
At the pfSense device: QNAP QSW-M408S: 8 x 1GbE RJ45 ports, 4 x SFP+ 10GbE ports
At the client device: QNAP QSW-M2106-4S: 6 x 2.5GbE RJ45 ports, 4 x SFP+ 10GbE ports

If I run a similar test from a client on a 2.5GbE port of the same switch as the 10GbE machine, it works fine and I get the expected 2.3+ Gbit/s.

Given that pfSense becomes unresponsive during these tests, I'm going to try connecting to the console port so that I can look at more details in real time. If anyone has any idea what's going on here, or which binaries I can run from the console to capture it, I would appreciate it!

Tantamount

I just remembered: the 2.5GbE NIC that attaches to the switch's SFP+ port is using a 10GbE transceiver. I wonder if it shows as 10GbE in the switch. Maybe the switch can't negotiate down to 2.5GbE because it's not a 2.5GbE transceiver? It's just weird because it otherwise works. Back in the day, if I tried forcing a 100Mbit NIC to 1GbE, it just wouldn't work at all.

stephenw10

I assume igc1 is the NIC connected to the switch?

Nothing else logged in pfSense once it becomes available again?

Are you testing to iperf running in pfSense directly?

Check mac stats in: sysctl dev.igc.1

Steve

Tantamount

@stephenw10

Hi Steve,

Yes, igc1 is connected to the switch.

Yeah, I've got the iperf package installed and running in server mode.

Thanks for that sysctl command, lots of stats! I'll need to run it before and after a test to see which values change.
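
For the before/after comparison, a rough Python sketch like the one below could do the diffing. It assumes the output of `sysctl dev.igc.1` has been saved to two text files (one before the test, one after) and that each line has the usual `name: value` form. The function names and the file names in the usage note are made up for illustration.

```python
import sys


def parse_sysctl(text):
    """Parse 'name: value' lines from sysctl output into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if not sep:
            continue  # not a 'name: value' line
        try:
            counters[name.strip()] = int(value.strip())
        except ValueError:
            continue  # skip non-numeric entries such as description strings
    return counters


def changed_counters(before_text, after_text):
    """Return {name: delta} for every counter whose value differs between the snapshots."""
    before = parse_sysctl(before_text)
    after = parse_sysctl(after_text)
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python3 sysctl_diff.py before.txt after.txt  (hypothetical file names)
    with open(sys.argv[1]) as f_before, open(sys.argv[2]) as f_after:
        deltas = changed_counters(f_before.read(), f_after.read())
    for name, delta in sorted(deltas.items()):
        print(f"{name}: {delta:+d}")
```

The snapshots themselves could be taken with something like `sysctl dev.igc.1 > before.txt` before the test and `sysctl dev.igc.1 > after.txt` once pfSense responds again. Counters that only moved a little are probably normal traffic; the interesting ones would be anything that jumps during the outage.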

After watching the system log overnight, I noticed these would happen on occasion even when not testing:
igc1: link state changed to DOWN
igc1: link state changed to UP
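
To put a number on how often the link flaps, a small Python sketch like this could tally the events from a saved copy of the log. It assumes the messages look exactly like the two lines above; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches messages such as "igc1: link state changed to DOWN"
LINK_RE = re.compile(r"(\w+): link state changed to (UP|DOWN)")


def count_link_changes(log_text):
    """Count link state changes per (interface, state) pair in the given log text."""
    counts = Counter()
    for match in LINK_RE.finditer(log_text):
        counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "igc1: link state changed to DOWN\n"
        "igc1: link state changed to UP\n"
    )
    for (iface, state), n in sorted(count_link_changes(sample).items()):
        print(f"{iface} {state}: {n}")
```

Running it over the log from before and after a cabling change would show whether the flaps really stopped.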

Since moving the connection to an RJ45 1GbE port, that stopped.

I'm pretty certain at this point that the problem is with the transceiver and/or the SFP+ port not natively supporting 2.5GbE (the switch docs only show 1/10 for those types of ports).

I've purchased a replacement system with SFP+ ports that can handle 10GbE and will report back. It uses the same CPU and will actually "only" have 16 GB of RAM, so it should be a good test of whether this problem was a resource constraint issue.

stephenw10

It's unlikely you're using anything anywhere near 16GB unless there is a serious memory leak somehow. That should be pretty obvious from the monitoring graphs.