10GB Lan causing strange performance issues, goes away when switched over to 1GB

stephenw10

Hmm, you said you tried setting MTU values but this does feel like it could be a fragmentation issue. A packet capture should show that.

Is the speed equally bad in both directions?

ngr2001

@stephenw10 I captured a PCAP and nothing is jumping out at me. Is there anything specific I should be filtering on or looking for in Wireshark with regard to fragmentation?

ngr2001

I tried the following Wireshark filters:

ip.fragment

ip.flags.mf ==1 or ip.frag_offset gt 0

Both return zero packets, which leads me to believe there is no fragmentation going on.
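For anyone who wants to run the same check outside Wireshark, here is a minimal Python sketch of what those filters test: whether the More Fragments bit is set or the fragment offset is non-zero in the IPv4 header. The names `is_fragment` and `make_header` are made up for illustration, and the sample header values are not taken from the capture discussed here.

```python
import struct


def is_fragment(ipv4_header: bytes) -> bool:
    """Return True if this IPv4 header belongs to a fragmented datagram."""
    if len(ipv4_header) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes")
    # Bytes 6-7 hold 3 flag bits followed by the 13-bit fragment offset.
    (flags_and_offset,) = struct.unpack("!H", ipv4_header[6:8])
    more_fragments = bool(flags_and_offset & 0x2000)  # MF bit
    fragment_offset = flags_and_offset & 0x1FFF       # counted in 8-byte units
    return more_fragments or fragment_offset > 0


def make_header(flags_and_offset: int) -> bytes:
    """Build a minimal 20-byte IPv4 header for illustration (checksum left at 0)."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 1500, 1, flags_and_offset, 64, 6, 0,
        bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]),
    )


if __name__ == "__main__":
    print(is_fragment(make_header(0x4000)))  # Don't Fragment set, not a fragment
    print(is_fragment(make_header(0x2000)))  # More Fragments set
```

A packet with only the Don't Fragment bit set does not count as a fragment, which is why a clean capture returns nothing for these filters.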

stephenw10

@stephenw10 said in 10GB Lan causing strange performance issues, goes away when switched over to 1GB:

Is the speed equally bad in both directions?

This could be telling if it's not.

ngr2001

@stephenw10 Are you suggesting that I send a large file from the pfSense side to a target SFTP server on my LAN and see if it can sustain the same level of performance as my other tests?

stephenw10

Yes. Or just when you test against fast.com do you also see restricted upload? Assuming your WAN is 1G symmetric.

ngr2001

@stephenw10 Ah, sorry, that will not be a good test. I am on cable internet. My download speed is 1Gbps but my upload is only 30Mbps :( so sadly that test will be of no value.

Anything else we can play with or check in the logs? Again, there is no fragmentation in the PCAP; it looks clean. It's like pfSense is just tanking.

I also tried enabling all the hardware offloading options, which were previously disabled. No difference.

ngr2001

This is interesting.

The port on my switch for the client/workstation shows output drops, and the counter goes up rapidly when I run a speed test.

But the 10Gb uplink port to the firewall shows none.

Perhaps the issue is on the Cisco side?

My understanding of the 3650 is that it does not have true flow control support.

ngr2001

To add to this, the total output drops stop once I switch back to the 1Gb LAN connection.

So there is clearly something happening on the Cisco side with the 10Gb SFP+ connection, in that all the client ports register output drops.
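To compare the two link speeds with numbers rather than by eye, one option is to sample the drop counter before and after a speed test and compute a rate. The sketch below is only that, a sketch: it assumes the interface output contains a line of the form "Total output drops: N", the helper names are hypothetical, and any sample text fed to it would need to come from your own switch.

```python
import re

# Assumption: the counter appears as "Total output drops: <number>" in the text.
DROPS_RE = re.compile(r"Total output drops:\s*(\d+)")


def total_output_drops(show_interface_text: str) -> int:
    """Extract the total output drops counter from interface status text."""
    match = DROPS_RE.search(show_interface_text)
    if match is None:
        raise ValueError("no 'Total output drops' counter found")
    return int(match.group(1))


def drops_per_second(before: str, after: str, interval_s: float) -> float:
    """Average drop rate between two samples taken interval_s seconds apart."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (total_output_drops(after) - total_output_drops(before)) / interval_s
```

Taking one sample pair on the 10Gb uplink and another on the 1Gb uplink, each around the same speed test, would show whether the drops really disappear or just slow down.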

stephenw10

Hmm, that is curious. You would not think the 10G link should make any difference there. The total rate is still limited by the incoming WAN to less than the 1G link to the client.

But it does start to look like an issue between the switch and client I agree. Try testing from a different client or different NIC type.

I would also try enabling whatever flow control the switch does have. At least as a test.

ngr2001

This article seems to describe my issue.
https://www.cisco.com/c/en/us/support/docs/switches/catalyst-3850-series-switches/200594-Catalyst-3850-Troubleshooting-Output-dr.html

So far I have tried disabling QoS on all ports on the switch and the performance has since doubled: I am getting 600Mbps now as opposed to 300Mbps. I am still seeing output drops, but not as many, so I am getting closer. I am at least happy and convinced this is purely a Cisco switch issue and not a pfSense bug.

The article is a little confusing, but I will see if what they recommend does the trick.

stephenw10

Ah, nice. Yeah I would never have suspected that, good catch!

lnguyen

@ngr2001 This was discussed 3+ years ago in this thread.

This is a TCP flow control negotiation issue that exists somewhere upstream from the 1GbE LAN client. For me, I am unsure if this is pfSense or the Comcast Cable modem. One way to deal with this is using ethernet flow control but it is an ugly sledgehammer solution.

The Cisco solution is to put this in your 3850 config to increase the buffers for the switch ports that are suffering from output drops:

qos queue-softmax-multiplier 1200

stephenw10

Hmm, hard to see how TCP flow control could lead to packet drops from a switch...

Unless the client fills its buffers and can no longer accept packets maybe...

lnguyen

@stephenw10 That is exactly why. TCP flow control negotiation between the source and destination should prevent this from occurring, but something is stopping it from working. L2 flow control works but is a stupid blunt hammer.

stephenw10

Hmm. Well also hard to see how either pfSense or the switch could have any effect on TCP flow between the client and server...

Other than perhaps with the 1G link the flow is sufficiently restricted that the TCP control never comes into effect.

lnguyen

@stephenw10 A basic query to ChatGPT or Gemini gives you a similar response. Plus I already dealt with this while working at Fortune #1/2 with your appliances.

AI Overview
A firewall can potentially disrupt TCP flow control by inspecting and modifying packets in a way that interferes with the mechanisms used to manage data transmission rates between two devices, potentially leading to data loss or congestion issues if not configured properly.
How a firewall could break TCP flow control:
Packet filtering based on TCP flags:
If a firewall aggressively filters packets based on TCP flags like SYN, FIN, or ACK, it could inadvertently drop essential packets used for flow control, like the "window update" packets which signal how much data the receiver can accept.
Deep packet inspection (DPI):
If a firewall performs deep packet inspection on TCP data, it might modify the data stream in a way that alters the TCP sequence numbers, causing confusion in the flow control mechanism.
State-based inspection limitations:
While stateful firewalls track TCP connections, they might not always accurately interpret complex flow control scenarios, potentially causing issues when dealing with large data transfers or dynamic window sizes.
Incorrect configuration:
Misconfigured firewall rules, like overly strict filtering or improper flow control settings, can lead to unintended disruptions in TCP traffic management.
Potential consequences of a firewall disrupting TCP flow control:
Packet loss:
If a firewall drops important flow control packets, the sender might continue sending data faster than the receiver can handle, resulting in data loss.
Congestion:
When flow control is disrupted, network congestion can occur as senders continue to transmit data without receiving proper feedback about the receiver's capacity.

stephenw10

Mmm, none of that should apply to pfSense unless a user has added complex custom rules or packages.

Potentially there could be some TCP flag sequence that pf doesn't see as legitimate. But I'd imagine that would break a lot of things. We'd be seeing floods of tickets. And pf would log that as blocked packets in the firewall logs (unless that has been disabled).

lnguyen

@stephenw10 All I can tell you is that with an OOB Netgate XG-1537 appliance configured with just the wizard using the two SFP+ ports for 10GbE WAN & 10GbE LAN downlinked to a Cisco Catalyst 9300 mGig switch, the 10GbE clients have no issues with output drops--but the 1GbE clients do due to lack of TCP flow control working when traffic flows through the pfSense.

iPerf3 on the LAN between 10GbE and 1GbE clients shows no output drops. Even running iPerf3 between the pfSense LAN interface and a 1GbE client shows no output drops. Run iPerf3 through the pfSense to other corporate and DC servers and speeds drop depending on the buffer of the switch. Larger switches like the Cisco 4500E and 9400 don't exhibit this issue, but smaller 1RU switches like the 3850X/9300 do because of their small buffers, which can be overcome with the command I provided earlier.

Move the 3850X/9300 switch out from behind the XG-1537 and both 10GbE and 1GbE clients hitting Ookla Speedtest across the internet get 9.4Gbps and 940Mbps respectively. The same 1GbE clients hitting iPerf3 servers in other areas of the corporate LAN or DCs also show the full 940Mbps bidirectionally.
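The small-buffer argument can be put into rough numbers. This is a back-of-the-envelope Python sketch using a simple fluid model: a burst arrives at the uplink rate while the client port drains at its own rate, and whatever does not fit in the egress buffer is dropped. The function names, burst size and buffer size are illustrative assumptions, not measured values or real Catalyst buffer figures.

```python
def buffer_needed_bytes(burst_bytes: int, ingress_bps: int, egress_bps: int) -> float:
    """Peak queue depth for one burst arriving at ingress_bps, drained at egress_bps."""
    if ingress_bps <= 0 or egress_bps <= 0:
        raise ValueError("rates must be positive")
    if egress_bps >= ingress_bps:
        return 0.0  # the port drains at least as fast as the burst arrives
    # While the burst arrives, the queue grows at (ingress - egress).
    return burst_bytes * (ingress_bps - egress_bps) / ingress_bps


def bytes_dropped(burst_bytes: int, ingress_bps: int, egress_bps: int,
                  buffer_bytes: int) -> float:
    """Bytes lost from one burst when the egress buffer holds buffer_bytes."""
    needed = buffer_needed_bytes(burst_bytes, ingress_bps, egress_bps)
    return max(0.0, needed - buffer_bytes)


if __name__ == "__main__":
    burst = 1_000_000   # hypothetical 1 MB burst
    buf = 500_000       # hypothetical per-port buffer share
    print(buffer_needed_bytes(burst, 10_000_000_000, 1_000_000_000))
    print(bytes_dropped(burst, 10_000_000_000, 1_000_000_000, buf))
    print(bytes_dropped(burst, 1_000_000_000, 1_000_000_000, buf))
```

Under this model the same burst needs no queueing at all when it arrives at 1Gbps, which is consistent with the drops disappearing on a 1GbE uplink even though the average rate is unchanged.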

lnguyen

@stephenw10 said in 10GB Lan causing strange performance issues, goes away when switched over to 1GB:

We'd be seeing floods of tickets. And pf would log that as blocked packets in the firewall logs (unless that has been disabled).

On this point: what percentage of pfSense users have, say, 2/5/10Gbps Internet coming into the pfSense WAN interface and a mixture of 10GbE and 1GbE clients downstream? I would think that is still a niche. However, I deployed several dozen of your appliances along with other security appliances at the fruit company. Connecting your appliances to the WAN at 1GbE would always resolve the issue. Connecting them at 10GbE or greater always produced the same outcome for 1GbE clients downstream. I am not criticizing you, but this comes from over 3 decades of network engineering experience with all types of firewalls from Cisco, PAN, Fortinet and Netgate. They can all create the same issues.