Peculiar throughput problem pfSense to pfSense
-
@stephenw10 incidentally: I noticed that when I start the transfer, SCP throughput goes up for half a second to about 300KB/s and then dwindles all the way to 0 over the next 10 secs. It briefly claims the transfer is actually stalled before it kicks in again and now is somewhat stable at around 300KB/s.
Is this “stalling” what causes the TCP sliding window to become useless in terms of throughput?
-
Yes, the TCP window is likely being significantly affected by.....something.
What happens if you SCP from some client behind the 2100 to the 6100? Or the other way? Is it one end specifically that can be shown to be causing the problem?
Or even better can you test from some third location to each independently?
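As a rough illustration of why a stalled/collapsed window is so punishing over a WAN: TCP throughput is bounded by window size divided by round-trip time. The numbers below are assumptions for illustration, not measurements from this thread, but an assumed ~30 ms site-to-site RTT and a send window collapsed to ~8 KB land almost exactly on the ~300 KB/s plateau described above:

```shell
# Back-of-envelope window maths (all numbers are assumptions, not measured):
# maximum throughput ≈ window / RTT.
rtt_ms=30          # assumed site-to-site round-trip time
window_bytes=8192  # example of a collapsed send window after a stall
awk -v w="$window_bytes" -v r="$rtt_ms" \
    'BEGIN { printf "%.0f KB/s\n", w / (r / 1000) / 1024 }'
```

With those inputs this prints roughly 267 KB/s, so a sender that keeps shrinking its window after stalls would plateau right around the observed rate regardless of line capacity.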
-
@stephenw10 Okay, I have gone to town on testing now. All testing is done to the public IP on each site's pfSense, i.e. no IPsec unless otherwise noted.
It seems I did make one little mistake in my earlier reported throughput numbers concerning client-to-client throughput. Anyhow, here are the findings:
Site A Client to Site B 2100 = 12 MB/s
Site A Client to Site B Client = 7 MB/s (done inside IPsec tunnel)
Site X Client to Site B 2100 = 13 MB/s
Site B Client to Site A 6100 = 7 MB/s
Site B Client to Site A Client = 25 MB/s (done inside IPsec tunnel)
Site X Client to Site A 6100 = 75 MB/s
Site A 6100 to Site B 2100 = 7 MB/s
Site B 2100 to Site A 6100 = 300 KB/s
Hard to draw any conclusions from these numbers, except that something is REALLY wrong when the 2100 itself has to send a large number of full data packets.
But it's noteworthy that Site B struggles to reach the 300 Mbit line capacity in either direction during these tests. That's likely latency playing its part, though.
-
Were you able to test from the 2100 to Site X? Is that similarly throttled?
-
@stephenw10 Hmm, I don't have immediate access to such a test, but let me see what I can do...
-
Mmm, so the outlier there seems to be the 2100 itself sending traffic. Which is odd.
How is that 2100 configured? Anything unusual? Still using mvneta0 as WAN?
-
@stephenw10 Did a test from the Site B 2100 to a completely unrelated pfSense on Site X, which has about the same round-trip latency.
It shows identical behavior to a transfer to the Site A 6100, that is, about 300 KB/s throughput. I'm quite happy that's the case, since we then know it's not routing or some other config specific to my Site A setup.
To answer your question: I think the 2100 setup is very "standard" apart from WAN being a tagged VLAN (mvneta0.803), and the WAN connection being a GPON SFP that bridges Ethernet to GPON. Obviously not standard, but working completely as expected from the clients' point of view.
Since clients have the expected throughput up/down, and the firewall is doing NAT (one public IP only), the traffic is sourced identically on the outside, so how could the GPON SFP be the culprit?
-
Yup, that does seem to narrow it down to the 2100, or at least to something at that site or connection.
Is that a GPON SFP module?
-
@stephenw10 Yes, it's an fs.com module. I have been using it for a couple of years without issues, apart from this one, apparently. I just haven't noticed the problem before because I never had the need to transfer large files directly out of the pfSense box itself.
Actually, the pfSense config is more or less identical to the Site A 6100's, apart from the fact that it uses a different VLAN on WAN and has a standard BiDi Ethernet SFP instead.
What do you think is the next order of business? A packet capture of the SCP file transfer session setup? Perhaps it will show us something when it starts, then stalls before resuming at a steady 300KB/s?
-
Yup, packet capture the throttled traffic. It's so extreme I'd expect to see some pretty obvious issues.
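For reference, a minimal capture sketch for this (interface name mvneta0.803 as discussed earlier in the thread; the remote host address is a placeholder you would swap for the actual far-end public IP):

```shell
# Capture the SCP (SSH) session on the WAN VLAN interface; -s 128 keeps the
# snaplen small since we only need headers, -w writes a pcap to open in
# Wireshark later. 203.0.113.10 is a placeholder for the far-end address.
tcpdump -i mvneta0.803 -s 128 -w /tmp/scp-stall.pcap \
    'tcp port 22 and host 203.0.113.10'
```

In the resulting pcap, retransmissions, duplicate ACKs, and zero-window advertisements around the stall points are the things to look for.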
-
@stephenw10 Yup, and there is… Seems I’m suffering an upstream (from the 2100) packet loss problem when the transmission speed ramps up.
Quite interesting that the clients seem to handle that loss with much less consequence for overall throughput. They are Linux and Windows clients.
I’ll be looking into my options for tuning or replacing the GPON SFP…..
-
Mmm, the TCP congestion control in pfSense is nothing special because it's tuned for forwarding, not for acting as a TCP endpoint. You may be hitting that in some unusually extreme way!
That might also explain why you see problems across the tunnel too, since, presumably, the tunnel is also lossy.
-
@stephenw10 Hmm, well, I took a look at packet loss in general (from my monitoring systems), and there actually is none: < 0.0001%.
The thing is, this site rarely uses its upstream bandwidth, and when it does, it's always from WiFi clients. The site has older WiFi 6 APs with a best-case max bandwidth of slightly less than 400 Mbps. This is more or less what the GPON link is (about 360/360).
So now I'm starting to think: is the issue really that the GPON bridge lacks buffers? Since the WiFi speed is more or less the same as the GPON rate, buffer drops rarely happen, whereas pfSense itself thinks it's a Gbit Ethernet link (SFP), so it initially pushes way too many packets, causing lots of buffer drops.
If so, could I create some limiter/bandwidth-shaping policy to remediate that?
-
Yes, you could create some outbound Limiters and use floating rules to capture that traffic. Or just add an ALTQ-based shaper queue as default on WAN with a limit below 360 Mbps.
It would be a good test either way.
Or try testing from a wired client behind the 2100 which should easily hit the issue if it exists.
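For reference, a sketch of what such an outbound cap boils down to. pfSense Limiters are built on dummynet, so this is roughly the configuration the GUI generates; the pipe number and bandwidth figure are assumptions for this example, and the supported way to do it is still the Limiters GUI:

```shell
# Config sketch only (dummynet via dnctl, as shipped with pfSense):
# cap egress below the ~360 Mbit GPON rate so queuing happens in pfSense,
# where it can be managed, rather than in the GPON module's tiny buffer.
dnctl pipe 1 config bw 340Mbit/s
dnctl pipe 1 show    # verify the pipe exists and shows the configured rate
```

The pipe then has to be applied to outbound WAN traffic, which in pfSense is done with a floating firewall rule referencing the Limiter.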
-
@stephenw10 I will run a test with a wired client later tonight.
Does the 2100 mvneta interface support ALTQ? I cannot seem to locate it in the supported interfaces list.
-
Oh, good point!.... Yes, it is supported, so you should be good to test either shaper type.
It is in the list in Plus but not in CE.
-
@stephenw10 Right.... Sooo, the plot thickens :-(
I cannot replicate the issue, nor the packet loss, from internal clients. Any internal client on Site B that connects to the public IP of the Site A 6100 (no IPsec) transfers the file without packet loss. The speed of the wired client starts out faster than the wireless clients', but very quickly tapers and settles at 7 MB/s throughput, which is similar to the wireless clients.
So now I'm at a complete loss... This seems to suggest that something pfSense itself does under heavy WAN access causes the interface or GPON to drop packets that pfSense believes it has transmitted. But the same thing does not happen to packets it forwards....
Incidentally, the wired client showed that my WAN speed is actually 450 Mbps symmetrical. I can consistently get those numbers in different tests from a wired client. The 360 Mbps I reported was obviously capped by the wireless, then.
This is just baffling.....
Should I try to implement ALTQ with CoDel to see if it makes a difference?
-
Yes I would try using codel.
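A sketch of what a CoDel-flavoured Limiter configures under the hood (dummynet with the FQ-CoDel scheduler; the numbers and the pipe/scheduler IDs here are assumptions, and in practice you would build this from the Limiters GUI rather than by hand):

```shell
# Config sketch only: a dummynet pipe capped just under the measured
# ~450 Mbit line rate, with FQ-CoDel as the scheduler so bulk flows get
# their queues drained fairly and standing buffers are kept short.
dnctl pipe 1 config bw 430Mbit/s
dnctl sched 1 config pipe 1 type fq_codel
dnctl queue 1 config sched 1
```

As above, the queue is then attached to outbound WAN traffic via a floating rule.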
But it still 'feels' like a TCP issue from the 2100 directly. Especially since that would also apply to traffic going over the VPN.
You might try setting one of the other congestion control algorithms like:
```
[25.03-BETA][admin@2100-3.stevew.lan]/root: kldload cc_vegas
[25.03-BETA][admin@2100-3.stevew.lan]/root: sysctl net.inet.tcp.cc
net.inet.tcp.cc.vegas.beta: 3
net.inet.tcp.cc.vegas.alpha: 1
net.inet.tcp.cc.abe_frlossreduce: 0
net.inet.tcp.cc.abe: 0
net.inet.tcp.cc.hystartplusplus.bblogs: 0
net.inet.tcp.cc.hystartplusplus.css_rounds: 5
net.inet.tcp.cc.hystartplusplus.css_growth_div: 4
net.inet.tcp.cc.hystartplusplus.n_rttsamples: 8
net.inet.tcp.cc.hystartplusplus.maxrtt_thresh: 16000
net.inet.tcp.cc.hystartplusplus.minrtt_thresh: 4000
net.inet.tcp.cc.available:
CCmod            D PCB count
cubic            * 30
vegas              0
net.inet.tcp.cc.algorithm: cubic
[25.03-BETA][admin@2100-3.stevew.lan]/root: sysctl net.inet.tcp.cc.algorithm=vegas
net.inet.tcp.cc.algorithm: cubic -> vegas
```
If that makes any difference at all it would be a good clue.
-
@stephenw10 I’m leaving the site now, so this might be a tad too experimental to enable when it will be months before I’m back (in case it all goes south). Since I’m not really transferring data in/out of the pfSense itself, this is not a major issue right now. I’ll have a further look when I return.
-
@stephenw10 but THANK YOU
for your invaluable knowledge and desire to help. You really are, indirectly, one of the qualities that make pfSense such a fantastic product.