Slow WAN when there are lots of OUTBOUND connections (from 40k-80k).

JustADeveloper

Hello,

Normally, I get 4-6MBytes/second downloads from a webserver behind a pfSense firewall, but under certain conditions that drops to 10-100KBytes/second (tested from several locations with different equipment under both conditions). Under those "degraded" conditions, the firewall still shows almost zero CPU usage (9% +/- 10), almost zero memory usage (4% +/- 4), 0% MBUF usage, and the WAN traffic graph goes from 10M to about 15M. Internal LAN traffic roughly mirrors the load on the WAN. The only other change is the states table, which goes from 40% to 60%. Ping times to the servers remain low at <10ms under both conditions.

The firewall hardware is a Dell 610 (dual 2.8GHz; 24 total cores; 48G RAM) connected over a 1Gb WAN connection. The firewall runs pfSense 2.4.2-RELEASE with a very minimal configuration. I have inbound openvpn, http and https. Outbound is NATed. I've tried all permutations of "Hardware Checksum Offloading", "Hardware TCP Segmentation Offloading", and "Hardware Large Receive Offloading", and we get the same degraded performance. We've even lifted the upstream bandwidth caps for several hours and across several reboots, and we've still seen the issue.

There is nothing that indicates a problem as far as any logs or system tables I can see, and we're well under our cap of 1G/s. The firewall was running 600+ days without any issues. What could be the problem?

The problem arose when we attempted to double the number of OUTBOUND connections from about 40K to 80K (software behind the firewall connecting to client hardware in the wild [custom TCP binary protocol]). The connections refresh every 3 minutes (connect/disconnect cycle). The data from each connection is pretty small and the overall load in both cases is less than the available bandwidth. We can make the problem come and go by changing the number of outbound connections. If it's not the actual bandwidth that's causing the problem, then it's the volume of sockets being opened and closed to the clients. If that's the case, I would have expected the states table to be overwhelmed, but its usage is relatively low.
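As a rough back-of-the-envelope check on the churn described above, the sketch below turns the connection counts and the 3-minute cycle into an average rate of new connections per second. It assumes the reconnects are spread evenly over the cycle (bursts would be worse) and that each connection costs the firewall at least one state insert and one state removal; the function name is made up for illustration.

```python
# Figures from the post: 40K vs 80K outbound connections, refreshed every 3 minutes.
CYCLE_SECONDS = 3 * 60


def new_connections_per_second(connections: int, cycle_seconds: float = CYCLE_SECONDS) -> float:
    """Average rate of new outbound connections.

    Assumption: reconnects are spread evenly over the cycle.
    """
    return connections / cycle_seconds


for total in (40_000, 80_000):
    rate = new_connections_per_second(total)
    # Assumption: at least one state insert and one state removal per connection.
    print(f"{total} connections: ~{rate:.0f} new/s, at least ~{2 * rate:.0f} state inserts+removals/s")
```

So doubling the connection count takes the average from roughly 222 to roughly 444 new connections per second, even though the bandwidth involved stays small.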

Any thoughts on what could be causing the bottleneck when the memory buffers are empty and cpu load is low?

Thanks!

SteveITS

Are you using shaping? This thread was posted recently: https://forum.netgate.com/topic/150127/traffic-shaper-reduces-bandwith (note we've not seen that symptom)

Are you using any packages?

If you ssh to it, open a shell (option 8) and run "top", does it show any processes using 100% of their CPU?

stephenw10

You should upgrade to 2.4.4p3 when you can. 2.4.3 to 2.4.4 is quite a significant step though. Is there some reason you're still on 2.4.2?

You should check the rate of state changes when this happens. If they are all TCP connections with tiny amounts of data, they might all be closing quickly, so the total might not look excessive.
The Monitoring Graph for States would show that.
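To illustrate why the total can look modest while the rate is high, here is a minimal sketch using Little's law (average entries = arrival rate x average lifetime). The 80K connections and 3-minute cycle come from the original post; the state lifetimes are purely hypothetical values, not measured pf timeouts.

```python
def standing_states(new_per_second: float, state_lifetime_seconds: float) -> float:
    """Little's law: average number of entries = arrival rate x average lifetime."""
    return new_per_second * state_lifetime_seconds


# From the original post: 80K connections refreshed every 3 minutes.
rate = 80_000 / 180

# Hypothetical average lifetimes of a state entry, in seconds (assumed values).
for lifetime in (5, 30, 120):
    states = standing_states(rate, lifetime)
    print(f"lifetime {lifetime:>3}s -> ~{states:,.0f} states in the table at any moment")
```

The same insert rate gives very different table sizes depending on how long each state lives, which is why the rate graph is worth checking alongside the total.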

4-6MBps seems low for a 1Gbps connection, what is limiting that?

Steve

JustADeveloper

There is no shaping or limiting of any kind on the firewall. We don't actually use or generate that much traffic, but there is a lot of opening and closing of TCP sockets, and lots of small packets.

The only package installed on the firewall is openvpn-export-client.

In top, pfctl intermittently bounces around 100% and then disappears: maybe 20-30 seconds on and then 20-30 seconds off.

Thanks for the responses, this is giving me stuff to look at!