Help analyzing performance bottleneck on Protectli FW4B

softwareplumber

I'm running pfSense 2.6 on a Protectli FW4B (Celeron J3160) with 8GB RAM and a 120GB SSD.

This has worked very well for months. Lately I've been running a new peer-to-peer application on my LAN which regularly opens many connections to many destinations on the WAN.

This causes pfSense to completely melt down: packet loss goes up to between 70 and 90%, and latency up to 2500ms.

Initially I thought this was a NAT issue so I posted originally about this in the NAT topic. I eliminated NAT as a problem by obtaining a new static IP address for the VM. Then, assuming that it was the size of the state table that was the problem, I created stateless rules to do the routing.

However, I still have the problem, and it is clearly NOT related to NAT. Monitor shows nothing untoward in memory, CPU, or mbuf usage. System CPU utilization spikes from 2% to a maximum of 7% when the problem occurs, but this doesn't look like a problem in itself.

The setup seems to perform well otherwise - I've seen close to 1Gb/sec throughput without issue, and the problem does not seem to be related to the actual number of bytes moving through the router.

Does anyone have any advice on identifying the actual problem? At this point my choices seem to be to buy new hardware (which I'm reluctant to do without understanding the problem) or try switching to a Linux-based router OS (an absolute last resort, I like pfSense).

Regards
Jon

stephenw10

Just how many states was it opening?

That's a 4-thread CPU, so even if all that load was on one core it's not maxed out. That said, 7% seems very low. What sort of throughput do you see when it happens?
If it's all tiny packets it's probably a pps limit rather than a bps limit.

Where are you actually seeing the packet loss and latency to?

Steve

A Former User

@softwareplumber said in Help analyzing performance bottleneck on Protectli FW4B:

Lately I've been running a new peer-to-peer application on my LAN which regularly opens many connections to many destinations on the WAN.

Could you tell us more about this software, so that
we can get a better picture of what is going on in your network?

softwareplumber

@stephenw10 It was hitting about 30,000 states when I was using NAT. Now that I've got the stateless ruleset working for that application, it runs at fewer than 2,000.

I'm mostly measuring things using the UI Monitor tool, graphing 'Quality' on my WAN interface against various metrics: CPU, states, memory.

This reports packet loss rates peaking at 50% and latency peaking at 2000ms. Disruption is typically noticeable for a couple of minutes.

Throughput during these events (measured on the WAN interface again) is at ~20Mb/s (inpass total) and ~ 5Mb/s (outpass total). This compares to significantly higher spikes in throughput when I stress the router in other ways (200Mb/s in and out), which are not associated with dropped packets.

Looking at packets on the WAN interface, I see about 5kp/s in and out during an event. I see the router handling peaks of 20kp/s when otherwise stressed, without issue.
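As a rough sanity check on the pps-versus-bps question, the average packet size can be worked out from the figures above. This is only a sketch: it assumes the throughput and packet rates quoted here are simultaneous averages on the WAN interface, and the helper name `avg_packet_bytes` is made up for illustration.

```python
def avg_packet_bytes(bits_per_second: float, packets_per_second: float) -> float:
    """Average packet size in bytes implied by a bit rate and a packet rate."""
    return bits_per_second / 8 / packets_per_second


# Approximate figures quoted in this post (read off the Monitor graphs).
event_in = avg_packet_bytes(20e6, 5e3)    # ~20Mb/s in at ~5kp/s during an event
event_out = avg_packet_bytes(5e6, 5e3)    # ~5Mb/s out at ~5kp/s during an event
stress = avg_packet_bytes(200e6, 20e3)    # ~200Mb/s at ~20kp/s under other stress

print(f"event, inbound:  {event_in:.0f} bytes/packet")
print(f"event, outbound: {event_out:.0f} bytes/packet")
print(f"stress test:     {stress:.0f} bytes/packet")
```

On these numbers the packets during an event are smaller than in the stress case, but the packet rate (5kp/s) is still well below the 20kp/s the router handles without loss.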

I don't see many packets being blocked at all because at this point I've tried to reduce my rules down to the absolute bare bones.

This has me stumped. There doesn't seem to be anything wrong. Except that it doesn't work.

softwareplumber

@dobby_ It's IPFS. Very chatty protocol, not unlike Torrent. Though at this point I'm not actually replicating any files.

I can throttle the application in various ways to make the problem go away, I am sure, but I'm likely to want to run a number of instances going forward so I'm keen to understand what is going on, rather than just hack away capabilities until it works.

stephenw10

Mmm, 30k states is nothing with 8GB RAM.

The WAN quality graph shows pings to the gateway IP by default. The first thing to try here is set the monitoring IP to something external like 8.8.8.8.
See: https://docs.netgate.com/pfsense/en/latest/routing/gateway-configure.html

Make sure you are really seeing that and not just the gateway dropping pings.
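One way to make that comparison concrete is to ping both the gateway and an external address during an event and compare the loss figures. The sketch below only parses the summary line that `ping` prints and applies a simple rule of thumb; the function names, the 5% threshold, and the sample summary lines are all assumptions for illustration, not measurements from this thread.

```python
import re


def parse_loss_percent(summary_line: str) -> float:
    """Extract the loss percentage from a ping summary line."""
    match = re.search(r"([\d.]+)% packet loss", summary_line)
    if match is None:
        raise ValueError(f"no packet loss figure in: {summary_line!r}")
    return float(match.group(1))


def classify(gateway_loss: float, external_loss: float, threshold: float = 5.0) -> str:
    """Rule of thumb: loss to an external host is what actually matters."""
    if external_loss >= threshold:
        return "loss on the path to the internet"
    if gateway_loss >= threshold:
        return "gateway is probably just dropping pings"
    return "no significant loss"


# Hypothetical summary lines, in the format ping prints at the end of a run.
gateway_line = "100 packets transmitted, 30 packets received, 70.0% packet loss"
external_line = "100 packets transmitted, 100 packets received, 0.0% packet loss"

verdict = classify(parse_loss_percent(gateway_line), parse_loss_percent(external_line))
print(verdict)
```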

Have you tried this without going through pfSense? It could be your ISP throttling it. Or one of those cable modems that collapses with a lot of UDP traffic.

Steve

softwareplumber

@stephenw10 Yeah, next step is to bypass the router completely. If it's the ISP I'll be mad.

I'm convinced the packet dropping issue is real since, whenever it happens, my wife (who works from home) reliably screams at me because her Zoom call has dropped or Gmail has stopped working. But I'll look at changing the IP for monitoring anyway.

Thanks for the help!
Jon

softwareplumber

@dobby_

When it starts, it attempts to open many connections (north of 500) to network peers. Periodically it attempts to refresh its peer connections, during which time it tries to open new connections before throwing away old or stale ones. A meltdown reliably occurs when IPFS starts up, and then periodically at intervals of exactly one hour. When I was running stateful rules I saw the number of states rising rapidly whenever the issue occurred, consistent with a large number of connections being opened.

The situation is somewhat complicated by the peer-to-peer nature of the protocol because when a new node advertises itself many peer nodes may try to connect back to it.

SteveITS

@softwareplumber said in Help analyzing performance bottleneck on Protectli FW4B:

If it's the ISP I'll be mad.

Who is the ISP? A while ago we confirmed with others and AT&T that their business fiber router has (or, had) a low limit. This was a note I had from 2018, based on emails from an AT&T rep:

"AT&T Business Fiber does not support true IPv6, but customers may use 6rd to facilitate IPv6 tunneling across IPv4 infrastructures.

AT&T Business Fiber does not support “true” bridge mode, however it does support IP Passthrough Mode.

The new AT&T Business Fiber modem we deployed, the BGW210, supports up to 8,000 concurrent IP sessions."

I found another note from someone else about "AT&T Broadband Fiber" allowing all of 2000.
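For scale, here is a quick comparison of those session limits against the roughly 30,000 states reported earlier in this thread. It is only a sketch: it assumes one pfSense state maps to about one session on the ISP device, which may not hold, and the ISP in this thread has not been identified as AT&T.

```python
# Peak state count reported earlier in the thread (with NAT enabled).
peak_states = 30_000

# Session limits from the notes above; the second is a second-hand figure.
session_limits = {
    "BGW210 (AT&T Business Fiber)": 8_000,
    "AT&T Broadband Fiber": 2_000,
}

# Ratio of peak states to each limit, assuming one state is about one session.
ratios = {name: peak_states / limit for name, limit in session_limits.items()}

for name, ratio in ratios.items():
    print(f"{name}: peak is {ratio:.2f}x the session limit")
```

If a similar limit applies upstream, a burst of that size would exceed it several times over, regardless of how pfSense itself is coping.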

re: "exactly one hour," there is a patch in the new System Patches package for "Disable pf counter data preservation to temporarily work around latency when reloading large rulesets (Redmine #12827)"