Download-speed drops to 0 when pfSense statesync is enabled
-
Repost from Reddit
We have two firewalls (same hardware with the same IF config) with the latest pfSense version installed. They are configured for HA (CARP/statesync/confsync).
Now, if we download bigger files (>1GB) the download-speed is fast at first and suddenly drops to 0. (Smaller downloads or "normal" communication work as expected.)
When we fail over from one firewall to the other (this works flawlessly), downloads work for a few seconds, then the behaviour is the same. So the HA itself works, but it seems to affect throughput on big downloads.
From the provider's perspective, the firewall stops continuing the TCP flow (i.e. ACKs are suddenly missing).
We found two workarounds:
- If we shut down the current backup firewall, downloads work again.
- Or, if we disable statesync, downloads work normally.
Now we wonder why that is. Our HA setup seems to follow best practices. Does my description sound familiar to you? Do you have any quick advice for us?
CPU and other resources are under low load.
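To pin down exactly when a download stalls (for example, to line it up with sync traffic or a failover), one option is to log the cumulative bytes received at regular intervals and flag the first stretch with no progress. This is only a minimal sketch: the function name, the sample format and the 5-second threshold are assumptions for illustration, not something pfSense provides.

```python
from typing import Optional, Sequence, Tuple


def first_stall(
    samples: Sequence[Tuple[float, int]],
    min_stall_s: float = 5.0,
) -> Optional[float]:
    """Return the timestamp at which a download stopped making progress.

    samples: (timestamp_seconds, cumulative_bytes_received), sorted by time.
    A stall is a period of at least min_stall_s without any new bytes.
    Returns None if no such period is found.
    """
    stall_start = None
    for (t_prev, b_prev), (t_cur, b_cur) in zip(samples, samples[1:]):
        if b_cur > b_prev:
            # Bytes arrived in this interval, so any earlier pause is over.
            stall_start = None
            continue
        if stall_start is None:
            # First interval without progress: remember where it began.
            stall_start = t_prev
        if t_cur - stall_start >= min_stall_s:
            return stall_start
    return None
```

Feeding it samples taken once per second during a large download would give the moment the transfer froze, which can then be compared against the firewall logs.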
-
We moved the sync interface to a dedicated physical nic. Now the issue seems to be gone.
What I find weird is that the symptoms were so extreme. I'm now looking for ways to read metrics from the NIC itself, because the obvious metrics (packets/s, mb/s, CPU etc.) never showed any strange behaviour.
-
Hmm, I've always used dedicated interfaces. Per https://docs.netgate.com/pfsense/en/latest/highavailability/pfsync.html the sync interface doesn't have to be dedicated, but sharing it raises security concerns, and sync traffic "could [use] as high as 10% of the throughput traversing the firewall."
Also: https://docs.netgate.com/pfsense/en/latest/recipes/high-availability.html
-
@teamits Yeah, it was definitely a good idea to move the sync IF to a dedicated IF. Even when we're not downloading anything big, there is 30 Mbit/s of sync noise. When downloading big files, syncing takes a lot more bandwidth. So it obviously has an impact when the sync traffic shares an IF with other traffic.
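A rough way to put numbers on this is to combine the 30 Mbit/s idle sync load with the "as high as 10%" figure quoted from the docs and see how much room is left on a shared link. The sketch below is a worst-case estimate only: adding the idle load on top of the 10% share is a simplifying assumption, and the function names and the 1000 Mbit/s link in the usage note are hypothetical.

```python
def estimate_sync_mbit(
    traffic_mbit: float,
    baseline_mbit: float = 30.0,   # idle sync load reported in this thread
    sync_fraction: float = 0.10,   # upper bound quoted from the docs
) -> float:
    """Worst-case sync bandwidth for a given forwarded throughput (assumed model)."""
    return baseline_mbit + sync_fraction * traffic_mbit


def headroom_mbit(link_mbit: float, traffic_mbit: float) -> float:
    """Capacity left on a link that carries both forwarded and sync traffic.

    A negative result means the shared link would be oversubscribed
    under this worst-case estimate.
    """
    return link_mbit - traffic_mbit - estimate_sync_mbit(traffic_mbit)
```

For example, `headroom_mbit(1000.0, 900.0)` comes out negative under these assumptions, i.e. a download close to line rate plus its sync traffic would no longer fit on a shared 1 Gbit/s interface.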
My worry is this: if something uses bandwidth, everything should get slower, but nothing should break. So even if the sync traffic shares its IF with other traffic, it should not behave like this. The download should just get slower, not drop to 0 (while other traffic still works).
So what am I missing here? The problem is solved, yes, but only by trial and error. Maybe I can find out which metrics I have to rely on to decide whether an IF is overloaded.
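One candidate set of metrics is the per-interface error and drop counters (the kind of counters `netstat -i` reports), since these can rise while packets/s and bandwidth graphs still look normal. Below is a minimal sketch of comparing two snapshots of such counters. How the snapshots are collected is left out, and the counter names used here are hypothetical placeholders, not actual pfSense field names.

```python
from typing import Dict, List, Sequence


def counter_rates(
    before: Dict[str, int],
    after: Dict[str, int],
    interval_s: float,
) -> Dict[str, float]:
    """Per-second increase of each counter present in both snapshots."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return {
        name: (after[name] - before[name]) / interval_s
        for name in before
        if name in after
    }


def rising_counters(
    rates: Dict[str, float],
    # Hypothetical counter names; map them to whatever the system exposes.
    watch: Sequence[str] = (
        "input_errors",
        "output_errors",
        "input_drops",
        "output_drops",
    ),
) -> List[str]:
    """Names of watched error/drop counters that increased during the interval."""
    return [name for name in watch if rates.get(name, 0.0) > 0]
```

Taking one snapshot before a large download and one during the stall, then checking `rising_counters(counter_rates(before, after, interval_s))`, would show whether the shared interface was dropping packets even though the usual graphs looked fine.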
-
Just for your info: we've now seen the issue on multiple installations (even with different hardware and pfSense versions) and could solve it on every single system by moving the sync VLAN to a dedicated physical interface.