Clients experience interruptions and timeouts when using Multi WAN
-
Hey everyone,
I'm currently setting up Multi WAN on a school network to meet the bandwidth demand we are experiencing due to the current situation (lots of video conferencing going on at the moment, inbound and outbound).
Here is our network:

 100 Mbit   50 Mbit    50 Mbit    6 Mbit     16 Mbit
 Fritzbox   Fritzbox   Fritzbox   Fritzbox   bintec
  WAN 1      WAN 2      WAN 3      WAN 4      WAN 5
    |          |          |          |          |
    |          |          |          |          |
     \         |          |          |         /
      \        |          |          |        /
    =========================================
    ||               pfSense               ||
    ||      LAN 1             LAN 2        ||
    =========================================
           /                     \
          /                       \
  Internal network      USG (Unifi Security Gateway)
          |                 |       |       |
       Clients             AP1     AP2     AP3
                            |       |       |
                         Clients Clients Clients
We mainly use WAN1-3 (as they have most capacity) in a Gateway Group, WAN 4 and 5 are only for backup.
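A rough sketch of how the three main links could be weighted inside the Gateway Group, in case it helps the discussion. It assumes WAN 1-3 sit in the same tier and that connections are spread in proportion to each gateway's weight; with equal weights, the two 50 Mbit lines would receive as many connections as the 100 Mbit line. The function name `gateway_weights` is made up for this example, and the bandwidth figures are the ones from the diagram.

```python
from functools import reduce
from math import gcd


def gateway_weights(bandwidth_mbit):
    """Reduce link bandwidths to the smallest integer ratio.

    bandwidth_mbit maps a gateway name to its nominal bandwidth in Mbit/s.
    The result can serve as a starting point for per-gateway weights.
    """
    divisor = reduce(gcd, bandwidth_mbit.values())
    return {name: bw // divisor for name, bw in bandwidth_mbit.items()}


# Nominal bandwidths of the three main links (from the diagram).
main_links = {"WAN1": 100, "WAN2": 50, "WAN3": 50}

weights = gateway_weights(main_links)
total = sum(weights.values())
# Fraction of new connections each gateway would receive under this ratio.
shares = {name: w / total for name, w in weights.items()}

print(weights)
print(shares)
```

The ratio only reflects nominal bandwidth. If one of the routers gives up earlier than the others under load, its weight may need to be lower than the ratio suggests.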
On the network we have about 100 wired clients on LAN 1 and about 300-400 wifi clients connected through Unifi Access Points, which are controlled by a Unifi Controller in conjunction with a USG that is connected to pfSense on LAN 2. The WAN routers are all consumer routers called Fritzbox - we got them from the ISP, even though we are a school.
Here is the problem: since I started setting up Multi WAN, I seem to have made internet access for our wifi clients worse than it was before (when just WAN 1 and WAN 2 were connected to the USG, configured as load balancing).
Users are reporting repeated interruptions and timeouts (a website does not load for a number of seconds until it is refreshed). Often browsing works like a charm and is really quick, but when I go to random webpages, I get stuck roughly every 6th or 7th page load. Refreshing helps, but the problem keeps recurring, and the students are quite irritated when using Microsoft Teams, for example, with documents not loading or changes not syncing.
Also, during breaks, when all students are using the wifi at the same time, the gateways quickly become unresponsive, with packet loss and RTT rising quickly, so that the gateways are marked as offline.
Here are some screenshots I took during a 10 minute break.
The thing is, we've always experienced outages and slow connections during breaks. What is new is that interruptions and timeouts also occur when the load on the network is low, e.g. in the afternoon when few students are left at school.
Here are my questions:
- why do my WAN routers go down so quickly? Is it the fault of the consumer routers? Or is pfSense flooding them with too much traffic at a time? I've got traffic shaping in place, but it doesn't seem to help.
- how can the occasional timeouts be explained, especially during times with a low network load?
- how can I find bottlenecks on pfSense?
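As a back-of-the-envelope check on the first question, here is a small calculation of how much bandwidth each client gets if everyone is online at once. The link speeds and client counts are the ones given above; treating the link speeds as usable downstream capacity and assuming 1.5 Mbit/s per video stream are my own guesses, not measured values.

```python
def per_client_mbit(link_mbit, clients):
    """Average bandwidth per client if all clients are active at once."""
    return sum(link_mbit) / clients


def max_streams(link_mbit, stream_mbit):
    """Number of concurrent streams of a given bitrate the links can carry."""
    return int(sum(link_mbit) // stream_mbit)


main_links = [100, 50, 50]  # WAN 1-3 in Mbit/s, assumed usable downstream
STREAM_MBIT = 1.5           # assumed bitrate of one video call, not measured

# About 100 wired plus 300-400 wifi clients.
for clients in (400, 500):
    share = per_client_mbit(main_links, clients)
    print(f"{clients} clients: {share:.2f} Mbit/s each")

print("concurrent video streams:", max_streams(main_links, STREAM_MBIT))
```

If these guesses are anywhere near reality, the links themselves are saturated during breaks regardless of what the routers or pfSense do, which would fit the rising RTT and packet loss.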
Ideas I have:
- is the triple NAT (router, pfSense, USG) a problem?
- is the USG seen as one device (not as the 300 individual clients behind it), so that load balancing won't really kick in?
- is DNS resolution a problem? Clients query 8.8.8.8 and 1.1.1.1 directly (assigned by DHCP)
- playing around with the latency thresholds in the Gateway config
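To make the last idea more concrete, here is a simplified sketch of how a gateway monitor could classify a gateway from its measured RTT and packet loss. The default values (200/500 ms latency, 10/20 % loss) are what I believe pfSense ships with, but treat them as an assumption; the class and function names are made up, and the real monitor averages over a time window rather than looking at single samples.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Assumed defaults; check the values in your own gateway settings.
    latency_warn_ms: float = 200.0
    latency_down_ms: float = 500.0
    loss_warn_pct: float = 10.0
    loss_down_pct: float = 20.0


def classify(rtt_ms, loss_pct, t=None):
    """Return 'online', 'warning' or 'down' for one measurement."""
    t = t or Thresholds()
    if rtt_ms >= t.latency_down_ms or loss_pct >= t.loss_down_pct:
        return "down"
    if rtt_ms >= t.latency_warn_ms or loss_pct >= t.loss_warn_pct:
        return "warning"
    return "online"


# A saturated link during a break: high RTT, moderate loss.
print(classify(600, 5))
# The same measurement with relaxed latency thresholds.
relaxed = Thresholds(latency_warn_ms=500, latency_down_ms=1000)
print(classify(600, 5, relaxed))
```

With relaxed thresholds a busy but working link stays in the group instead of being marked offline, which would otherwise shift its load onto the remaining links and push them over the edge as well.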
Any suggestions will be much appreciated.
-
So it seems the problem was the triple NAT. I changed the topology so that clients, after authentication, are placed into a VLAN directly connected to pfSense, with pfSense acting as DHCP server. Clients no longer experience timeouts or interruptions, at least not when the network load is low.
The issue of the WAN routers going down fairly quickly remains, however, even though they withstand the load a bit longer than before.