Used pfSense as a router at a LAN event, went badly, trying to identify the issue.
-
A group of others and I were hosting a LAN party with 200 attendees, for which we had a 90/90 mbit connection. To manage this connection properly without spending a large amount of money, we ran pfSense as a virtual machine in VMware Workstation. We connected the WAN IP directly into the computer, so that pfSense would not be behind a router. On the LAN side, we connected pfSense to a 10/100 mbit switch, which was in turn connected to 11 other 10/100 switches as well as the hosting computer. Each switch was then connected to about 20 computers. I have attached a simplistic network diagram made in Paint to give a visual demonstration.
As the one with the most knowledge about networking and computers in general, I configured the LAN to run on 200.200.200.x/24, with 200.200.200.222 being the gateway. We had no DHCP running however, as we wanted all attendees to manually configure their IP to reflect their seat number, so that the person sitting at seat 163 would have an IP of 200.200.200.163, making it easy for us to identify potential misuse of the network.
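To make the seat-number scheme concrete, here is a minimal sketch of the mapping we asked attendees to follow. The function name `seat_to_ip` is just for illustration; the subnet and gateway are the ones described above.

```python
import ipaddress

# Values taken from the setup described above.
LAN = ipaddress.ip_network("200.200.200.0/24")
GATEWAY = ipaddress.ip_address("200.200.200.222")


def seat_to_ip(seat):
    """Return the static IP an attendee at the given seat should configure."""
    if not 1 <= seat <= 254:
        raise ValueError("seat must fit in the host part of a /24 (1-254)")
    addr = LAN.network_address + seat
    if addr == GATEWAY:
        raise ValueError("seat number collides with the gateway address")
    return str(addr)


if __name__ == "__main__":
    print(seat_to_ip(163))  # the example seat from the post
```

With 200 seats numbered 1 to 200, no seat collides with the gateway at .222.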
To limit the impact of people wanting to watch 1080p YouTube videos, we used floating rules with queues, with HFSC as our scheduler. The result of this was that all the online games had no connection issues and ran smoothly.
Needless to say, everything was running smoothly, until it didn't. Pretty much all attendees at the LAN party were playing League of Legends, but once enough people had set up their computers and started gaming, the network went into fits. Connections would drop randomly, then re-establish 2 minutes later in preparation for another disconnect; computers would have all the correct settings to connect to the internet, but were unable to reach anything, even without the exclamation mark on their connection icon. Until all this happened, everything went absolutely smoothly, with no latency spikes or anything of the sort for those playing online games.
We then tried to switch to a standard router. With this router everyone was able to connect to the internet, yet it was completely overloaded by the sheer volume of traffic, making for constant latency spikes. We then switched back to pfSense, and tried to work on a solution.
We were at a complete loss as to what the issue(s) were, and decided to give refunds to every attendee who wanted one. Once we were down to about 100 people, pretty much everyone could maintain a constant connection, though there were a few outliers who simply could not, even with everything correctly configured on their end.
Specifics:
pfSense version 2.0.3 32-bit.
Running in a virtual machine with 2 GB of RAM available.
Exclusive access to two 1 gbit ethernet ports.
Floating rules with queues managed by HFSC.
90/90 mbit fiber connection.
Any help is appreciated, and if needed, I will provide whatever details are required.
-
First thing I would check is if the state table was exhausted.
Do you still have the rrd graph histories from the time period in question? What do they show?
-
with 90/90 you should have at least had your primary switch and ports (router to switch) gigabit to get full speed.
States? as mentioned.
Another guess would be that some possibly had duplicated IP's on the network.
You guys used a non-standard subnet on your LAN. Probably not an issue here, but there are reasons that it's not a good idea.
Upstream device had issues.
We connected the WAN IP directly into the computer, so that pfSense would not be behind a router.
?? Can you explain this better?
-
with 90/90 you should have at least had your primary switch and ports (router to switch) gigabit to get full speed.
I realise so. Sadly, I was not involved in the procurement of either the primary switch or the secondary switches.
States? as mentioned.
I must admit that I am quite new to pfSense, so I have not configured the state table at all. Being at the default state table size, could this be the reason for our connection issues?
There was, afaik, no mention of state table size in any of the setup guides I followed, so either I overlooked it or simply did not know about there being a configurable limit to the IP-addresses that pfSense can handle.
Another guess would be that some possibly had duplicated IP's on the network.
I checked. I even tried running through the whole lot from 2 to 254 to see if that was the case, even if Windows did not detect a duplicate IP.
You guys used a non-standard subnet on your LAN. Probably not an issue here, but there are reasons that it's not a good idea.
I realise, I just wanted something, which (I thought) would pre-empt rogue DHCP-servers and such.
Upstream device had issues.
Could you clarify?
We connected the WAN IP directly into the computer, so that pfSense would not be behind a router.
?? Can you explain this better?
I meant that pfSense was "fed" the public IP, and wasn't behind any router, firewall or NAT.
Do you still have the rrd graph histories from the time period in question? What do they show?
I sadly do not.
-
Several things going on:
1. 200+ machines, with 12 - 10/100 switches (they still make these?), your network was saturated before you even began :)…. Gigabit layer 3 switches with separate vlans would've helped, but would've been a lot of work and expensive.
2. 200.200.200.x is publicly routable... should have used a private (RFC 1918) range such as 10.x.x.x or 192.168.x.x
3. With that many guests, should have run pfSense on bare metal.
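On point 2, a quick way to sanity-check a planned LAN subnet is Python's standard `ipaddress` module. This is only a sketch; the helper name `is_private_lan` and the 10.200.200.0/24 alternative are my own suggestions, not something from the original setup.

```python
import ipaddress


def is_private_lan(cidr):
    """True if the whole network sits in private/reserved address space."""
    return ipaddress.ip_network(cidr).is_private


if __name__ == "__main__":
    # The subnet used at the event, plus one hypothetical private
    # replacement that would keep the "last octet = seat number" idea.
    for cidr in ("200.200.200.0/24", "10.200.200.0/24"):
        print(cidr, "private" if is_private_lan(cidr) else "publicly routable")
```

Something like 10.200.200.x would have kept the seat-number scheme without shadowing real internet addresses.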
-
Could have been many things. In a perfect world 90/200 is 450Kbit/450Kbit for each connection assuming they are all drawing bandwidth at the same rate balanced across all of them. Streaming video will go up to >1Mbit and usually around ~3-7Mbit per video being watched (assume HD quality). If someone needed to download a patch and you weren't limiting traffic, you're going to get congested.
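The back-of-the-envelope numbers above can be reproduced with a few lines. This is just arithmetic on the figures already mentioned (90 Mbit link, 200 users, roughly 3-7 Mbit per HD stream); the function names are made up for the example.

```python
def fair_share_kbit(link_mbit, users):
    """Per-user bandwidth in Kbit/s if the link is split evenly."""
    return link_mbit * 1000 / users


def streams_until_full(link_mbit, stream_mbit):
    """How many video streams of a given bitrate fit in the link."""
    return int(link_mbit // stream_mbit)


if __name__ == "__main__":
    print("fair share per user:", fair_share_kbit(90, 200), "Kbit/s")
    for rate in (3, 5, 7):
        print(rate, "Mbit streams that fit in 90 Mbit:",
              streams_until_full(90, rate))
```

So somewhere between a dozen and thirty people watching HD video would be enough to fill the whole uplink on their own if nothing is shaping the traffic.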
I've never seen what the network consumption is like with a game like League of Legends, but it would have been prudent to fire up a group of users and do some measurements. As others have pointed out, you probably peaked not only your LAN but also got some WAN congestion and the state table probably filled right up.
I have 4GB in my pfSense box (bare metal) and when I run vulnerability scans from my LAN to WAN addresses I hit 350,000/380,000 states being used. So in your case I would have run a test group to pull some metrics and then would have been staring at the pfSense dashboard with another window open to the RRD graphs during the event. It would have been easier to mitigate once you saw exactly what was happening on the box at the time.
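To get a feel for how quickly a state table can fill with 200 clients, here is a rough sketch. Everything numeric in it apart from the 200 clients is an assumption for illustration: the per-client state counts are guesses, not measurements, and the 10,000 limit is a placeholder that should be replaced with whatever maximum your own box actually reports.

```python
# ASSUMPTION: placeholder state limit; check the real value on your firewall.
ASSUMED_STATE_LIMIT = 10_000


def estimate_states(clients, states_per_client):
    """Total firewall states if every client holds the same number."""
    return clients * states_per_client


def table_utilization(estimated_states, state_limit):
    """Fraction of the state table in use (1.0 means full)."""
    return estimated_states / state_limit


if __name__ == "__main__":
    # ASSUMPTION: hypothetical per-client state counts, light to heavy use.
    for per_client in (25, 50, 100):
        total = estimate_states(200, per_client)
        used = table_utilization(total, ASSUMED_STATE_LIMIT)
        print(f"{per_client} states/client -> {total} states, {used:.0%} of limit")
```

The point is only that a modest number of connections per machine multiplies up fast, which is why a test group and real measurements would have been worth the time.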
20/20 hindsight.
-
If you had posted while having these problems, then we could have offered some suggestions about how to do troubleshooting, e.g.
pfctl -sa
netstat -s
etc.
Now, after the fact, we can only speculate about the dozens of things that could have gone wrong.
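For next time, the state table numbers can be pulled out of saved pfctl output with something like the rough sketch below. It assumes the usual layout of `pfctl -si` (a "current entries" line under "State Table") and `pfctl -sm` (a "states hard limit" line); the sample text and its numbers are made up purely to exercise the parser, and the function names are hypothetical.

```python
import re

# Made-up sample text in the assumed layout; not real output from any box.
SAMPLE_INFO = """State Table                          Total             Rate
  current entries                     9876
  searches                         1234567          120.5/s
"""
SAMPLE_MEMORY = """states        hard limit    10000
src-nodes     hard limit    10000
"""


def current_states(pfctl_info_text):
    """Return the 'current entries' count from `pfctl -si` text, or None."""
    for line in pfctl_info_text.splitlines():
        match = re.match(r"\s*current entries\s+(\d+)", line)
        if match:
            return int(match.group(1))
    return None


def states_hard_limit(pfctl_memory_text):
    """Return the states hard limit from `pfctl -sm` text, or None."""
    for line in pfctl_memory_text.splitlines():
        match = re.match(r"\s*states\s+hard limit\s+(\d+)", line)
        if match:
            return int(match.group(1))
    return None


if __name__ == "__main__":
    used = current_states(SAMPLE_INFO)
    limit = states_hard_limit(SAMPLE_MEMORY)
    print(f"{used} of {limit} states in use ({used / limit:.0%})")
```

If the current entries sit at or near the hard limit while people are being disconnected, that points strongly at state table exhaustion.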