Box seems to core dump after campus power outage
-
Hello,
I have a 1.2.2 box that has been running for quite some time on a campus network with roughly 2000 clients connecting through it. It is the default gateway for all of these clients. For the most part this machine has run flawlessly for several years. Recently I have noticed an issue that has been bothering me. Any time the campus has a power outage, within a short time frame of the power coming back on, the pfsense box seems to core dump. The pfsense machine is on a UPS, but all of the switches that connect our clients are not. I think the sudden onslaught of traffic seems to be overwhelming the machine, but I can't prove it.
I should mention I'm running the embedded version on a Dell Optiplex GX260 with 512MB RAM. I have the firewall set to 30,000 states.
If I connect a terminal to the box, I don't get a shell or menu or anything like that, but the kernel still seems to be running in some fashion.
I don't really know what to look for, or how to figure out what is going on. If someone else has seen this, let me know if you figured out what was going on.
-
How about upgrading to the current release (1.2.3)?
-
If possible, you should try to isolate the box on a small segment of the network, simulate an outage, and then reconnect it so it gets an onslaught of traffic. That would confirm that part of your issue. As for the other part, I would suggest upgrading to the latest release, 1.2.3, and seeing if the issue still occurs. If it does, it's most likely that a massive amount of traffic is being thrown at it, and like a floodgate it can't handle it all at once, so the water spills over and goes where it can to relieve pressure. If that's the case, you may want to get a few of the switches on UPS as well (for a campus, why aren't they on UPS already?). That would lessen the "flow" thrown at it when the power is restored, since it wouldn't be trying to handle 2,000+ people and devices, and handing out IP addresses via DHCP (or however you have that set up), all at the same time, bogging things down and causing the system to essentially "panic" or, in your case, dump.
Hope this helps. I am more of a tech/hardware guy, so I am looking at this from a hardware standpoint, but I hope this is at least some help to you.
-
I also suggest an upgrade to 1.2.3. If you find a software problem in 1.2.2, it's unlikely you'll find a volunteer to fix it.
Can you get a "maintenance window" (weekend, after midnight, …) in which you can do some controlled "power outages" to get more information on what is happening (capture console messages, etc.)?
I think (I don't recall the details) that it's possible to switch the console output to the serial port. That output could then be captured on another machine (e.g. a laptop), which might give you a bit more history of what was happening before the system core dumps and so help decide on appropriate responses.
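A minimal sketch of the capture side, assuming the serial console output already arrives on the other machine as a line-oriented text stream (how the port is opened and configured is left out). The function name `timestamp_lines` is made up for illustration. It only prefixes each console line with the capture machine's clock, so messages can later be lined up against the time the power came back.

```python
import time


def timestamp_lines(stream, sink, clock=time.time):
    """Copy lines from stream to sink, prefixing each with a local timestamp.

    stream: any iterable of text lines (hypothetical: the serial console feed)
    sink:   any object with write() and flush() (a log file, stdout, ...)
    clock:  function returning seconds since the epoch; replaceable for testing
    Returns the number of lines copied.
    """
    count = 0
    for line in stream:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        # Strip trailing CR/LF so serial line endings do not leave blank lines.
        sink.write(f"{stamp} {line.rstrip()}\n")
        # Flush every line so the log is complete up to the moment of the crash.
        sink.flush()
        count += 1
    return count
```

One way to use it would be to pipe the output of whatever terminal program reads the serial port into a small script that calls `timestamp_lines(sys.stdin, open("console.log", "a"))`.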
-
Get a stronger box. For 2000 users I would spec a dual or quad core with at least 1 GB of memory. It all depends on what you're doing, but I would say your box is not strong enough for that many users. I always run into this problem with captive portal and 1,000+ users hitting the portal all at once. With captive portal off, I never have the problem.
-
I'll try to upgrade to 1.2.3. It's tough getting maintenance windows in the middle of the semester, but I'll try loading the config onto another CF card and swapping it quickly.
As far as the campus goes, it's roughly 20 buildings, each with a default gateway inside the building (a layer 3 switch), which eventually reaches this box after 3 or 4 routes as the default internet gateway. I can't put all of these buildings on UPS, although some already are. Sometimes the power outage is long enough that a small UPS would get drained anyhow. Doing a simulated power outage would be quite difficult as well.
The box isn't doing any captive portal, DHCP, or DNS. I am using it strictly for the NAT/Firewall feature.
I'll have to do more testing, but in the meantime, I'm likely to upgrade.
One more thing, from what I remember, there was a ctrl sequence that I was able to press on the console that showed a ps/top type output whenever it was in this state. I can't remember what it was, but it shows that the system was at least doing something.
-
I doubt that upgrading or throwing hardware at it is going to change anything. I suspect that when all the machines power up at once, they open more connections than your 30,000-state table can hold. That's very easy with 2000 systems, and 30,000 is generally much too small for that many. With 512 MB RAM you should be fine with 200,000 states, depending on what else you're running. Check your states RRD graph. I'm sure you'll see it hitting 30,000 after the power comes back on.
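A rough sketch of the arithmetic behind those numbers. It assumes about 1 KB of RAM per state entry, which is a commonly quoted rule of thumb and not something measured on this box. The function name `state_table_budget` is made up for illustration.

```python
BYTES_PER_STATE = 1024  # assumption: roughly 1 KB of RAM per state entry


def state_table_budget(max_states, ram_mb, clients, bytes_per_state=BYTES_PER_STATE):
    """Estimate worst-case state table memory and the per-client share of the limit."""
    table_mb = max_states * bytes_per_state / (1024 * 1024)
    return {
        "table_mb": round(table_mb, 1),                # RAM used if the table is full
        "ram_fraction": round(table_mb / ram_mb, 3),   # share of total RAM
        "states_per_client": max_states // clients,    # even split across clients
    }


# Figures from this thread: 512 MB RAM, roughly 2000 clients.
for limit in (30_000, 200_000):
    print(limit, state_table_budget(limit, ram_mb=512, clients=2000))
```

Under that assumption a full 30,000-state table is about 29 MB and leaves only 15 states per client, while 200,000 states is about 195 MB (under 40% of 512 MB) and allows 100 per client.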
-
Shoot, since I'm running the embedded version, I have the RRDs saving to disk only twice per day via cron. If I kill the power or hard reboot after the power outage, I lose the graph data.
I may try running more states. I know I have enough memory to handle them, so maybe that will do the trick. We average between 6K and 10K states normally, so a burst above 30K is surely possible.
Also, in my case, all HTTP traffic goes outbound through separate proxy servers, so those states never make it into the state table. I bet that, since most clients are now laptops and never lose power, the state table limit is hit much sooner than in the older days when there were more desktops.
-
Did you make sure that, if your PF box is limiting the states it handles at any given time, the limit is set high enough that it won't start to freak out when it gets that 30K burst?
The reason I ask is that I took my PF box and set the state limiter to 50, since I only need to worry about 5 machines at most. (I think the limiter controls how many states you get before it starts dumping them to make room for new ones, but I may not be understanding that correctly.) With all 5 machines I know I hit around 100+ (my machine alone can average around 20-50 depending on what I am doing), and it started freaking out, but once it got sorted it was fine (and I reset it back to the default values).
-
The ctrl sequence you remember, the one that showed ps/top-type output on the console, is probably ctrl-T.
-
Loading a single content-laden web page can take 20-50 states, depending on how many different servers it pulls ads, images, CSS, etc. from. Setting the state limit to anything less than 10,000 is a bad idea unless you actually want to drop connections. Most browsers use persistent connections, so they won't necessarily close them immediately. When the limit is hit, the firewall won't accept any new connections. Existing ones are never just dropped; they either have to be closed or time out.
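To put the 20-50 figure in perspective, here is a small sketch of how few clients it takes to fill a table if each one loads a single page at the same moment. It is illustrative only: it assumes every client's web traffic creates states directly on the firewall (not the case where HTTP goes through separate proxies), and the function name `clients_to_fill` is made up.

```python
import math


def clients_to_fill(limit, states_per_client):
    """Smallest number of simultaneously active clients whose states reach the limit."""
    return math.ceil(limit / states_per_client)


for limit in (10_000, 30_000, 200_000):
    heavy = clients_to_fill(limit, 50)  # pages needing about 50 states each
    light = clients_to_fill(limit, 20)  # pages needing about 20 states each
    print(f"limit {limit}: full after {heavy} to {light} clients load one page each")
```

With a 30,000 limit that is 600 to 1500 clients, which is fewer than the 2000 on the campus in question. With 200,000 it takes 4000 to 10,000.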
-
OK, that's what I was figuring, just wanted to be sure :)
-
rklopoto: Do you have to use the embedded version? I ran embedded for a while too because I didn't want to worry about an HDD crashing on me, but I eventually switched due to its limitations, and life is so much easier now :)
I'm thinking 1.2.3 on an HDD and cmb's 200,000 state recommendation.
-
I don't HAVE to use the embedded version. I switched to CF because of a high rate of hard drive failures 3 or 4 years ago. I'm running desktop hardware, so there is plenty of room for a disk if I wanted to put one in. Is there a reason I shouldn't be?
-
If you're fine with it, that's fine. Some people like to run packages that just can't be run from CF.