Box seems to core dump after campus power outage

wallabybob

I also suggest an upgrade to 1.2.3. If you find a software problem in 1.2.2, it's unlikely you'll find a volunteer to fix it.

Can you get a "maintenance window" (weekend, after midnight, …) in which you can do some controlled "power outages" to get more information on what is happening (capture console messages, etc.)?

I think (I don't recall the details) that it's possible to switch the console output to the serial port. That output could then be captured on another machine (e.g. a laptop), which might give you a bit more history of what was happening before the system core dumps and so help decide on appropriate responses.
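As a rough illustration of the capture side, here is a minimal Python sketch that timestamps each line read from a byte stream and appends it to a log. The function name `log_console` is hypothetical, and how the serial port gets opened on the laptop (device name, baud rate, library) is deliberately left out because it depends on the setup; any object with a `readline()` method that returns bytes should work.

```python
import io
import time


def log_console(stream, out, clock=time.time):
    """Copy lines from a binary stream to a text log, one timestamp per line.

    `stream` is any object whose readline() returns bytes (ASSUMPTION: the
    serial port has already been opened elsewhere). The loop stops when
    readline() returns an empty bytes object (end of data or a read timeout).
    Returns the number of lines written.
    """
    count = 0
    for raw in iter(stream.readline, b""):
        # Console output may contain stray bytes; replace rather than fail.
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        out.write(f"{stamp} {text}\n")
        out.flush()  # keep the log current in case the capture is interrupted
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in for a serial port: two made-up console lines.
    fake_port = io.BytesIO(b"booting\r\nexample console line\r\n")
    log = io.StringIO()
    log_console(fake_port, log)
    print(log.getvalue(), end="")
```

The timestamps are the point: they let you line up the last console messages with the time the power came back.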

djamp42

Get a stronger box. For 2000 users I would spec a dual or quad core with at least 1 GB of memory. It all depends on what you're doing, but I would say your box is not strong enough for that many users. I always run into this problem with captive portal and 1,000+ users hitting the portal all at once. With captive portal off, I never have the problem.

rklopoto

I'll try to upgrade to 1.2.3. It's tough getting maintenance windows in the middle of the semester, but I'll try loading the config onto another CF card and swapping it quickly.

As far as the campus goes, it's roughly 20 buildings, each with its default gateway inside the building (a layer 3 switch), which eventually uses this box as the default internet gateway after 3 or 4 routes. I can't put all of these buildings on UPS, although some already are. Sometimes the power outage is long enough that a small UPS would get killed anyhow. Doing a simulated power outage would be quite difficult as well.

The box isn't doing any captive portal, DHCP, or DNS. I am using it strictly for the NAT/Firewall feature.

I'll have to do more testing, but in the meantime, I'm likely to upgrade.

One more thing, from what I remember, there was a ctrl sequence that I was able to press on the console that showed a ps/top type output whenever it was in this state. I can't remember what it was, but it shows that the system was at least doing something.

cmb

I doubt that upgrading or throwing hardware at it is going to change anything. I suspect that when all the machines power up at once, they open more connections than your 30,000-entry state table can hold. That's very easy to do with 2000 systems, and 30,000 is generally much too small for that many. With 512 MB RAM you should be fine with 200,000 states, depending on what else you're running. Check your states RRD graph; I'm sure you'll see it hitting 30,000 after the power comes back on.
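To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 2000 clients, the 30,000 limit, and the 200,000 suggestion come from this thread. The figure of roughly 1 KB of RAM per state is an assumption (a commonly quoted rule of thumb), so treat the memory estimate as approximate.

```python
CLIENTS = 2000             # number of systems, from the thread
CURRENT_LIMIT = 30_000     # state table size currently configured
PROPOSED_LIMIT = 200_000   # size suggested above
BYTES_PER_STATE = 1024     # ASSUMPTION: about 1 KB per state (rule of thumb)


def states_per_client_to_fill(limit, clients):
    """Average simultaneous states per client needed to reach the limit."""
    return limit / clients


def table_memory_mib(limit, bytes_per_state=BYTES_PER_STATE):
    """Approximate RAM used by a completely full state table, in MiB."""
    return limit * bytes_per_state / (1024 * 1024)


if __name__ == "__main__":
    for limit in (CURRENT_LIMIT, PROPOSED_LIMIT):
        per_client = states_per_client_to_fill(limit, CLIENTS)
        mem = table_memory_mib(limit)
        print(f"limit {limit}: full at {per_client:.0f} states/client, "
              f"about {mem:.0f} MiB when full")
```

With these inputs, the 30,000 limit is reached at an average of only 15 states per client, while 200,000 allows 100 per client and, under the 1 KB assumption, would use a bit under 200 MiB when completely full.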

rklopoto

Shoot, since I'm running the embedded version, I have the RRDs saving to disk only twice per day via cron. If I kill the power or hard reboot after the power outage, I lose those graphs.

I may try running more states. I know I have enough memory to handle them, so maybe that will do the trick. We average between 6-10K states normally, so a burst of more than 30K is surely possible.

Also, in my case, all HTTP traffic goes outbound through separate proxy servers, so their states never make it into the state table. I bet that initially, since most clients are now laptops and never lose power, the state table limit is hit much sooner than in the older days when there were more desktops.

jaime

Did you make sure that, if your PF box is limiting the states it handles at any given time, you have the limit set high enough that it will not start to freak out when it gets that 30K burst?

The reason I ask is that I took my PF box and set the state limiter to 50 (I think it's more for how many states you get before it starts dumping them to make room for new ones, but I may not be understanding that correctly). I only need to worry about 5 machines at most, and with all 5 machines I know I hit around 100+ (my machine alone can average around 20-50 depending on what I am doing). It started freaking out, but once it got sorted it was fine (and I reset it back to default values).

wallabybob

@rklopoto:

One more thing, from what I remember, there was a ctrl sequence that I was able to press on the console that showed a ps/top type output whenever it was in this state. I can't remember what it was, but it shows that the system was at least doing something.

Probably ctrl-T.

cmb

@jaime:

The reason I ask is that I took my PF box and set the state limiter to 50 (I think it's more for how many states you get before it starts dumping them to make room for new ones, but I may not be understanding that correctly). I only need to worry about 5 machines at most, and with all 5 machines I know I hit around 100+ (my machine alone can average around 20-50 depending on what I am doing). It started freaking out, but once it got sorted it was fine (and I reset it back to default values).

Loading a single content-laden web page can take 20-50 states, depending on how many different servers it pulls ads, images, CSS, etc. from. Setting the limit to anything less than 10,000 is a bad idea unless you actually want to drop connections. Most browsers also use persistent connections, so they won't necessarily close them immediately. When the limit is hit, it won't accept any new connections. Existing ones are never just dropped; the connections either have to be closed or time out.
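The behavior described above can be sketched with a toy model in Python. This is not how pf is implemented; the class `ToyStateTable`, its method names, and the single idle timeout are all made up for illustration. It only shows the rule that a full table refuses new connections while existing ones stay until they are closed or time out.

```python
class ToyStateTable:
    """Toy model of a fixed-size state table (NOT pf's real implementation)."""

    def __init__(self, max_states, timeout):
        self.max_states = max_states  # hard limit on tracked connections
        self.timeout = timeout        # idle seconds before a state expires
        self.states = {}              # connection id -> last-seen time

    def expire(self, now):
        """Drop states that have been idle for at least `timeout` seconds."""
        stale = [c for c, seen in self.states.items()
                 if now - seen >= self.timeout]
        for conn in stale:
            del self.states[conn]

    def open(self, conn, now):
        """Try to track a connection; return False if the table is full."""
        self.expire(now)
        if conn in self.states:
            self.states[conn] = now   # existing state: refresh its timer
            return True
        if len(self.states) >= self.max_states:
            return False              # limit hit: new connection refused
        self.states[conn] = now
        return True

    def close(self, conn):
        """Remove a state when its connection is closed."""
        self.states.pop(conn, None)


if __name__ == "__main__":
    table = ToyStateTable(max_states=3, timeout=60)
    print([table.open(c, now=0) for c in "abc"])  # table fills up
    print(table.open("d", now=1))                 # refused, table is full
    table.close("a")                              # one connection closes
    print(table.open("d", now=2))                 # now there is room
    print(table.open("e", now=61))                # b and c have timed out
```

In the demo, connection `d` is refused while `a`, `b`, and `c` hold the three slots, and it only gets in after `a` closes. None of the existing connections is ever evicted to make room.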

jaime

OK, that's what I was figuring, just wanted to be sure :)

trentdk

rklopoto: Do you have to use the embedded version? I ran embedded for a while too because I didn't want to worry about a HDD crashing on me, but I eventually switched due to its limitations, and life is so much easier now :)

I'm thinking 1.2.3 on a HDD and cmb's 200,000 state recommendation.

rklopoto

I don't HAVE to use the embedded version. I switched to CF because of a high rate of hard drive failures 3 or 4 years ago. I'm running desktop hardware, so there is plenty of room for a disk if I wanted to put one in. Is there a reason I shouldn't be?

cmb

@rklopoto:

I don't HAVE to use the embedded version. I switched to CF because of a high rate of hard drive failures 3 or 4 years ago. I'm running desktop hardware, so there is plenty of room for a disk if I wanted to put one in. Is there a reason I shouldn't be?

If you're fine with it, that's fine. Some people like to run packages that just can't be run from CF.