Severe Problems on ALIX 2D13: high cpu load, out of memory/swap

setchi

Hello,

I have two ALIX 2D13 boxes and did an upgrade from 1.2.3 to 2.0 on one box and a fresh install of 2.0 on the other box.
Since the upgrades I have severe problems with processes (like php) dying because of low memory or swap space and the overall performance of the webgui is very poor (takes 1 minute to load pages, displays error 500 randomly, have php-related error messages on the top of the page, etc.)

I have no additional packages installed and just some very simple filtering rules and the box has a load value between 3 and 7.
Whenever I try to install packages, the process never finishes and the webgui stops loading on a certain point.

I have error messages in system.log:

Dec  3 14:55:53 gateway kernel: pid 9536 (php), uid 0, was killed: out of swap space
Dec  3 14:55:53 gateway kernel: pid 17096 (php), uid 0, was killed: out of swap space
Dec  3 13:56:42 gateway check_reload_status: Syncing firewall
Dec  3 14:56:42 gateway php: /pkg_mgr_install.php: Beginning package installation for OpenVPN Client Export Utility.
Dec  3 14:57:15 gateway kernel: pid 12524 (php), uid 0, was killed: out of swap space
Dec  3 14:57:15 gateway kernel: pid 42267 (php), uid 0, was killed: out of swap space

lighttpd.error.log

2011-12-03 14:43:33: (mod_fastcgi.c.2566) unexpected end-of-file (perhaps the fastcgi process died): pid: 51917 socket: unix:/tmp/php-fastcgi.socket-19
2011-12-03 14:43:33: (mod_fastcgi.c.3354) response not received, request sent: 492 on socket: unix:/tmp/php-fastcgi.socket-19 for /preload.php?, closing connection
2011-12-03 14:46:26: (mod_fastcgi.c.2566) unexpected end-of-file (perhaps the fastcgi process died): pid: 51917 socket: unix:/tmp/php-fastcgi.socket-19
2011-12-03 14:46:26: (mod_fastcgi.c.3369) response already sent out, but backend returned error on socket: unix:/tmp/php-fastcgi.socket-19 for /pkg_mgr.php?, terminating connection
2011-12-03 14:50:16: (mod_fastcgi.c.2566) unexpected end-of-file (perhaps the fastcgi process died): pid: 51917 socket: unix:/tmp/php-fastcgi.socket-19
2011-12-03 14:50:16: (mod_fastcgi.c.3369) response already sent out, but backend returned error on socket: unix:/tmp/php-fastcgi.socket-19 for /pkg_mgr_install.php?id=OpenVPN%20Client%20Export%20Utility, terminating connection

Anyone has similar issues? Any hints on how to find out the problem?
Thanks in between.

EDIT: I should have mentioned that I use 4GB CF-Cards ;)

stephenw10

Running from CF you should be using the NanoBSD images which don't have swap at all so I'm surprised to see that.
Are you running NanoBSD?

Steve

setchi

Yes, I'm running the nanobsd image, so the message about swap is really strange.
Is it possible that the box is running out of memory?
I think this should not happen unless I have some memory consuming packages installed.

The box has a Nanobsd-Image, NO packages, a very simple ruleset and no VPN activated at all.
Why are the php/fastcgi processes terminating?

Thanks!

stephenw10

I'm not sure to be honest. I've just been looking on the forum and it seems those messages can appear on a NanoBSD install if you run low on memory.
How much RAM do you have in those boxes? How much does it show as used in the dashboard?

Steve

setchi

Hi,

The Dashboard says 70% and the box has a total of 256MB and a 500MHz AMD Geode CPU.

Thanks

EDIT: And the dashboard always says 100% for CPU usage !?

stephenw10

Hmm, 100% you say. Even when you're not doing anything?

What's using all your cpu cycles? Try running at the console:

top -S

Steve

setchi

It was a process called "idlepoll", when I deactivated the device polling option the cpu load went down, but the memory problems with dying processes are the same as before - now my box doesn't even display the dashboard :(

Thanks

stephenw10

Device polling can cause some odd things, at least it did for me last time I played with it.
I would make sure you have rebooted after enabling/disabling polling, switching it didn't always behave as I expected.
Otherwise I'm out of suggestions. Add more ram? I can't really see why you should have to if you haven't started using more services. :-\

Steve

setchi

There is no possibility to add more RAM on these embedded boxes.
I found another strange message on the serial console:

Approaching the limit on PV entries, consider increasing either the vm.pmap.shpgperproc or the vm.pmap.pv_entry_max tunable.

Metu69salemi

Read this http://forum.pfsense.org/index.php/topic,38707.0.html

To ease out:
run a shell command:```
sysctl vm.pmap.shpgperproc

answer should be 200
then create a file, if it doesn't exist```
/boot/loader.conf.local

add next line in there```
vm.pmap.shpgperproc=500

save and reboot

cmb

"out of swap space" just means you're entirely out of memory. There is no swap on embedded.

You running any packages? The stock base system wouldn't exhaust 256 MB unless you had 200,000+ active connections, but there are many other things that could with packages.

wallabybob

My home firewall has been running "full" pfSense for over 3 years on 256MB of RAM and I can't recall ever seeing it even use swap. I'm currently running only two packages: siproxd and pfflowd. I have also run bandwidthd and countryblock. top currently tells me I have 92MB free.

What packages are you trying to run on that system?

Please post a snapshot of top output. Here's a snapshot of top from my system:```

last pid: 18074; load averages: 0.07, 0.04, 0.06 up 2+07:49:06 14:45:23
60 processes: 2 running, 50 sleeping, 8 zombie
CPU: 1.6% user, 0.0% nice, 4.3% system, 1.2% interrupt, 93.0% idle
Mem: 57M Active, 18M Inact, 47M Wired, 32M Buf, 92M Free
Swap: 260M Total, 260M Free

PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND
12997 root 1 44 0 3712K 2084K RUN 0:01 0.10% top
3552 root 1 76 20 3656K 1584K wait 5:17 0.00% sh
19946 root 1 64 20 3316K 1364K RUN 2:25 0.00% apinger
51310 root 1 64 20 7688K 4196K bpf 0:50 0.00% bandwidthd
50565 root 1 64 20 8712K 5232K bpf 0:50 0.00% bandwidthd
50805 root 1 44 0 7688K 3876K bpf 0:50 0.00% bandwidthd
50259 root 1 45 0 8712K 4908K bpf 0:50 0.00% bandwidthd
50479 root 1 44 0 7688K 3888K bpf 0:49 0.00% bandwidthd
50394 root 1 44 0 8712K 4196K bpf 0:49 0.00% bandwidthd
50777 root 1 64 20 8712K 4520K bpf 0:48 0.00% bandwidthd
51040 root 1 64 20 7688K 4216K bpf 0:48 0.00% bandwidthd
57601 root 1 76 0 34140K 23668K accept 0:46 0.00% php
61779 nobody 1 44 0 5556K 2676K select 0:45 0.00% dnsmasq
57693 root 1 76 0 33116K 21108K accept 0:44 0.00% php
24595 root 1 44 0 4944K 2544K select 0:42 0.00% syslogd
49099 root 1 44 0 2912K 740K bpf 0:37 0.00% pfflowd
60468 dhcpd 1 44 0 8436K 5372K select 0:28 0.00% dhcpd
50016 root 2 65 r21 7824K 5880K select 0:19 0.00% siproxd
51812 root 1 44 0 8464K 4580K select 0:19 0.00% mpd5
16171 root 1 44 0 5176K 2632K select 0:15 0.00% hostapd
28718 root 1 44 0 5912K 3040K bpf 0:12 0.00% tcpdump
52732 root 1 44 0 6588K 3500K kqread 0:12 0.00% lighttpd

Even with 8 copies of bandwidthd running (that I think shouldn't be running because I removed the package some reboots ago) I still have 92MB free memory.

I wonder if you have lots of extraneous processes. I have a vague memory of some reports of runaway process creation in some 2.0 snapshots builds.

setchi

Hi,

As mentioned before I have NO packages installed. The pfSense 1.2.3 worked flawlessly on the same hardware for a long time.
I did a upgrade to 2.0 and had device polling activated under 1.2.3 (gave me more bandwidth).
With device polling the load was at 100% on 2.0 and the webgui wasn't useable at all.

Afterwards I deactivated device polling and now the webgui works again (if you don't worry about the frequent error 500).
But the errors I mentioned above are the same, especially when the box has been up for some days.

I had to do a reboot yesterday evening, so currently my top displays the following:

last pid: 53715;  load averages:  0.20,  0.08,  0.02                                                                                            up 0+17:06:56  15:15:58
459 processes: 1 running, 458 sleeping
CPU:  0.4% user,  0.0% nice,  1.5% system,  2.3% interrupt, 95.8% idle
Mem: 126M Active, 14M Inact, 70M Wired, 5572K Cache, 34M Buf, 18M Free
Swap:

  PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
53715 root        1  45    0  4736K  2196K RUN      0:00  0.98% top
58990 root        1  76   20  3656K  1080K wait     1:04  0.00% sh
26060 root        1  44    0  4944K  1636K select   0:51  0.00% syslogd
34843 root        1  64   20  3316K   692K select   0:34  0.00% apinger
27425 root        1  44    0  3316K   740K piperd   0:22  0.00% logger
27329 root        1  44    0  5912K  2712K bpf      0:19  0.00% tcpdump
 8719 root        1  76    0 33116K 15040K accept   0:17  0.00% php
 8684 root        1  76    0 33116K 12680K accept   0:09  0.00% php
 4764 dhcpd       1  44    0  8436K  1736K select   0:09  0.00% dhcpd
 5482 root        1  76    0 33116K 12580K accept   0:07  0.00% php
 1460 root        1  76    0 33116K 12388K accept   0:06  0.00% php
 1278 root        1  76    0 33116K 15960K accept   0:04  0.00% php
48287 root        1  44    0  6588K  3528K kqread   0:03  0.00% lighttpd
 6467 nobody      1  44    0  5556K  1288K select   0:02  0.00% dnsmasq
10276 root        1  44    0  8464K  2800K select   0:02  0.00% mpd5
 2945 root        1  44    0  5116K  2984K select   0:01  0.00% openvpn
54285 root        1  45    0 33116K 12240K accept   0:01  0.00% php
14737 root        1  76    0 33116K 10740K accept   0:01  0.00% php
59858 root        1  76    0 34140K 14748K accept   0:01  0.00% php
35322 sshd        1  76    0  6616K  2756K select   0:01  0.00% sshd
 8924 root        1  76    0 33116K 14596K accept   0:01  0.00% php
57661 root        1  44    0 33116K 11596K accept   0:01  0.00% php
14508 root        1  76    0 33116K 11084K accept   0:01  0.00% php
17403 _ntp        1  64   20  3316K  1044K select   0:01  0.00% ntpd
37689 root        1  44    0  3436K  1172K select   0:00  0.00% inetd
 1200 root        1  76    0 33116K 11744K accept   0:00  0.00% php
51639 root        1  44    0  3404K   832K nanslp   0:00  0.00% cron
14983 root        1  76    0 33116K  9772K accept   0:00  0.00% php
48520 root        1  76    0 32092K  6388K wait     0:00  0.00% php
53598 root        1  76    0 32092K  6388K wait     0:00  0.00% php
53449 root        1  76    0 32092K  6388K wait     0:00  0.00% php
52507 root        1  76    0 32092K  6388K wait     0:00  0.00% php
52757 root        1  76    0 32092K  6388K wait     0:00  0.00% php
53888 root        1  76    0 32092K  6388K wait     0:00  0.00% php
44532 root        1  44    0  7992K  2924K select   0:00  0.00% sshd
53150 root        1  76    0 32092K  6388K wait     0:00  0.00% php
53426 root        1  76    0 32092K  6388K wait     0:00  0.00% php

I also wondered why there are SOO many php processes running…

Thanks for now

wallabybob

@setchi:

I also wondered why there are SOO many php processes running…

And what are they waiting for so close to their startup?

Perhaps the system log will give some clues. pfSense shell command clog /var/log/system.log can be used to display the whole log. (The web GUI Status -> System Logs normally displays only the most recent 50 lines or so.)

jimp

If you removed bandwidthd (and siproxd I see) they didn't fully come out. Try switching to the other NanoBSD slice to see if it behaves better. First make extra sure there are no packages installed under System > Packages.

It would be rather easy to run an ALIX out of RAM.

It's normal to have a couple PHP processes going, but top doesn't show you the whole story, check the output of "ps uxawww" and see if it shows more about what PHP is doing, like if it's being used to run a certain script, like rc.filter_configure_sync, etc.