Severe Problems on ALIX 2D13: high cpu load, out of memory/swap



  • Hello,

    I have two ALIX 2D13 boxes and did an upgrade from 1.2.3 to 2.0 on one box and a fresh install of 2.0 on the other box.
    Since the upgrades I have severe problems with processes (like php) dying because of low memory or swap space and the overall performance of the webgui is very poor (takes 1 minute to load pages, displays error 500 randomly, have php-related error messages on the top of the page, etc.)

    I have no additional packages installed and just some very simple filtering rules and the box has a load value between 3 and 7.
    Whenever I try to install packages, the process never finishes and the webgui stops loading on a certain point.

    I have error messages in system.log:

    Dec  3 14:55:53 gateway kernel: pid 9536 (php), uid 0, was killed: out of swap space
    Dec  3 14:55:53 gateway kernel: pid 17096 (php), uid 0, was killed: out of swap space
    Dec  3 13:56:42 gateway check_reload_status: Syncing firewall
    Dec  3 14:56:42 gateway php: /pkg_mgr_install.php: Beginning package installation for OpenVPN Client Export Utility.
    Dec  3 14:57:15 gateway kernel: pid 12524 (php), uid 0, was killed: out of swap space
    Dec  3 14:57:15 gateway kernel: pid 42267 (php), uid 0, was killed: out of swap space
    

    lighttpd.error.log

    2011-12-03 14:43:33: (mod_fastcgi.c.2566) unexpected end-of-file (perhaps the fastcgi process died): pid: 51917 socket: unix:/tmp/php-fastcgi.socket-19
    2011-12-03 14:43:33: (mod_fastcgi.c.3354) response not received, request sent: 492 on socket: unix:/tmp/php-fastcgi.socket-19 for /preload.php?, closing connection
    2011-12-03 14:46:26: (mod_fastcgi.c.2566) unexpected end-of-file (perhaps the fastcgi process died): pid: 51917 socket: unix:/tmp/php-fastcgi.socket-19
    2011-12-03 14:46:26: (mod_fastcgi.c.3369) response already sent out, but backend returned error on socket: unix:/tmp/php-fastcgi.socket-19 for /pkg_mgr.php?, terminating connection
    2011-12-03 14:50:16: (mod_fastcgi.c.2566) unexpected end-of-file (perhaps the fastcgi process died): pid: 51917 socket: unix:/tmp/php-fastcgi.socket-19
    2011-12-03 14:50:16: (mod_fastcgi.c.3369) response already sent out, but backend returned error on socket: unix:/tmp/php-fastcgi.socket-19 for /pkg_mgr_install.php?id=OpenVPN%20Client%20Export%20Utility, terminating connection
    

    Anyone has similar issues? Any hints on how to find out the problem?
    Thanks in between.

    EDIT: I should have mentioned that I use 4GB CF-Cards ;)


  • Netgate Administrator

    Running from CF you should be using the NanoBSD images which don't have swap at all so I'm surprised to see that.
    Are you running NanoBSD?

    Steve



  • Yes, I'm running the nanobsd image, so the message about swap is really strange.
    Is it possible that the box is running out of memory?
    I think this should not happen unless I have some memory consuming packages installed.

    The box has a Nanobsd-Image, NO packages, a very simple ruleset and no VPN activated at all.
    Why are the php/fastcgi processes terminating?

    Thanks!


  • Netgate Administrator

    I'm not sure to be honest. I've just been looking on the forum and it seems those messages can appear on a NanoBSD install if you run low on memory.
    How much RAM do you have in those boxes? How much does it show as used in the dashboard?

    Steve



  • Hi,

    The Dashboard says 70% and the box has a total of 256MB and a 500MHz AMD Geode CPU.

    Thanks

    EDIT: And the dashboard always says 100% for CPU usage !?


  • Netgate Administrator

    Hmm, 100% you say. Even when you're not doing anything?

    What's using all your cpu cycles? Try running at the console:

    top -S
    

    Steve



  • It was a process called "idlepoll", when I deactivated the device polling option the cpu load went down, but the memory problems with dying processes are the same as before - now my box doesn't even display the dashboard :(

    Thanks


  • Netgate Administrator

    Device polling can cause some odd things, at least it did for me last time I played with it.
    I would make sure you have rebooted after enabling/disabling polling, switching it didn't always behave as I expected.
    Otherwise I'm out of suggestions. Add more ram? I can't really see why you should have to if you haven't started using more services.  :-\

    Steve



  • There is no possibility to add more RAM on these embedded boxes.
    I found another strange message on the serial console:

    Approaching the limit on PV entries, consider increasing either the vm.pmap.shpgperproc or the vm.pmap.pv_entry_max tunable.
    


  • Read this http://forum.pfsense.org/index.php/topic,38707.0.html

    To ease out:
    run  a shell command:```
    sysctl vm.pmap.shpgperproc

    answer should be 200
    then create a file, if it doesn't exist```
    /boot/loader.conf.local
    

    add next line in there```
    vm.pmap.shpgperproc=500

    save and reboot


  • "out of swap space" just means you're entirely out of memory. There is no swap on embedded.

    You running any packages? The stock base system wouldn't exhaust 256 MB unless you had 200,000+ active connections, but there are many other things that could with packages.



  • My home firewall has been running "full" pfSense for over 3 years on 256MB of RAM and I can't recall ever seeing it even use swap. I'm currently running only two packages: siproxd and pfflowd. I have also run bandwidthd and countryblock. top currently tells me I have 92MB free.

    What packages are you trying to run on that system?

    Please post a snapshot of top output. Here's a snapshot of top from my system:```

    last pid: 18074;  load averages:  0.07,  0.04,  0.06                                                up 2+07:49:06  14:45:23
    60 processes:  2 running, 50 sleeping, 8 zombie
    CPU:  1.6% user,  0.0% nice,  4.3% system,  1.2% interrupt, 93.0% idle
    Mem: 57M Active, 18M Inact, 47M Wired, 32M Buf, 92M Free
    Swap: 260M Total, 260M Free

    PID USERNAME  THR PRI NICE  SIZE    RES STATE    TIME  WCPU COMMAND
    12997 root        1  44    0  3712K  2084K RUN      0:01  0.10% top
    3552 root        1  76  20  3656K  1584K wait    5:17  0.00% sh
    19946 root        1  64  20  3316K  1364K RUN      2:25  0.00% apinger
    51310 root        1  64  20  7688K  4196K bpf      0:50  0.00% bandwidthd
    50565 root        1  64  20  8712K  5232K bpf      0:50  0.00% bandwidthd
    50805 root        1  44    0  7688K  3876K bpf      0:50  0.00% bandwidthd
    50259 root        1  45    0  8712K  4908K bpf      0:50  0.00% bandwidthd
    50479 root        1  44    0  7688K  3888K bpf      0:49  0.00% bandwidthd
    50394 root        1  44    0  8712K  4196K bpf      0:49  0.00% bandwidthd
    50777 root        1  64  20  8712K  4520K bpf      0:48  0.00% bandwidthd
    51040 root        1  64  20  7688K  4216K bpf      0:48  0.00% bandwidthd
    57601 root        1  76    0 34140K 23668K accept  0:46  0.00% php
    61779 nobody      1  44    0  5556K  2676K select  0:45  0.00% dnsmasq
    57693 root        1  76    0 33116K 21108K accept  0:44  0.00% php
    24595 root        1  44    0  4944K  2544K select  0:42  0.00% syslogd
    49099 root        1  44    0  2912K  740K bpf      0:37  0.00% pfflowd
    60468 dhcpd      1  44    0  8436K  5372K select  0:28  0.00% dhcpd
    50016 root        2  65  r21  7824K  5880K select  0:19  0.00% siproxd
    51812 root        1  44    0  8464K  4580K select  0:19  0.00% mpd5
    16171 root        1  44    0  5176K  2632K select  0:15  0.00% hostapd
    28718 root        1  44    0  5912K  3040K bpf      0:12  0.00% tcpdump
    52732 root        1  44    0  6588K  3500K kqread  0:12  0.00% lighttpd

    Even with 8 copies of bandwidthd running (that I think shouldn't be running because I removed the package some reboots ago) I still have 92MB free memory.
    
    I wonder if you have lots of extraneous processes. I have a vague memory of some reports of runaway process creation in some 2.0 snapshots builds.


  • Hi,

    As mentioned before I have NO packages installed. The pfSense 1.2.3 worked flawlessly on the same hardware for a long time.
    I did a upgrade to 2.0 and had device polling activated under 1.2.3 (gave me more bandwidth).
    With device polling the load was at 100% on 2.0 and the webgui wasn't useable at all.

    Afterwards I deactivated device polling and now the webgui works again (if you don't worry about the frequent error 500).
    But the errors I mentioned above are the same, especially when the box has been up for some days.

    I had to do a reboot yesterday evening, so currently my top displays the following:

    last pid: 53715;  load averages:  0.20,  0.08,  0.02                                                                                            up 0+17:06:56  15:15:58
    459 processes: 1 running, 458 sleeping
    CPU:  0.4% user,  0.0% nice,  1.5% system,  2.3% interrupt, 95.8% idle
    Mem: 126M Active, 14M Inact, 70M Wired, 5572K Cache, 34M Buf, 18M Free
    Swap:
    
      PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
    53715 root        1  45    0  4736K  2196K RUN      0:00  0.98% top
    58990 root        1  76   20  3656K  1080K wait     1:04  0.00% sh
    26060 root        1  44    0  4944K  1636K select   0:51  0.00% syslogd
    34843 root        1  64   20  3316K   692K select   0:34  0.00% apinger
    27425 root        1  44    0  3316K   740K piperd   0:22  0.00% logger
    27329 root        1  44    0  5912K  2712K bpf      0:19  0.00% tcpdump
     8719 root        1  76    0 33116K 15040K accept   0:17  0.00% php
     8684 root        1  76    0 33116K 12680K accept   0:09  0.00% php
     4764 dhcpd       1  44    0  8436K  1736K select   0:09  0.00% dhcpd
     5482 root        1  76    0 33116K 12580K accept   0:07  0.00% php
     1460 root        1  76    0 33116K 12388K accept   0:06  0.00% php
     1278 root        1  76    0 33116K 15960K accept   0:04  0.00% php
    48287 root        1  44    0  6588K  3528K kqread   0:03  0.00% lighttpd
     6467 nobody      1  44    0  5556K  1288K select   0:02  0.00% dnsmasq
    10276 root        1  44    0  8464K  2800K select   0:02  0.00% mpd5
     2945 root        1  44    0  5116K  2984K select   0:01  0.00% openvpn
    54285 root        1  45    0 33116K 12240K accept   0:01  0.00% php
    14737 root        1  76    0 33116K 10740K accept   0:01  0.00% php
    59858 root        1  76    0 34140K 14748K accept   0:01  0.00% php
    35322 sshd        1  76    0  6616K  2756K select   0:01  0.00% sshd
     8924 root        1  76    0 33116K 14596K accept   0:01  0.00% php
    57661 root        1  44    0 33116K 11596K accept   0:01  0.00% php
    14508 root        1  76    0 33116K 11084K accept   0:01  0.00% php
    17403 _ntp        1  64   20  3316K  1044K select   0:01  0.00% ntpd
    37689 root        1  44    0  3436K  1172K select   0:00  0.00% inetd
     1200 root        1  76    0 33116K 11744K accept   0:00  0.00% php
    51639 root        1  44    0  3404K   832K nanslp   0:00  0.00% cron
    14983 root        1  76    0 33116K  9772K accept   0:00  0.00% php
    48520 root        1  76    0 32092K  6388K wait     0:00  0.00% php
    53598 root        1  76    0 32092K  6388K wait     0:00  0.00% php
    53449 root        1  76    0 32092K  6388K wait     0:00  0.00% php
    52507 root        1  76    0 32092K  6388K wait     0:00  0.00% php
    52757 root        1  76    0 32092K  6388K wait     0:00  0.00% php
    53888 root        1  76    0 32092K  6388K wait     0:00  0.00% php
    44532 root        1  44    0  7992K  2924K select   0:00  0.00% sshd
    53150 root        1  76    0 32092K  6388K wait     0:00  0.00% php
    53426 root        1  76    0 32092K  6388K wait     0:00  0.00% php
    
    

    I also wondered why there are SOO many php processes running…

    Thanks for now



  • @setchi:

    I also wondered why there are SOO many php processes running…

    And what are they waiting for so close to their startup?

    Perhaps the system log will give some clues. pfSense shell command clog /var/log/system.log can be used to display the whole log. (The web GUI Status -> System Logs normally displays only the most recent 50 lines or so.)


  • Rebel Alliance Developer Netgate

    If you removed bandwidthd (and siproxd I see) they didn't fully come out. Try switching to the other NanoBSD slice to see if it behaves better. First make extra sure there are no packages installed under System > Packages.

    It would be rather easy to run an ALIX out of RAM.

    It's normal to have a couple PHP processes going, but top doesn't show you the whole story, check the output of "ps uxawww" and see if it shows more about what PHP is doing, like if it's being used to run a certain script, like rc.filter_configure_sync, etc.


Log in to reply