Hanging webGUI fix

daq

I searched the forum but couldn't find any working suggestions for restarting the webGUI after it crashes, so I thought this could help someone. It's not enough to kill lighttpd, because it will hang again as soon as you start it. You also need to remove the PHP sockets from /tmp:


srwxr-xr-x  1 root  wheel       0 Apr 23 18:59 php-fastcgi.socket-0
srwxr-xr-x  1 root  wheel       0 Apr 23 18:59 php-fastcgi.socket-1

So the complete solution is:

Optionally confirm that you've got a few zombies:

ps -ajx |grep Z

You can then kill by PPID (3rd column) or just:

killall -9 php; killall -9 lighttpd

Delete the sockets:

rm /tmp/php-fastcgi.socket*

Restart lighttpd:

/etc/rc.restart_webgui

Restarting the entire box will work too, but sometimes it's not an option.
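
For what it's worth, the steps above could be wrapped in a small script. This is only a sketch: `filter_zombies`, `run`, `restart_webgui` and the `DRY_RUN` variable are names made up for illustration, and the awk filter assumes the FreeBSD `ps -j` column layout (USER PID PPID PGID SID JOBC STAT ...), where STAT is the 7th field.

```shell
#!/bin/sh
# Sketch only: wraps the manual recovery steps in functions.
# DRY_RUN is a hypothetical knob for this example, not a pfSense setting.

# Keep the header line plus rows whose STAT field starts with Z (zombies).
# Reads `ps -ajx` output on stdin; assumes STAT is column 7 (FreeBSD layout).
filter_zombies() {
    awk 'NR == 1 || $7 ~ /^Z/'
}

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Kill the hung processes, remove the stale sockets, restart the webGUI.
restart_webgui() {
    run killall -9 php
    run killall -9 lighttpd
    run rm -f /tmp/php-fastcgi.socket*
    run /etc/rc.restart_webgui
}
```

Usage would be something like `ps -ajx | filter_zombies` to check for zombies, then `DRY_RUN=1 restart_webgui` to see what would happen before running `restart_webgui` for real. Filtering on the STAT column avoids the false hits that a plain `grep Z` gives on any line containing a capital Z.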

cmb

You must be on 2.0.2. The better fix is to upgrade to 2.0.3, where fastcgi doesn't crash anymore.

daq

I am on 2.0.3. Still crashes regularly.

marama

@daq:

I am on 2.0.3. Still crashes regularly.

I am also running 2.0.3, nano on Alix boards (have tried 2 x D13 boards).
It crashes every few minutes. I found no entries in the logs.
Am very sad.

kejianshi

I'd love to know if there is some common factor among the few people whose web GUI is crashing.

Memory maxed out? Certain combination of packages? Web Gui mods? Hardware? Install type? Anything?

Because 2.0.3 for me is rock solid.

doktornotor

On Alix? Yeah, out of RAM plus insane packages installed (such as squid, snort etc.)

kejianshi

Hmmmm. Well I suppose treating an Alix build like it had dual xeon processors and 64GB of RAM would do that…

marama

@doktornotor:

On Alix? Yeah, out of RAM plus insane packages installed (such as squid, snort etc.)

No, no packages installed. I've disabled local logging, am using remote syslog.

Aug 26 13:27:27 xxx.xxx.xxx.xxx check_reload_status: Syncing firewall
Aug 26 13:27:28 xxx.xxx.xxx.xxx lighttpd[56597]: (server.c.1546) server stopped by UID = 0 PID = 61135
Aug 26 13:27:28 xxx.xxx.xxx.xxx lighttpd[56597]: (server.c.1546) server stopped by UID = 0 PID = 61135
Aug 26 13:27:28 xxx.xxx.xxx.xxx minicron: (/etc/rc.prunecaptiveportal) terminated by signal 15 (Terminated: 15)
Aug 26 13:27:28 xxx.xxx.xxx.xxx kernel: IP firewall unloaded
Aug 26 13:27:28 xxx.xxx.xxx.xxx check_reload_status: Reloading filter
Aug 26 13:27:33 xxx.xxx.xxx.xxx php: : MONITOR: GW3G is down, removing from routing group
Aug 26 13:27:53 xxx.xxx.xxx.xxx sshd[17069]: fatal: Write failed: Operation not permitted
Aug 26 13:27:53 xxx.xxx.xxx.xxx sshd[17069]: fatal: Write failed: Operation not permitted

SSH also seems to "crash" (my SSH connection is terminated).

I either wait or do this in order to bring the webGUI back up:

killall -9 php; killall -9 lighttpd; /etc/rc.restart_webgui

The only package I have installed is darkstat; I will uninstall it and watch the behavior. Actually, darkstat also "crashes" when I choose more than one interface to monitor, so I am hoping it might be the cause of the webGUI crashing.

Name: XXX
Version: 2.0.3-RELEASE (i386), built on Fri Apr 12 10:22:18 EDT 2013, FreeBSD 8.1-RELEASE-p13 (You are on the latest version.)
Platform: nanobsd (4g)
NanoBSD Boot Slice: pfsense0 / ad0s1
CPU Type: Geode(TM) Integrated Processor by AMD PCS
Uptime:
Current date/time: Mon Aug 26 13:36:31 CEST 2013
DNS server(s): 127.0.0.1, xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx
Last config change: Mon Aug 26 13:34:00 CEST 2013
State table size:
MBUF Usage: 646/8640
CPU usage: 2%
Memory usage: 32%
Disk usage: 9%

top:
CPU: 9.6% user, 0.0% nice, 5.1% system, 0.0% interrupt, 85.4% idle
Mem: 32M Active, 43M Inact, 44M Wired, 34M Buf, 115M Free

I don't think RAM or CPU is the problem, but this info is from after I uninstalled DarkStat. I'll watch and let you know. The board is a "PC Engines ALIX.2D13" (500 MHz AMD Geode LX800 CPU, 256 MB SDRAM). BTW, I was using DarkStat to trace the heavy IP usage (LAN and WAN), but was not happy that I couldn't monitor more than one interface at a time. Are there any good alternatives? I just need to see the top X heavy users by source and destination IP. pfTop was too much info (I didn't get around in it well), and I have tried bandwidthd, but was not so happy with it either. What do you guys use?

In case Alix is too weak, a Soekris would also be affordable. I am not happy with the Alix throughput anyway (50-80 Mb/s); I need to be able to reach much more speed LAN <=> DMZ. Would I be better off with a Soekris net6501-30 (600 MHz) or maybe a net6501-50 (1 GHz Atom)? I wouldn't like to oversize it. We are using an ASA 5510 for VPN, so I just need the routing and some basic monitoring.

doktornotor

1/ Considering the lighttpd gets stopped right after this:


Aug 26 13:27:28 xxx.xxx.xxx.xxx minicron: (/etc/rc.prunecaptiveportal) terminated by signal 15 (Terminated: 15)

you might turn off the captive portal as well. Does not even look like it's killed, just regular stop.

2/ Darkstat can ONLY listen on one interface; at least until it gets updated.

P.S. Using exact same board on multiple places with 2.1RC with no GUI crashes at all.

marama

@doktornotor:

1/ Considering the lighttpd gets stopped right after this:
Aug 26 13:27:28 xxx.xxx.xxx.xxx minicron: (/etc/rc.prunecaptiveportal) terminated by signal 15 (Terminated: 15)
you might turn off the captive portal as well. Does not even look like it's killed, just regular stop.

2/ Darkstat can ONLY listen on one interface; at least until it gets updated.

P.S. Using exact same board on multiple places with 2.1RC with no GUI crashes at all.

OK, will give it a try. I've just had another crash, so Darkstat is not causing the problems.

Aug 26 13:48:48 xxx.xxx.xxx.xxx check_reload_status: Syncing firewall
Aug 26 13:48:49 xxx.xxx.xxx.xxx logportalauth[27062]: Restarting captive portal.
Aug 26 13:48:49 xxx.xxx.xxx.xxx kernel: ipfw2 (+ipv6) initialized, divert loadable, nat loadable, rule-based forwarding enabled, default to accept, logging disabled
Aug 26 13:48:51 xxx.xxx.xxx.xxx check_reload_status: Reloading filter
Aug 26 13:48:56 xxx.xxx.xxx.xxx php: : MONITOR: GW3G is down, removing from routing group
Aug 26 13:49:16 xxx.xxx.xxx.xxx logportalauth[1184]: LOGIN: orbit, 00:0c:29:ca:be:91, 172.16.0.100
Aug 26 13:50:44 xxx.xxx.xxx.xxx logportalauth[1184]: FAILURE: orbit, 00:0c:29:ca:be:91, 172.16.0.100
Aug 26 13:51:06 xxx.xxx.xxx.xxx lighttpd[63368]: (server.c.1546) server stopped by UID = 0 PID = 54459
Aug 26 13:51:06 xxx.xxx.xxx.xxx lighttpd[63368]: (server.c.1546) server stopped by UID = 0 PID = 54459
Aug 26 13:51:06 xxx.xxx.xxx.xxx check_reload_status: Syncing firewall
Aug 26 13:51:07 xxx.xxx.xxx.xxx minicron: (/etc/rc.prunecaptiveportal) terminated by signal 15 (Terminated: 15)
Aug 26 13:51:07 xxx.xxx.xxx.xxx kernel: IP firewall unloaded
Aug 26 13:51:08 xxx.xxx.xxx.xxx check_reload_status: Reloading filter
Aug 26 13:51:13 xxx.xxx.xxx.xxx logportalauth[27062]: Restarting captive portal.
Aug 26 13:51:13 xxx.xxx.xxx.xxx kernel: ipfw2 (+ipv6) initialized, divert loadable, nat loadable, rule-based forwarding enabled, default to accept, logging disabled
Aug 26 13:51:15 xxx.xxx.xxx.xxx php: : MONITOR: GW3G is down, removing from routing group
Aug 26 13:51:18 xxx.xxx.xxx.xxx lighttpd[25214]: (network_writev.c.112) writev failed: Operation not permitted 14
Aug 26 13:51:18 xxx.xxx.xxx.xxx lighttpd[25214]: (network_writev.c.112) writev failed: Operation not permitted 14
Aug 26 13:51:18 xxx.xxx.xxx.xxx lighttpd[25214]: (connections.c.637) connection closed: write failed on fd 14
Aug 26 13:51:18 xxx.xxx.xxx.xxx lighttpd[25214]: (connections.c.637) connection closed: write failed on fd 14

I am turning the captive portal off (I just set it up 10 minutes ago, and I've had webGUI crashes for days, so CP alone is also not the problem).

doktornotor

Something is regularly STOPPING your webserver.

Aug 26 13:51:06 xxx.xxx.xxx.xxx lighttpd[63368]: (server.c.1546) server stopped by UID = 0 PID = 54459
Aug 26 13:51:06 xxx.xxx.xxx.xxx lighttpd[63368]: (server.c.1546) server stopped by UID = 0 PID = 54459

So you need to find out which process has the PID that appears on those log lines.
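
As a rough sketch of the first step, something like this would pull the PIDs out of the log. `stopper_pids` is a made-up function name, and the pattern only assumes the "server stopped by UID = ... PID = ..." wording seen in the lines above.

```shell
#!/bin/sh
# Sketch: print the unique PIDs named in lighttpd "server stopped by" lines.
# Reads syslog text on stdin; stopper_pids is a hypothetical helper name.
stopper_pids() {
    sed -n 's/.*server stopped by UID = [0-9][0-9]* PID = \([0-9][0-9]*\).*/\1/p' | sort -u
}
```

Feed it the log from the remote syslog server, e.g. `stopper_pids < /path/to/remote-syslog.log` (path hypothetical). While such a process is still alive, `ps -p <pid> -o pid,ppid,command` shows what it is and who started it. Note that the PID differs between the two stops logged above (61135 and 54459), so it is probably a short-lived process that will already be gone by the time you look.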

Also, from that log, I cannot see how you disabled the captive portal; the log suggests pretty clearly that it is NOT disabled at all.


Aug 26 13:48:49 xxx.xxx.xxx.xxx logportalauth[27062]: Restarting captive portal.
Aug 26 13:49:16 xxx.xxx.xxx.xxx logportalauth[1184]: LOGIN: orbit, 00:0c:29:ca:be:91, 172.16.0.100
Aug 26 13:50:44 xxx.xxx.xxx.xxx logportalauth[1184]: FAILURE: orbit, 00:0c:29:ca:be:91, 172.16.0.100
Aug 26 13:51:07 xxx.xxx.xxx.xxx minicron: (/etc/rc.prunecaptiveportal) terminated by signal 15 (Terminated: 15)
Aug 26 13:51:13 xxx.xxx.xxx.xxx logportalauth[27062]: Restarting captive portal.

You also apparently have some networking issues:


Aug 26 13:51:15 xxx.xxx.xxx.xxx php: : MONITOR: GW3G is down, removing from routing group

before the lighttpd closes the connection:


Aug 26 13:51:18 xxx.xxx.xxx.xxx lighttpd[25214]: (network_writev.c.112) writev failed: Operation not permitted 14
Aug 26 13:51:18 xxx.xxx.xxx.xxx lighttpd[25214]: (network_writev.c.112) writev failed: Operation not permitted 14
Aug 26 13:51:18 xxx.xxx.xxx.xxx lighttpd[25214]: (connections.c.637) connection closed: write failed on fd 14
Aug 26 13:51:18 xxx.xxx.xxx.xxx lighttpd[25214]: (connections.c.637) connection closed: write failed on fd 14

(Well, no wonder it seems that it "crashes" when your network is down.)

marama

@doktornotor:

Something is regularly STOPPING your webserver.

Aug 26 13:51:06 xxx.xxx.xxx.xxx lighttpd[63368]: (server.c.1546) server stopped by UID = 0 PID = 54459
Aug 26 13:51:06 xxx.xxx.xxx.xxx lighttpd[63368]: (server.c.1546) server stopped by UID = 0 PID = 54459

So you need to find out which process has the PID that appears on those log lines.

Hm, how could I do that? I tried running top and looking for the PID on the screen when a crash occurs (just did), but the PID was not on the screen. Either the killer PID was low in usage so it didn't show in top, or it was an ad-hoc process that didn't exist before. Can I somehow send PID info to the syslog server whenever a process is created? Simply cronjobbing "ps -aux" will probably not be effective.
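
One rough idea would be a loop that samples the process table far more often than cron can. This is only a sketch under assumptions: `snapshot_loop` and the `PS_CMD` override are made-up names, the default `ps` invocation uses FreeBSD syntax, and a process that lives for less than the interval can still slip through.

```shell
#!/bin/sh
# Sketch: append timestamped process snapshots to a file so a short-lived
# PID can be looked up after the fact. All names here are hypothetical.

# PS_CMD is an illustrative override; the default lists every process
# with its parent PID and command line (FreeBSD ps syntax).
PS_CMD=${PS_CMD:-"ps -ax -o pid,ppid,user,command"}

# snapshot_loop FILE [INTERVAL_SECONDS] [COUNT]
# COUNT=0 (the default) means run until interrupted.
snapshot_loop() {
    out=$1
    interval=${2:-1}
    count=${3:-0}
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        {
            date '+%Y-%m-%d %H:%M:%S'
            $PS_CMD    # unquoted on purpose so the words split
        } >> "$out"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

It could be started with `snapshot_loop /tmp/ps-snapshots.log 1 &` and, after the next crash, searched with `grep -w <pid> /tmp/ps-snapshots.log` using the PID from the lighttpd log line. The PPID column then points at whatever started the killer. The file grows quickly, so on a small box it should not be left running for long.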

BTW, since there is no point in using Alix board if I cannot use Darkstat or Captive Portal, I turned both of them on and am looking for the killer-PID.

Also, from that log, I cannot see how you disabled the captive portal; the log suggests pretty clearly that it is NOT disabled at all.

Sorry for not being clear: the log was made BEFORE I turned the captive portal off, which explains the CP entries in the log. Just for clarification, an Alix board should be able to cope with captive portal, right?

You also apparently have some networking issues:
Aug 26 13:51:15 xxx.xxx.xxx.xxx php: : MONITOR: GW3G is down, removing from routing group

I could of course remove the Gateway Group entry, but the line is there every time I have a crash. I will remove it and watch.

before the lighttpd closes the connection:


Aug 26 13:51:18 xxx.xxx.xxx.xxx lighttpd[25214]: (network_writev.c.112) writev failed: Operation not permitted 14
Aug 26 13:51:18 xxx.xxx.xxx.xxx lighttpd[25214]: (network_writev.c.112) writev failed: Operation not permitted 14
Aug 26 13:51:18 xxx.xxx.xxx.xxx lighttpd[25214]: (connections.c.637) connection closed: write failed on fd 14
Aug 26 13:51:18 xxx.xxx.xxx.xxx lighttpd[25214]: (connections.c.637) connection closed: write failed on fd 14

(Well, no wonder it seems that it "crashes" when your network is down.)

It was a dirty log; I don't remember what I was doing at that exact moment. But thanks for the hint, I will try to exclude the networking issue by taking the pfSense box and attaching it to my workstation directly. Though we have some 60 workstations on the switch with no problems, and no firewall rules or limiter entries that could cause problems from the pfSense side.

Thanx for helping me out ;)

doktornotor

Please, post something useful, not "dirty" logs. The log shows that your network crashes and that's pretty much it. I'd suggest wiping the config and restarting from scratch.

marama

@doktornotor:

Please, post something useful, not "dirty" logs. The log shows that your network crashes and that's pretty much it. I'd suggest wiping the config and restarting from scratch.

With "dirty" I meant that I was constantly restarting services (darkstat, captive portal…), so I was not sure whether I had made the logs dirty by my own actions.
The installation is from scratch; I only had Darkstat running for a few days and tried to get the captive portal running today. I also have some basic port forwards. No additional services or weird settings. I tried the failover route configuration; it was running on WAN and it was no rocket science, so I excluded it as a cause of my WebGUI problems, but in order to eliminate causes I removed it too (as you indirectly suggested). I am out of the office now, but I have another CF card I will copy pfSense to and will give it a go from scratch tomorrow. I will let you know how it goes.

For me it would be important to have an idea of whether I am overstretching the hardware. Should an Alix board be able to handle a 12 Mbit internet connection, some firewall rules, RRD graphs (default ones), captive portal, DHCP server, DNS forwarder, Darkstat monitoring and gateway failover, maybe 5 VLANs? An Alix board should be able to handle that easily, right?

Anyway, thnx a lot for helping. My main suspect is the gateway I removed, will test tomorrow and do a clean install if necessary. Will let you know how it goes.

kejianshi

Perhaps you have a hardware problem? Something about to fail?

doktornotor

@kejianshi:

Perhaps you have a hardware problem? Something about to fail?

Most likely. Though, with statements like "The installation is from scratch" and "My main suspect is the gateway I removed"… ::)

kejianshi

doktornotor - You missed your calling as depression counselor… :D

doktornotor

@kejianshi:

doktornotor - You missed your calling as depression counselor… :D

marama

@doktornotor:

@kejianshi:

Perhaps you have a hardware problem? Something about to fail?

Most likely. Though, with statements like "The installation is from scratch" and "My main suspect is the gateway I removed"… ::)

Hardware… I hope not, because it would be difficult to diagnose. I hope it's a topology configuration problem (outside pfSense), but I still don't know. We have a 3-line DSL that has one line down, and we are waiting for the ISP to fix that. But it's on the WAN side, so I don't think it could crash the WebGUI just like that. The thing with the gateway is that I had a gateway on LAN: the main gateway on the WAN side and the failover on the LAN side. I tested the failover and it worked, but I might have concluded too easily that it's the right way to do it. Anyway, I bought another managed switch, so I will be able to get 2 WAN ports and have a clean installation on that side. I think there is not much point in pursuing the WebGUI issues before making sure the environment is the right one. But I'll repeat once more: I've really done nothing "unusual" to pfSense, and I wouldn't expect much from an install from scratch, since the few settings I've made shouldn't be crashing the WebGUI.
I hope it was the failover gateway configuration causing the problems.

kejianshi

I think having all your gateways on WAN is a good idea…

Hope it goes well when you get your new equipment.