Captive portal random deaths

carzin

I've been running pfSense for years as a captive portal for wireless onboarding at a very large U.S. university. The platform is incredibly good at what it does, but I have been plagued by seemingly random captive portal deaths for as long as I have been running it. This is independent of platform or build (I had this problem when it was physical, and I still have it now that my 4 instances are virtualized…).

This is what it looks like to the user... They connect to the onboarding SSID, get their IP address, DNS and gateway (all pointing to the pfSense LAN interface). Normally, when they open a browser, they are redirected to an authentication page, and upon authenticating, they are redirected to another site. Very standard. In my configuration, I use the DNS Forwarder service.

When it fails, they open a website and are not redirected. I am still able to log in to the pfSense management portal, and the RRD graphs show that at some point the captive portal stopped authenticating users. A simple reboot of the box brings things back to life.

I thought that possibly the Captive Portal process or the dnsmasq process was stopping, so I installed the service watchdog and had it monitor and automatically recycle those two processes if they fail. So far I have never had the chance to have a user run nslookup to check whether DNS is responding when the redirect page fails to come up (it's on the list for next time).

What else do I need to look for? I don't especially enjoy getting paged at random times with someone telling me the campus is unable to onboard.

The boxes are running with 8 gigs of memory, and 2 cores, and are heavily underutilized.

cal2600

I have the same issue here. It just happened 5 minutes ago when I was in the management portal. I did not get disconnected but all of my portal users did.

cmb

A reboot can fix all kinds of things unrelated to the firewall itself (IP conflicts, MAC conflicts, switch or hardware issues, among other possibilities), so you'll need to troubleshoot further. Packet capture on the interface where the clients reside, what do you see? Anything in the system or portal auth logs?

@cal2600:

I have the same issue here. It just happened 5 minutes ago when I was in the management portal. I did not get disconnected but all of my portal users did.

You almost certainly don't have the same issue. Please start a new thread describing what you were doing and what happened. If you're on an old version (IIRC pre-2.2.x, but maybe pre-2.1.x), all users will be disconnected upon making any changes in the captive portal config, so that might be all that happened there.

carzin

Ok. So we had another event yesterday. The captive portal was throwing Internal Server Error. This was the top 50 from the server log. What do I need to look at next?

Last 50 system log entries
Oct 26 15:21:21 lighttpd[32515]: (mod_fastcgi.c.1754) connect failed: Connection refused on unix:/tmp/php-fastcgi-cpzone.socket-2
Oct 26 15:21:21 lighttpd[32515]: (mod_fastcgi.c.3021) backend died; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 1 load: 1085
Oct 26 15:21:21 lighttpd[32515]: (mod_fastcgi.c.1754) connect failed: Connection refused on unix:/tmp/php-fastcgi-cpzone.socket-0
Oct 26 15:21:21 lighttpd[32515]: (mod_fastcgi.c.3021) backend died; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 2 load: 1085
Oct 26 15:21:21 lighttpd[32515]: (mod_fastcgi.c.3587) all handlers for /index.php?zone=cpzone&redirurl=/sj/data.gif?intype=32&andver=5.0&rom=0&actionname=kbd_main&kong=0&imei=351776064868573&mcc=310&serial=b888b894&root=0&prodid=2&channel=10000014&kvercode=4171000&androidid=221da9d43d0fbaaa&pid=3517760648685737445ee16a99140d388c9ae9ca3046d34&did=ims8xwf5fexpfhgwtmlsw54jqwhl&mac=48:5A:3F:03:6A:3F&busi_type=2&intime=20140502&newer=0&osver=21&cl=en&click=charging_dialog_show&display=10801920&brand=samsung&mode=SM-N9005&kbdver=4.17.1&gaid=7e7ba8fd-f29b-4656-a0a1-9e864c89df3c on .php are down.
Oct 26 15:21:23 php-fpm[93200]: /index.php: Successful login for user 'admin' from: REMOVED
Oct 26 15:21:23 php-fpm[93200]: /index.php: Successful login for user 'admin' from: REMOVED
Oct 26 15:21:24 lighttpd[32515]: (mod_fastcgi.c.2779) fcgi-server re-enabled: 0 /tmp/php-fastcgi-cpzone.socket
Oct 26 15:21:24 lighttpd[32515]: (mod_fastcgi.c.2779) fcgi-server re-enabled: 0 /tmp/php-fastcgi-cpzone.socket
Oct 26 15:21:24 lighttpd[32515]: (mod_fastcgi.c.2779) fcgi-server re-enabled: 0 /tmp/php-fastcgi-cpzone.socket
Oct 26 15:21:24 lighttpd[32515]: (mod_fastcgi.c.1754) connect failed: Connection refused on unix:/tmp/php-fastcgi-cpzone.socket-4
Oct 26 15:21:24 lighttpd[32515]: (mod_fastcgi.c.3021) backend died; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 0 load: 1085
Oct 26 15:21:24 lighttpd[32515]: (mod_fastcgi.c.1754) connect failed: Connection refused on unix:/tmp/php-fastcgi-cpzone.socket-2
Oct 26 15:21:24 lighttpd[32515]: (mod_fastcgi.c.3021) backend died; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 1 load: 1085
Oct 26 15:21:24 lighttpd[32515]: (mod_fastcgi.c.1754) connect failed: Connection refused on unix:/tmp/php-fastcgi-cpzone.socket-0
Oct 26 15:21:24 lighttpd[32515]: (mod_fastcgi.c.3021) backend died; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 2 load: 1085
Oct 26 15:21:24 lighttpd[32515]: (mod_fastcgi.c.3587) all handlers for /index.php?zone=cpzone&redirurl=/sj/data.gif?intype=32&andver=5.0&rom=0&actionname=kbd_main&kong=0&imei=351776064868573&mcc=310&serial=b888b894&root=0&prodid=2&channel=10000014&kvercode=4171000&androidid=221da9d43d0fbaaa&pid=3517760648685737445ee16a99140d388c9ae9ca3046d34&did=ims8xwf5fexpfhgwtmlsw54jqwhl&mac=48:5A:3F:03:6A:3F&busi_type=2&intime=20140502&newer=0&osver=21&cl=en&click=REPORT_ACTIVE_UM_V5&display=10801920&brand=samsung&mode=SM-N9005&kbdver=4.17.1&gaid=7e7ba8fd-f29b-4656-a0a1-9e864c89df3c on .php are down.
Oct 26 15:21:24 lighttpd[32515]: (request.c.1125) POST-request, but content-length missing -> 411
Oct 26 15:21:27 lighttpd[32515]: (mod_fastcgi.c.2779) fcgi-server re-enabled: 0 /tmp/php-fastcgi-cpzone.socket
Oct 26 15:21:27 lighttpd[32515]: (mod_fastcgi.c.2779) fcgi-server re-enabled: 0 /tmp/php-fastcgi-cpzone.socket
Oct 26 15:21:27 lighttpd[32515]: (mod_fastcgi.c.2779) fcgi-server re-enabled: 0 /tmp/php-fastcgi-cpzone.socket
Oct 26 15:21:27 lighttpd[32515]: (mod_fastcgi.c.1754) connect failed: Connection refused on unix:/tmp/php-fastcgi-cpzone.socket-4
Oct 26 15:21:27 lighttpd[32515]: (mod_fastcgi.c.3021) backend died; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 0 load: 1085
Oct 26 15:21:27 lighttpd[32515]: (mod_fastcgi.c.1754) connect failed: Connection refused on unix:/tmp/php-fastcgi-cpzone.socket-2
Oct 26 15:21:27 lighttpd[32515]: (mod_fastcgi.c.3021) backend died; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 1 load: 1085
Oct 26 15:21:27 lighttpd[32515]: (mod_fastcgi.c.1754) connect failed: Connection refused on unix:/tmp/php-fastcgi-cpzone.socket-0
Oct 26 15:21:27 lighttpd[32515]: (mod_fastcgi.c.3021) backend died; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 2 load: 1085
Oct 26 15:21:27 lighttpd[32515]: (mod_fastcgi.c.3587) all handlers for /index.php?zone=cpzone&redirurl=/sj/data.gif?intype=32&andver=5.0&rom=0&actionname=kbd_main&kong=0&imei=351776064868573&mcc=310&serial=b888b894&root=0&prodid=2&channel=10000014&kvercode=4171000&androidid=221da9d43d0fbaaa&pid=3517760648685737445ee16a99140d388c9ae9ca3046d34&did=ims8xwf5fexpfhgwtmlsw54jqwhl&mac=48:5A:3F:03:6A:3F&busi_type=2&intime=20140502&newer=0&osver=21&cl=en&display=10801920&brand=samsung&mode=SM-N9005&kbdver=4.17.1&gaid=7e7ba8fd-f29b-4656-a0a1-9e864c89df3c&REPORT_ACTIVE=SelfAlarm_1445872006093_1445872005790_rescd_500 on .php are down.
Oct 26 15:21:30 lighttpd[32515]: (mod_fastcgi.c.2779) fcgi-server re-enabled: 0 /tmp/php-fastcgi-cpzone.socket
Oct 26 15:21:30 lighttpd[32515]: (mod_fastcgi.c.2779) fcgi-server re-enabled: 0 /tmp/php-fastcgi-cpzone.socket
Oct 26 15:21:30 lighttpd[32515]: (mod_fastcgi.c.2779) fcgi-server re-enabled: 0 /tmp/php-fastcgi-cpzone.socket
Oct 26 15:21:30 lighttpd[32515]: (mod_fastcgi.c.1754) connect failed: Connection refused on unix:/tmp/php-fastcgi-cpzone.socket-4
Oct 26 15:21:30 lighttpd[32515]: (mod_fastcgi.c.3021) backend died; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 0 load: 1085
Oct 26 15:21:30 lighttpd[32515]: (mod_fastcgi.c.1754) connect failed: Connection refused on unix:/tmp/php-fastcgi-cpzone.socket-2
Oct 26 15:21:30 lighttpd[32515]: (mod_fastcgi.c.3021) backend died; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 1 load: 1085
Oct 26 15:21:30 lighttpd[32515]: (mod_fastcgi.c.1754) connect failed: Connection refused on unix:/tmp/php-fastcgi-cpzone.socket-0
Oct 26 15:21:30 lighttpd[32515]: (mod_fastcgi.c.3021) backend died; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 2 load: 1085
Oct 26 15:21:30 lighttpd[32515]: (mod_fastcgi.c.3587) all handlers for /index.php?zone=cpzone&redirurl=/sj/data.gif?intype=32&andver=5.0&rom=0&actionname=kbd_main&kong=0&imei=351776064868573&mcc=310&serial=b888b894&root=0&prodid=2&channel=10000014&kvercode=4171000&androidid=221da9d43d0fbaaa&pid=3517760648685737445ee16a99140d388c9ae9ca3046d34&did=ims8xwf5fexpfhgwtmlsw54jqwhl&mac=48:5A:3F:03:6A:3F&busi_type=2&intime=20140502&newer=0&osver=21&cl=en&display=10801920&brand=samsung&mode=SM-N9005&kbdver=4.17.1&gaid=7e7ba8fd-f29b-4656-a0a1-9e864c89df3c&REPORT_ACTIVE=SelfAlarm_1445872006093_1445872005790_rescd_500 on .php are down.
Oct 26 15:21:30 lighttpd[32515]: (mod_evasive.c.183) 172.18.8.102 turned away. Too many connections.
Oct 26 15:21:33 lighttpd[32515]: (mod_fastcgi.c.2779) fcgi-server re-enabled: 0 /tmp/php-fastcgi-cpzone.socket
Oct 26 15:21:33 lighttpd[32515]: (mod_fastcgi.c.2779) fcgi-server re-enabled: 0 /tmp/php-fastcgi-cpzone.socket
Oct 26 15:21:33 lighttpd[32515]: (mod_fastcgi.c.2779) fcgi-server re-enabled: 0 /tmp/php-fastcgi-cpzone.socket
Oct 26 15:21:33 lighttpd[32515]: (mod_fastcgi.c.1754) connect failed: Connection refused on unix:/tmp/php-fastcgi-cpzone.socket-4
Oct 26 15:21:33 kernel: sonewconn: pcb 0xfffff8002c506e10: Listen queue overflow: 193 already in queue awaiting acceptance (63 occurrences)
Oct 26 15:21:33 lighttpd[32515]: (mod_fastcgi.c.3021) backend died; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 0 load: 1085
Oct 26 15:21:33 lighttpd[32515]: (mod_fastcgi.c.1754) connect failed: Connection refused on unix:/tmp/php-fastcgi-cpzone.socket-2
Oct 26 15:21:33 lighttpd[32515]: (mod_fastcgi.c.3021) backend died; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 1 load: 1085
Oct 26 15:21:33 lighttpd[32515]: (mod_fastcgi.c.1754) connect failed: Connection refused on unix:/tmp/php-fastcgi-cpzone.socket-0
Oct 26 15:21:33 lighttpd[32515]: (mod_fastcgi.c.3021) backend died; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 2 load: 1085
Oct 26 15:21:33 lighttpd[32515]: (mod_fastcgi.c.3587) all handlers for /index.php?zone=cpzone&redirurl=/sj/data.gif?intype=32&andver=5.0&rom=0&actionname=kbd_main&kong=0&imei=351776064868573&mcc=310&serial=b888b894&root=0&prodid=2&channel=10000014&kvercode=4171000&androidid=221da9d43d0fbaaa&pid=3517760648685737445ee16a99140d388c9ae9ca3046d34&did=ims8xwf5fexpfhgwtmlsw54jqwhl&mac=48:5A:3F:03:6A:3F&busi_type=2&intime=20140502&newer=0&osver=21&cl=en&click=charging_dialog_show&display=1080*1920&brand=samsung&mode=SM-N9005&kbdver=4.17.1&gaid=7e7ba8fd-f29b-4656-a0a1-9e864c89df3c on .php are down.

cmb

The root cause there is PHP dying. Since it's FastCGI, I guess that's a 2.1.x or older version on there. Upgrade to 2.2.4 first; php-fpm is better in that regard if it's a scalability issue, and you could be triggering some problem in the old PHP version.

Gertjan

This guy:
@carzin:

Oct 26 15:21:30 lighttpd[32515]: (mod_evasive.c.183) 172.18.8.102 turned away. Too many connections.

Is it a client on the captive portal?

If so, it's probably a case of a poorly written 'app' that doesn't understand what a 'portal' is and keeps hammering your portal. The portal sends over a 'login page', the client (172.18.8.102) doesn't want that page, and keeps asking again and again … up until 'no more resources' and PHP breaks.

But, hey, that's just a thought. I can't remember these issues with ancient versions very well ;)
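
To check whether one device really is doing the hammering, the log lines above can be tallied per client. A minimal Python sketch, assuming the lighttpd line formats shown in this thread; the `summarize` helper and both regexes are made up for illustration, and the `mac=` value is whatever the app put in its own URL, so treat it as a hint rather than proof.

```python
import re
from collections import Counter

# Both patterns are assumptions based on the log excerpt in this thread.
MAC_RE = re.compile(r"[?&]mac=([0-9A-Fa-f:]{17})")
EVASIVE_RE = re.compile(r"mod_evasive\.c\.\d+\) (\S+) turned away")

def summarize(lines):
    """Count failed portal requests per reported MAC and mod_evasive hits per IP."""
    by_mac, by_ip = Counter(), Counter()
    for line in lines:
        if "all handlers for" in line:
            m = MAC_RE.search(line)
            by_mac[m.group(1).upper() if m else "unknown"] += 1
            continue
        m = EVASIVE_RE.search(line)
        if m:
            by_ip[m.group(1)] += 1
    return by_mac, by_ip

if __name__ == "__main__":
    # Lines abbreviated from the log posted above (query strings shortened).
    sample = [
        "Oct 26 15:21:30 lighttpd[32515]: (mod_fastcgi.c.3587) all handlers for "
        "/index.php?zone=cpzone&redirurl=/sj/data.gif?intype=32"
        "&mac=48:5A:3F:03:6A:3F&busi_type=2 on .php are down.",
        "Oct 26 15:21:30 lighttpd[32515]: (mod_evasive.c.183) 172.18.8.102 "
        "turned away. Too many connections.",
    ]
    macs, ips = summarize(sample)
    print(macs.most_common(5))
    print(ips.most_common(5))
```

If one MAC or IP dominates the counts just before the portal dies, that is the device to look at first.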

cmb

@Gertjan:

This guy:
@carzin:

Oct 26 15:21:30 lighttpd[32515]: (mod_evasive.c.183) 172.18.8.102 turned away. Too many connections.

Is it a client on the captive portal?

If so, it's probably a case of a poorly written 'app' that doesn't understand what a 'portal' is and keeps hammering your portal. The portal sends over a 'login page', the client (172.18.8.102) doesn't want that page, and keeps asking again and again … up until 'no more resources' and PHP breaks.

Yes, that would be a client. The fact that the per-client connection limit is being hit should prevent it from exhausting PHP's resources. But that is along the lines of what I was thinking, except that something it was doing repeatedly caused PHP to crash rather than just run out of resources.

carzin

All: this box was running 2.2.4, so I'm on the latest and greatest. I've had this problem since we started using pfSense years ago, across multiple builds.

Gertjan

It's probably not PHP. On a lower level you have this:

Oct 26 15:21:33 kernel: sonewconn: pcb 0xfffff8002c506e10: Listen queue overflow: 193 already in queue awaiting acceptance (63 occurrences)

Google FreeBSD + sonewconn (so you know that you are not the only one), and try what the first link proposes.

Other links will help you nail down the process, port, etc.
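
To keep track of how often this happens, the kernel line can be pulled apart with a few lines of Python. This is only a sketch: `parse_overflow` is a made-up helper, the regex is copied from the single line quoted above, and the limit estimate assumes (as far as I know) that FreeBSD only complains once the queue exceeds 1.5 times the socket's limit.

```python
import re

# Assumed format, taken from the kernel line quoted above.
OVERFLOW_RE = re.compile(
    r"sonewconn: pcb (0x[0-9a-f]+): Listen queue overflow: "
    r"(\d+) already in queue awaiting acceptance(?: \((\d+) occurrences\))?"
)

def parse_overflow(line):
    """Return (pcb, queued, occurrences, estimated_limit), or None if no match."""
    m = OVERFLOW_RE.search(line)
    if not m:
        return None
    queued = int(m.group(2))
    occurrences = int(m.group(3) or 1)
    # Assumption: the overflow is reported once qlen > 1.5 x limit,
    # so the limit is roughly two thirds of (queued - 1).
    estimated_limit = (queued - 1) * 2 // 3
    return m.group(1), queued, occurrences, estimated_limit
```

For the line quoted above this gives a queue of 193, 63 occurrences and an estimated limit of 128.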

carzin

I need some spoon feeding. I am not a Linux guru. From the searches, I ran the following command (netstat -Lan) and saw a bunch of:

tcpX 0/0/128 which should tell me the queue size is 128.

The instructions tell you to issue the command:
sysctl kern.ipc.somaxconn=2048 and I get a readout of:

kern.ipc.somaxconn:128 -> 2048

However, when I run the netstat -Lan command again, it still shows a queue value of 128. What else do I need to do?

Gertjan

@carzin:

I need some spoon feeding. I am not a Linux guru. …..

It's even worse: Linux is not FreeBSD (at all).

Anyway, without putting my hands on your system, I cannot explain why your identical pfSense is behaving differently from mine.
Adapting the queues is just a countermeasure, because either
-> your system can't handle the load (the queues are filling up faster than pfSense can empty them),
or
-> (so) you need to analyze this 'load' … what's coming into your pfSense? Is it the WAN, the LAN, or some other interface that is flooding?

Can you limit the number of users?

Can tcpdump tell you something?

What did you change from the default setup?

Note that I'm not a network expert either, but these are the steps that I would take to dig up the problem.
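
On the netstat -Lan question: as far as I know, the third number (maxqlen) is fixed at the moment a daemon calls listen(), and kern.ipc.somaxconn only caps what a daemon may ask for. So raising the sysctl changes nothing for sockets that are already listening; the service has to be restarted, and a daemon that asks for 128 itself will stay at 128 anyway. To see which socket is actually filling up, the netstat output can be filtered. A rough Python sketch, assuming the usual FreeBSD column layout (Proto, qlen/incqlen/maxqlen, Local Address); `full_queues` and its `threshold` parameter are invented for this example.

```python
def full_queues(netstat_output, threshold=0.8):
    """Return (proto, qlen, maxqlen, address) for listen queues that are
    at or above `threshold` of their maximum length."""
    hits = []
    for line in netstat_output.splitlines():
        parts = line.split()
        # Data lines look like: tcp4  0/0/128  *.80 (headers are skipped).
        if len(parts) < 3 or parts[1].count("/") != 2:
            continue
        try:
            qlen, _incqlen, maxqlen = (int(x) for x in parts[1].split("/"))
        except ValueError:
            continue
        if maxqlen and qlen >= threshold * maxqlen:
            hits.append((parts[0], qlen, maxqlen, parts[2]))
    return hits
```

Feed it the text of netstat -Lan captured while the portal is misbehaving; the local address on any line it returns tells you which port or unix socket to chase.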

carzin

Well, there isn't much I can do to limit the users. The pfSense virtual machines (4 of them) are what I use to authenticate users when they connect to a setup SSID and funnel them to the appropriate configuration website. I use the DNS forwarding functionality to limit what they have access to after they connect. So, I have no control over how the users connect, or really, how many connect.

I suspect I see a lot more load on my boxes than most of you. At peak, I can have 100s of users connecting through a single instance. And the box works just fine with that load. The pfSense death happens for apparently no reason, and is not generally associated with load. Which is why I liked the idea of a 'bad client' basically beating the hell outta the server until it dies.

Gertjan

Just a thought.

You said:

Well, there isn't much I can do to limit the users

but you really 'nag' them with this:

I use the DNS forwarding functionality to limit what they have access to after they connect.

What I make of it:
The user's device knows it is connected (there is a DNS server, a gateway): the link seems up.
But many DNS requests will not receive a reply, or will receive a wrong reply.
What is the 'app' doing in this situation? A request to resolve e.g. facebook.com will yield many retries because it 'won't work'.

So: use tcpdump on incoming port 53, both UDP and TCP, to see if your DNS resolver gets swamped …

=> This is just an idea ....
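
Something like `tcpdump -n -i <lan_if> port 53` (interface name is a placeholder) prints one line per packet, and a short script can then count queries per client. This is only a sketch under assumptions: it expects tcpdump's default one-line text output for IPv4 DNS queries, and `count_queries` and the regex are made up for the example.

```python
import re
from collections import Counter

# Assumed tcpdump -n line shape for a DNS query, e.g.
# 15:21:30.123456 IP 172.18.8.102.51234 > 172.18.8.1.53: 4242+ A? facebook.com. (30)
QUERY_RE = re.compile(
    r"IP (\S+)\.\d+ > \S+\.53: \d+\S*(?: \[[^\]]*\])? (\w+)\? (\S+)"
)

def count_queries(lines):
    """Return (queries per client IP, queries per (client IP, name))."""
    per_client, per_name = Counter(), Counter()
    for line in lines:
        m = QUERY_RE.search(line)
        if not m:
            continue  # replies and non-DNS lines are ignored
        client, _qtype, name = m.groups()
        per_client[client] += 1
        per_name[(client, name)] += 1
    return per_client, per_name
```

A client that shows thousands of queries for the same few names in a minute would fit the 'retrying forever' picture.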

carzin

This is fun. Another zone, different from the last time, died. And this is in the syslog:

Nov 1 10:58:17 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 10:58:17 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:08:19 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:08:19 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:18:21 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:18:21 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:28:23 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:28:23 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:38:25 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:38:25 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:48:27 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:48:27 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:51:04 lighttpd[34493]: (connections.c.305) SSL: 1 error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
Nov 1 11:54:23 lighttpd[34493]: (connections.c.305) SSL: 1 error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
Nov 1 11:58:29 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:58:29 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 12:02:27 lighttpd[34493]: (connections.c.305) SSL: 1 error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
Nov 1 12:05:33 lighttpd[34493]: (connections.c.305) SSL: 1 error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
Nov 1 12:08:31 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 12:08:31 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number

Gertjan

Probably a client connecting to port 443 (HTTPS) without actually speaking HTTPS.
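
Looking at the timestamps, the 'no shared cipher' / 'wrong version number' pairs come back like clockwork, which smells more like an automated probe or monitor than a person, while the 'http request' lines are plain HTTP sent to the HTTPS port. A small Python sketch to measure the gaps, assuming the syslog format shown above; `intervals_by_reason` is a made-up helper, and the year is an assumption because syslog does not record one (it does not matter for gaps within the same year).

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line shape, taken from the lighttpd SSL errors posted above.
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) .*SSL routines:\w+:(.+)$")

def intervals_by_reason(lines, year=2015):
    """Map each SSL error reason to the gaps (seconds) between its occurrences."""
    seen = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            seen[m.group(2)].append(stamp)
    return {
        reason: [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
        for reason, times in seen.items()
    }
```

On the log above, the 'no shared cipher' entries are 602 seconds apart every time, so something is knocking on that port about every 10 minutes.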