Captive portal random deaths
-
The root cause there is PHP dying. Since you're on FastCGI, I'd guess that box is on 2.1.x or an older version. Upgrade to 2.2.4 first; php-fpm is better in that regard if it's some scalability issue, and you could be triggering a bug in the old PHP version.
-
This guy:
@carzin:Oct 26 15:21:30 lighttpd[32515]: (mod_evasive.c.183) 172.18.8.102 turned away. Too many connections.
Is it a client on the captive portal?
If so, it's probably a case of a badly written 'app' that doesn't understand what a 'portal' is and keeps hammering your portal. The portal sends over a 'login page', the client (172.18.8.102) doesn't want that page, and keeps asking again and again … up until 'no more resources' and PHP breaks.
But, hey, that's just a thought. I don't remember these issues with ancient versions well ;)
-
This guy:
@carzin:Oct 26 15:21:30 lighttpd[32515]: (mod_evasive.c.183) 172.18.8.102 turned away. Too many connections.
Is it a client on the captive portal?
If so, it's probably a case of a badly written 'app' that doesn't understand what a 'portal' is and keeps hammering your portal. The portal sends over a 'login page', the client (172.18.8.102) doesn't want that page, and keeps asking again and again … up until 'no more resources' and PHP breaks.
Yes, that would be a client. The fact that the client connection limit is being hit should prevent it from exhausting PHP's resources. But that is along the lines of what I was thinking, except that something the client was doing repeatedly caused PHP to crash rather than just run out of resources.
-
All: this box was running 2.2.4. So I'm on the latest and greatest. I've had this problem since we started using pfsense years ago, across multiple builds.
-
It's probably not PHP. On a lower level you have this:
Oct 26 15:21:33 kernel: sonewconn: pcb 0xfffff8002c506e10: Listen queue overflow: 193 already in queue awaiting acceptance (63 occurrences)
Google FreeBSD + sonewconn (so you know that you are not the only one), and try what the first link proposes.
The other links will help you nail down the process, port, etc.
-
I need some spoon feeding. I am not a Linux guru. From the searches, I ran netstat -Lan and saw a bunch of lines like:
tcpX 0/0/128
which should tell me the queue size is 128.
The instructions tell you to issue the command:
sysctl kern.ipc.somaxconn=2048
and I get a readout of: kern.ipc.somaxconn: 128 -> 2048
However, when I run netstat -Lan again, it still shows a queue value of 128. What else do I need to do?
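A likely explanation, sketched below under assumptions about the setup: kern.ipc.somaxconn is only the system-wide ceiling, while each daemon requests its own backlog when it calls listen(2), so sockets that were opened before the change keep their old queue depth until the daemon re-creates them. The restart mechanism shown in the comments is an assumption; on pfSense the portal's lighttpd is managed by the system, and the exact method varies by version.

```shell
# Raise the system-wide cap (an upper bound only; each daemon still
# requests its own backlog via listen(2)).
sysctl kern.ipc.somaxconn=2048

# Make it survive a reboot:
echo 'kern.ipc.somaxconn=2048' >> /etc/sysctl.conf

# The listening socket must be re-created to pick anything up:
# restart the captive portal / lighttpd service (e.g. from
# Status > Services in the pfSense GUI -- exact mechanism varies).

# Verify afterwards. Note that netstat will still show the backlog
# the daemon asked for, which may be less than 2048:
netstat -Lan | grep tcp
```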
-
I need some spoon feeding. I am not a Linux guru. …..
It's even worse: Linux is not FreeBSD (at all).
Anyway, without putting my hands on your system, I cannot explain why your identical pfSense is behaving differently from mine.
Adapting the queues is just a countermeasure, because:
-> your system can't handle the load (the queues are filling up faster than pfSense can handle them),
or
-> (so) analyze this 'load': what is coming into your pfSense? Is it the WAN, the LAN, or another interface that is flooding? Can you limit the number of users?
Can tcpdump tell you something ?
What did you change from the default setup ?
Note that I'm not a network expert either, but these are the steps that I would take to dig up the problem.
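For the "what is flooding" step above, a couple of FreeBSD commands might help. The interface name and port below are placeholders, not the poster's actual values; the portal's real listening port can be found with sockstat first.

```shell
# Which processes are listening on which ports (find the portal's port):
sockstat -4l

# Watch connection attempts (SYNs) hitting the portal; replace em0
# and 8000 with the real interface and port from sockstat:
tcpdump -ni em0 'tcp[tcpflags] & tcp-syn != 0 and dst port 8000'
```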
-
Well, there isn't much I can do to limit the users. The pfSense virtual machines (4 of them) are what I use to authenticate users when they connect to a setup SSID and funnel them to the appropriate configuration website. I use the DNS forwarding functionality to limit what they have access to after they connect. So, I have no control over how the users connect, or really, how many connect.
I suspect I see a lot more load on my boxes than most of you. At peak, I can have hundreds of users connecting through at a single instant, and the box works just fine with that load. The pfSense death happens for apparently no reason and is not generally associated with load, which is why I liked the idea of a 'bad client' basically beating the hell outta the server until it dies.
-
Just a thought.
You said:
Well, there isn't much I can do to limit the users
but you really 'nag' them with this:
I use the DNS forwarding functionality to limit what they have access to after they connect.
What I make of it:
The user's device knows it is connected (there is a DNS server, a gateway): the link seems up.
But many DNS requests will not receive a reply - or will receive a wrong reply.
What is the 'app' doing in this situation?? A request to resolve e.g. facebook.com will yield many retries because it 'won't work'. So: use tcpdump on incoming port 53 - UDP and TCP - to see if your DNS resolver gets swamped …
=> This is just an idea ....
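That port-53 capture could be turned into a per-client query count with a small pipeline. The interface name is a placeholder, and the sample lines below are fabricated tcpdump-style output just to demonstrate the pipeline; live use needs root.

```shell
# Live use:  tcpdump -ni em0 -c 2000 'port 53' | count_per_src
# A misbehaving client retrying DNS in a loop will dominate the list.
count_per_src() {
  # $3 of a tcpdump line is "src-ip.src-port"; strip the port,
  # then count occurrences per source address.
  awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn
}

# Demonstration on sample tcpdump-style lines:
printf '%s\n' \
  '15:21:30.1 IP 172.18.8.102.51515 > 10.0.0.1.53: A? facebook.com.' \
  '15:21:30.2 IP 172.18.8.102.51516 > 10.0.0.1.53: A? facebook.com.' \
  '15:21:30.3 IP 172.18.8.50.40000 > 10.0.0.1.53: A? example.com.' \
  | count_per_src
```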
-
This is fun. Another zone, different from the last time, died. And this is in the syslog:
Nov 1 10:58:17 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 10:58:17 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:08:19 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:08:19 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:18:21 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:18:21 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:28:23 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:28:23 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:38:25 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:38:25 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:48:27 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:48:27 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:51:04 lighttpd[34493]: (connections.c.305) SSL: 1 error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
Nov 1 11:54:23 lighttpd[34493]: (connections.c.305) SSL: 1 error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
Nov 1 11:58:29 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:58:29 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 12:02:27 lighttpd[34493]: (connections.c.305) SSL: 1 error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
Nov 1 12:05:33 lighttpd[34493]: (connections.c.305) SSL: 1 error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
Nov 1 12:08:31 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 12:08:31 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
-
Probably a client connecting to port 443 (HTTPS) without actually speaking HTTPS.
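Each of those error classes can be reproduced by hand to confirm that reading. The host and port below are placeholders for the portal's HTTPS address, and whether a given probe reproduces the error depends on the server's configured protocols and ciphers.

```shell
# 'http request' error: speak plain HTTP to the TLS port.
printf 'GET / HTTP/1.0\r\n\r\n' | nc 192.168.1.1 443

# 'wrong version number' / 'no shared cipher': probe which TLS
# versions and ciphers the server actually accepts.
openssl s_client -connect 192.168.1.1:443 -tls1 </dev/null
openssl s_client -connect 192.168.1.1:443 -cipher 'RC4-MD5' </dev/null
```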