Captive portal random deaths
-
The root cause there is PHP dying. Since you're on FastCGI, I'd guess that box is on 2.1.x or an older version. Upgrade to 2.2.4 first; php-fpm is better in that regard if it's some scalability issue, and you could be triggering a bug in the old PHP version.
-
This guy:
@carzin:Oct 26 15:21:30 lighttpd[32515]: (mod_evasive.c.183) 172.18.8.102 turned away. Too many connections.
Is it a client on the captive portal?
If so, it's probably a case of a badly written 'app' that doesn't understand what a 'portal' is and keeps hammering your portal. The portal sends over a 'login page', the client (172.18.8.102) doesn't want that page, and keeps asking again and again … up until 'no more resources' and PHP breaks.
But, hey, that's just a thought. I don't remember these issues with ancient versions well ;)
-
This guy:
@carzin:Oct 26 15:21:30 lighttpd[32515]: (mod_evasive.c.183) 172.18.8.102 turned away. Too many connections.
Is it a client on the captive portal?
If so, it's probably a case of a badly written 'app' that doesn't understand what a 'portal' is and keeps hammering your portal. The portal sends over a 'login page', the client (172.18.8.102) doesn't want that page, and keeps asking again and again … up until 'no more resources' and PHP breaks.
Yes, that would be a client. The fact that the client connection limit is being hit should prevent it from exhausting PHP's resources. But that is along the lines of what I was thinking, except that something the client was doing repeatedly caused PHP to crash rather than just run out of resources.
-
All: this box was running 2.2.4. So I'm on the latest and greatest. I've had this problem since we started using pfsense years ago, across multiple builds.
-
It's probably not PHP. On a lower level you have this:
Oct 26 15:21:33 kernel: sonewconn: pcb 0xfffff8002c506e10: Listen queue overflow: 193 already in queue awaiting acceptance (63 occurrences)
Google FreeBSD + sonewconn (so you know that you are not the only one), and try what the first link proposes.
The other links will help you nail down the process, port, etc.
-
I need some spoon feeding. I am not a Linux guru. From the searches, I ran netstat -Lan and saw a bunch of lines like:
tcpX 0/0/128
which should tell me the queue size is 128.
The instructions tell you to issue the command:
sysctl kern.ipc.somaxconn=2048
and I get a readout of: kern.ipc.somaxconn: 128 -> 2048
However, when I run netstat -Lan again, it still shows a queue value of 128. What else do I need to do?
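A likely explanation, sketched below under assumptions about the setup: kern.ipc.somaxconn is only the system-wide ceiling, while each daemon requests its own backlog when it calls listen(2), so sockets that were opened before the change keep their old queue depth until the daemon re-creates them. The restart mechanism shown in the comments is an assumption; on pfSense the portal's lighttpd is managed by the system, and the exact method varies by version.

```shell
# Raise the system-wide cap (an upper bound only; each daemon still
# requests its own backlog via listen(2)).
sysctl kern.ipc.somaxconn=2048

# Make it survive a reboot:
echo 'kern.ipc.somaxconn=2048' >> /etc/sysctl.conf

# The listening socket must be re-created to pick anything up:
# restart the captive portal / lighttpd service (e.g. from
# Status > Services in the pfSense GUI -- exact mechanism varies).

# Verify afterwards. Note that netstat will still show the backlog
# the daemon asked for, which may be less than 2048:
netstat -Lan | grep tcp
```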
-
I need some spoon feeding. I am not a Linux guru. …..
It's even worse: Linux is not FreeBSD (at all).
Anyway, without putting my hands on your system, I cannot explain why your identical pfSense is behaving differently from mine.
Adapting the queues is just a countermeasure, because:
-> your system can't handle the load (the queues are filling up faster than pfSense can handle them),
or
-> (so) analyze this 'load': what is coming into your pfSense? Is it the WAN, the LAN, or another interface that is flooding? Can you limit the number of users?
Can tcpdump tell you something ?
What did you change from the default setup ?
Note that I'm not a network expert either, but these are the steps that I would take to dig up the problem.
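For the "what is flooding" step above, a couple of FreeBSD commands might help. The interface name and port below are placeholders, not the poster's actual values; the portal's real listening port can be found with sockstat first.

```shell
# Which processes are listening on which ports (find the portal's port):
sockstat -4l

# Watch connection attempts (SYNs) hitting the portal; replace em0
# and 8000 with the real interface and port from sockstat:
tcpdump -ni em0 'tcp[tcpflags] & tcp-syn != 0 and dst port 8000'
```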
-
Well, there isn't much I can do to limit the users. The pfSense virtual machines (4 of them) are what I use to authenticate users when they connect to a setup SSID and funnel them to the appropriate configuration website. I use the DNS forwarding functionality to limit what they have access to after they connect. So, I have no control over how the users connect, or really, how many connect.
I suspect I see a lot more load on my boxes than most of you. At peak, I can have hundreds of users connecting through at a single instant, and the box works just fine with that load. The pfSense death happens for apparently no reason and is not generally associated with load, which is why I liked the idea of a 'bad client' basically beating the hell outta the server until it dies.
-
Just a thought.
You said:
Well, there isn't much I can do to limit the users
but you really 'nag' them with this:
I use the DNS forwarding functionality to limit what they have access to after they connect.
What I make of it:
The user's device knows it is connected (there is a DNS server, a gateway): the link seems up.
But many DNS requests will not receive a reply - or will receive a wrong reply.
What is the 'app' doing in this situation?? A request to resolve e.g. facebook.com will yield many retries because it 'won't work'. So: use tcpdump on incoming port 53 - UDP and TCP - to see if your DNS resolver gets swamped …
=> This is just an idea ....
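That port-53 capture could be turned into a per-client query count with a small pipeline. The interface name is a placeholder, and the sample lines below are fabricated tcpdump-style output just to demonstrate the pipeline; live use needs root.

```shell
# Live use:  tcpdump -ni em0 -c 2000 'port 53' | count_per_src
# A misbehaving client retrying DNS in a loop will dominate the list.
count_per_src() {
  # $3 of a tcpdump line is "src-ip.src-port"; strip the port,
  # then count occurrences per source address.
  awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn
}

# Demonstration on sample tcpdump-style lines:
printf '%s\n' \
  '15:21:30.1 IP 172.18.8.102.51515 > 10.0.0.1.53: A? facebook.com.' \
  '15:21:30.2 IP 172.18.8.102.51516 > 10.0.0.1.53: A? facebook.com.' \
  '15:21:30.3 IP 172.18.8.50.40000 > 10.0.0.1.53: A? example.com.' \
  | count_per_src
```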
-
This is fun. Another zone, different from the last time, died. And this is in the syslog:
Nov 1 10:58:17 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 10:58:17 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:08:19 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:08:19 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:18:21 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:18:21 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:28:23 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:28:23 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:38:25 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:38:25 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:48:27 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:48:27 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 11:51:04 lighttpd[34493]: (connections.c.305) SSL: 1 error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
Nov 1 11:54:23 lighttpd[34493]: (connections.c.305) SSL: 1 error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
Nov 1 11:58:29 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 11:58:29 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
Nov 1 12:02:27 lighttpd[34493]: (connections.c.305) SSL: 1 error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
Nov 1 12:05:33 lighttpd[34493]: (connections.c.305) SSL: 1 error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
Nov 1 12:08:31 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A0C1:SSL routines:SSL3_GET_CLIENT_HELLO:no shared cipher
Nov 1 12:08:31 lighttpd[34493]: (connections.c.305) SSL: 1 error:1408A10B:SSL routines:SSL3_GET_CLIENT_HELLO:wrong version number
-
Probably a client connecting to port 443 (HTTPS) without actually speaking HTTPS.
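Each of those error classes can be reproduced by hand to confirm that reading. The host and port below are placeholders for the portal's HTTPS address, and whether a given probe reproduces the error depends on the server's configured protocols and ciphers.

```shell
# 'http request' error: speak plain HTTP to the TLS port.
printf 'GET / HTTP/1.0\r\n\r\n' | nc 192.168.1.1 443

# 'wrong version number' / 'no shared cipher': probe which TLS
# versions and ciphers the server actually accepts.
openssl s_client -connect 192.168.1.1:443 -tls1 </dev/null
openssl s_client -connect 192.168.1.1:443 -cipher 'RC4-MD5' </dev/null
```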