Unholy device eats 100% CPU in php-fpm by hammering the captive portal
-
Remote web access is already limited to my IP so I guess this isn't the problem here.
Is there any PHP profiling available in pfSense?
-
Not sure. I'm often wrong about these things, but personally I'd limit it to VPN access only.
-
You shouldn't end up with that even if your GUI were open to the world (though that would be a bad idea regardless).
Anything in your system log?
A truss of what the process is doing might be telling. For instance on the top PID you're showing there:
truss -p 83276 -o /root/83276-truss.txt
Let that run for 30 seconds or so and hit Ctrl-C, then download the file and attach it. You can repeat the same for the other php-fpm PIDs if it's like what you're showing there, with three separate ones chewing up a good deal of CPU.
-
System logs show the following messages like every hour or so:
Jun 23 09:31:35 kernel sonewconn: pcb 0xfffff80061c4d188: Listen queue overflow: 193 already in queue awaiting acceptance (1288 occurrences)
Jun 23 09:30:34 kernel sonewconn: pcb 0xfffff80061c4d188: Listen queue overflow: 193 already in queue awaiting acceptance (523 occurrences)
The number 193 and the pcb address always stay the same across all messages.
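As a quick sanity check that it really is one and the same socket, the pcb address and queue depth can be pulled out of those messages with sed. This is just a sketch using the two log lines above:

```shell
# Sanity check (sketch): confirm the pcb address and queue depth are
# identical across all the sonewconn messages. Sample lines are pasted
# from the system log above.
cat > /tmp/sonewconn.log <<'EOF'
Jun 23 09:31:35 kernel sonewconn: pcb 0xfffff80061c4d188: Listen queue overflow: 193 already in queue awaiting acceptance (1288 occurrences)
Jun 23 09:30:34 kernel sonewconn: pcb 0xfffff80061c4d188: Listen queue overflow: 193 already in queue awaiting acceptance (523 occurrences)
EOF
# One line of output from each command means one socket / one constant depth.
sed -n 's/.*pcb \(0x[0-9a-f]*\):.*/\1/p' /tmp/sonewconn.log | sort -u
sed -n 's/.*overflow: \([0-9]*\) already.*/\1/p' /tmp/sonewconn.log | sort -u
```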
I've tried to isolate which process owns that pcb address with lsof | grep 0xfff…, but I didn't get any result.
I've tried increasing kern.ipc.somaxconn from 128 to 512, but this doesn't change CPU usage. The php-fpm processes spawn and exit after about 30 seconds, so the truss output might cover less than 30 seconds.
Attached are two truss outputs.
PS: as for the open web configurator access, I've been working on a tap VPN bridge these past days, but in the meantime I had to be able to configure it :)
-
It's captive portal that's getting beat up there. Guessing you don't have a "Maximum concurrent connections" limit set in your CP, and some device is on that network, not logged into the portal, with an update checker or something running in the background that's causing numerous non-stop requests to hit CP.
-
Thank you for the insight.
I renamed the thread to something more explicit. Can a forum admin move it to Captive Portal, perhaps?
Once I knew what I was looking for, I isolated the problem to one device that tries to access a URL 50-70 times a second.
tail /var/log/nginx.log
Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /index.php?zone=lan&redirurl=http%3A%2F%2Fweatherblink.wdgserv.com%2Fweatherblink%2Flookup%2FLyon%2C+France HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /index.php?zone=lan&redirurl=http%3A%2F%2Fweatherblink.wdgserv.com%2Fweatherblink%2Flookup%2FLyon%2C+France HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /index.php?zone=lan&redirurl=http%3A%2F%2Fweatherblink.wdgserv.com%2Fweatherblink%2Flookup%2FLyon%2C+France HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
.......
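For anyone else hunting a flooder like this, ranking client IPs by request count does the isolation in one pipeline. This is only a sketch: the sample file below reuses (abridged) entries from the excerpt above plus one made-up second client (172.16.2.50) for contrast; on the firewall itself you would point awk at /var/log/nginx.log instead.

```shell
# Sketch: rank client IPs by request count to spot the flooding device.
# Sample data abridged from the log excerpt above; 172.16.2.50 is a
# hypothetical second client added for contrast.
cat > /tmp/nginx-sample.log <<'EOF'
Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0"
Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0"
Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /index.php?zone=lan HTTP/1.1" 499 0 "-" "Mozilla/5.0"
Jun 24 10:53:10 pfsensebox pfsensebox.company.local nginx: 172.16.2.50 - - [24/Jun/2016:10:53:10 +0200] "GET /index.php?zone=lan HTTP/1.1" 200 4310 "-" "Mozilla/5.0"
EOF
# In this syslog-wrapped log format the client IP is field 7.
awk '{print $7}' /tmp/nginx-sample.log | sort | uniq -c | sort -rn
```

The worst offender ends up on the first output line.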
The setup I have is open internet access for 200 student flats, where the user account is created on the fly by the CP page.
Everyone can bring as many devices as they want; that's what the customer demands.
Is there any way I can limit CP requests on a per-IP or per-MAC basis, to mitigate some random pri*** using the worst software ever written to flood my pfSense CP?
I noticed that between 2.3.1-1 and 2.3.1-5, the keepalive_timeout for the CP nginx instance was disabled (#6421). Maybe that triggered the new behavior?
Also, maybe lowering the value of limit_conn addr could help. Any thoughts?
Regards,
Ozy. -
OMG! Reading through the code in system.inc, I realized that limit_conn addr = "Maximum concurrent connections" in the CP web interface.
I lowered the default value (10) to 3. We'll see what happens. Any idea what a "good" value for the maximum concurrent connections would be?
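In nginx terms, that setting corresponds to a directive pair roughly like the following. This is only a sketch: the zone name and size here are illustrative guesses, not necessarily what system.inc actually emits; only the limit value (3) comes from this thread.

```nginx
# Sketch of per-client connection limiting in nginx.
# Zone name/size are illustrative; the limit of 3 is the value set above.
limit_conn_zone $binary_remote_addr zone=addr:10m;

server {
    limit_conn addr 3;   # at most 3 concurrent connections per client IP
}
```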
-
Good point, didn't think about it but I bet the keepalive timeout being disabled now probably makes the simultaneous connections significantly less effective than it used to be. Those stupid apps that are doing repeated calls like that are likely using some kind of library to do it that will utilize the keepalive. When the keepalive is disabled, they'll issue a new request every time and close out the former immediately, so if requests are issued sequentially you can rack up a huge load with something really persistent.
How many is appropriate depends on what's in your portal page. If it's just the HTML page with no images, no CSS, js, etc. - nothing other than one page to fetch, then 2-3 is probably a fine limit.
-
Maybe a keepalive of like 2-5 seconds would be better? Enough to reap the benefits in these rapid-fire cases but should work for all but the fastest CP logins?
If it is establishing a new session every time it won't matter anyway. Have to see a tcpdump.
-
Guessing since it started being noticeable post-upgrade, it was probably obeying the keepalive before. Disabling it is definitely nicer for redirecting users instantaneously in all cases. Maybe a smaller keepalive would be more appropriate, something like 2 seconds possibly. There are also rate-limiting options in nginx which might be more appropriate instead, or maybe in addition. Something like 1 request/sec, though for users who have CSS, images, etc. on the portal page that's too low. Maybe make that a configurable option as well.
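That rate-limiting idea maps onto nginx's limit_req module; combined with a short keepalive it would look something like the sketch below. To be clear, pfSense doesn't expose these as options, so this would mean hand-editing the generated CP nginx config (and it would get overwritten); the 1r/s rate and keepalive of 2 are the values floated in this thread, while the burst size is an illustrative guess to let a full page load (CSS, images) through before throttling kicks in.

```nginx
# Sketch only: per-client request rate limiting plus a short keepalive.
# rate=1r/s and keepalive_timeout 2 are the values discussed above;
# burst=20 is an illustrative allowance for one page load's worth of assets.
limit_req_zone $binary_remote_addr zone=cp_req:10m rate=1r/s;

server {
    keepalive_timeout 2;             # short keepalive instead of fully disabled
    limit_req zone=cp_req burst=20;  # absorb one page load, then throttle
}
```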
-
My CP page is kinda complex with mysql, css, bootstrap and a lot of php code.
Setting maximum concurrent connections from 10 to 3 decreased CPU usage from 100% to 30%.
Setting keepalive to 2 gave an additional CPU drop; usage is now 7-15%. Maybe setting keepalive to 2 is indeed a good solution for the next release?
Regards,
Ozy.