Netgate Discussion Forum
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Search
    • Register
    • Login

    Unholy device eats 100% cpu php-fpm by accessing captive portal like hell

    Scheduled Pinned Locked Moved Problems Installing or Upgrading pfSense Software
    15 Posts 4 Posters 4.1k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • K Offline
      kejianshi
      last edited by

      Close that and use a vpn or SSH.  Opening your web access to the WAN is stupid.

      Don't feel bad - Everyone is stupid before they get smart.

      1 Reply Last reply Reply Quote 0
      • D Offline
        deajan
        last edited by

        Remote web access is already limited to my IP so I guess this isn't the problem here.
        Is there any php profiling available in pfSense ?

        NetPOWER.fr - some opensource stuff for IT people

        1 Reply Last reply Reply Quote 0
        • K Offline
          kejianshi
          last edited by

          Not sure.  I'm often wrong about these things, but me personally.

          I'd limit it to VPN access only.

          1 Reply Last reply Reply Quote 0
          • C Offline
            cmb
            last edited by

            Wouldn't end up with that even if your GUI was open to the world (though that would be a bad idea regardless).

            Anything in your system log?

            A truss of what the process is doing might be telling. For instance on the top PID you're showing there:

            truss -p 83276 -o /root/83276-truss.txt
            

            Let that run for 30 seconds or so and hit ctrl-c, then download that file and attach it. Can repeat same for the other php-fpm PIDs if it's like what you're showing there, with 3 separate ones chewing up a good deal of CPU.

            1 Reply Last reply Reply Quote 0
            • D Offline
              deajan
              last edited by

              System logs show the following messages like every hour or so:

              
              JJun 23 09:31:35	kernel		sonewconn: pcb 0xfffff80061c4d188: Listen queue overflow: 193 already in queue awaiting acceptance (1288 occurrences)
              Jun 23 09:30:34	kernel		sonewconn: pcb 0xfffff80061c4d188: Listen queue overflow: 193 already in queue awaiting acceptance (523 occurrences)
              
              

              The number 193 and pcb address always stays the same accross all messages.
              I've tried to isolate which process has the pcb address with lsof | grep 0xfff…. but I didn't get any result.
              I've tried to increase kern.ipc.somaxconn from 128 to 512 but this doen't change cpu usage.

              The php-fpm processes spawn and exit after like 30 seconds, so truss output might be shorter than 30 seconds.
              Attached are two truss outputs.

              PS: as for the open web configurator access, I worked on a tap VPN bridge past days, but in the meantime I had to be able to configure it :)

              truss.73254.log.tar.gz
              truss.13089.log.tar.gz

              NetPOWER.fr - some opensource stuff for IT people

              1 Reply Last reply Reply Quote 0
              • C Offline
                cmb
                last edited by

                It's captive portal that's getting beat up there. Guessing you don't have a "Maximum concurrent connections" limit set in your CP, and some device is on that network, not logged into the portal, with an update checker or something running in the background that's causing numerous non-stop requests to hit CP.

                1 Reply Last reply Reply Quote 0
                • D Offline
                  deajan
                  last edited by

                  Thank you for the insight.
                  I renamed the thread to something more explicit. Can a forum admin move it to captiveportal perhaps ?

                  After knowing what I was searching for, I isolated the problem to one device that tries to access 50-70 times a second an URL.

                  tail /var/log/nginx.log

                  
                  Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /index.php?zone=lan&redirurl=http%3A%2F%2Fweatherblink.wdgserv.com%2Fweatherblink%2Flookup%2FLyon%2C+France HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
                  Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /index.php?zone=lan&redirurl=http%3A%2F%2Fweatherblink.wdgserv.com%2Fweatherblink%2Flookup%2FLyon%2C+France HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
                  Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
                  Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
                  Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
                  Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
                  Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
                  Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
                  Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
                  Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /index.php?zone=lan&redirurl=http%3A%2F%2Fweatherblink.wdgserv.com%2Fweatherblink%2Flookup%2FLyon%2C+France HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
                  Jun 24 10:53:09 pfsensebox pfsensebox.company.local nginx: 172.16.2.238 - - [24/Jun/2016:10:53:09 +0200] "GET /weatherblink/lookup/Lyon,%20France HTTP/1.1" 302 5 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
                  .......
                  
                  

                  The setup I have is an open internet access for 200 student flats, where the user account is created on the fly by the CP page.
                  Everyone can bring as many devices as they want, that's the demand of the customer.

                  Is there any way I can limit CP demands per IP or MAC base in order to mitigate some random pri*** using the worst software ever written to flood my pfsense CP ?

                  I was thinking that between 2.3.1-1 and 2.3.1-5, the keepalive_timeout for the CP nginx instance was disabled (#6421). Maybe this triggered the new behavior ?
                  Also, maybe lowering the value of limit_conn addr could help.

                  Any thoughts ?

                  Regards,
                  Ozy.

                  NetPOWER.fr - some opensource stuff for IT people

                  1 Reply Last reply Reply Quote 0
                  • D Offline
                    deajan
                    last edited by

                    OMG ! Reading through the code in system.inc, I realized that limit_conn addr = "Maximum concurrent connections" in the CP web interface.
                    I lowered the default value (10) to 3. Will see what happens.

                    Any idea what's a "good" value for the max concurrent connections value ?

                    NetPOWER.fr - some opensource stuff for IT people

                    1 Reply Last reply Reply Quote 0
                    • C Offline
                      cmb
                      last edited by

                      Good point, didn't think about it but I bet the keepalive timeout being disabled now probably makes the simultaneous connections significantly less effective than it used to be. Those stupid apps that are doing repeated calls like that are likely using some kind of library to do it that will utilize the keepalive. When the keepalive is disabled, they'll issue a new request every time and close out the former immediately, so if requests are issued sequentially you can rack up a huge load with something really persistent.

                      How many is appropriate depends on what's in your portal page. If it's just the HTML page with no images, no CSS, js, etc. - nothing other than one page to fetch, then 2-3 is probably a fine limit.

                      1 Reply Last reply Reply Quote 0
                      • DerelictD Offline
                        Derelict LAYER 8 Netgate
                        last edited by

                        Maybe a keepalive of like 2-5 seconds would be better? Enough to reap the benefits in these rapid-fire cases but should work for all but the fastest CP logins?

                        If it is establishing a new session every time it won't matter anyway. Have to see a tcpdump.

                        Chattanooga, Tennessee, USA
                        A comprehensive network diagram is worth 10,000 words and 15 conference calls.
                        DO NOT set a source address/port in a port forward or firewall rule unless you KNOW you need it!
                        Do Not Chat For Help! NO_WAN_EGRESS(TM)

                        1 Reply Last reply Reply Quote 0
                        • C Offline
                          cmb
                          last edited by

                          Guessing since it started being noticeable post-upgrade it was probably obeying the keepalive. Disabling it is definitely nicer for purposes of redirecting users instantaneously in all cases. Maybe a smaller keepalive would be more appropriate, something like 2 seconds possibly. There are also rate limiting options in nginx which might be more appropriate instead, or maybe in addition to. Something like 1 request/sec, though for users who have CSS, images, etc. on the portal page that's too low. Maybe make that a configurable option as well.

                          1 Reply Last reply Reply Quote 0
                          • D Offline
                            deajan
                            last edited by

                            My CP page is kinda complex with mysql, css, bootstrap and a lot of php code.

                            Setting maximum concurrent connections from 10 to 3 decreased cpu usage from 100 to 30%.
                            Setting keepalive 2 gave an additionnal cpu drop, usage is now 7%-15%.

                            Maybe setting keepalive to 2 is indeed a good solution for next release ?

                            Regards,
                            Ozy.

                            NetPOWER.fr - some opensource stuff for IT people

                            1 Reply Last reply Reply Quote 0
                            • First post
                              Last post
                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.