Netgate Discussion Forum

    22.05 - CP clients have connectivity issues after x amount of time

    Captive Portal
    • H
      heper @heper
      last edited by heper

      problems persist. no clue how to debug. please advise

      packet capture when broken:
      ef48f1cf-fbde-4ff2-bf00-debdd8bfbc0c-image.png
      diag->limiter info when broken:
      78c4b283-75e2-4883-a521-d164d286633d-image.png

      editing CP, check "Enable per-user bandwidth restriction" => save
      edit CP, un-check "Enable per-user bandwidth restriction" => save
      portal starts working again

      packet capture afterwards:
      879c19de-c7bf-4bfe-ad73-67c1f1eade11-image.png
      limiter info afterwards:
      448c6e2f-641b-46d6-81fc-d692c8657e56-image.png

      so for whatever reason shit gets fucked.
      please advise how to debug. sorry to tag you @stephenw10

      additional info:

      just started noticing around 20 seconds of lost pings/downtime on the entire CP-vlan every time nginx.log gets rotated & bzip'd.
      will check if this is somehow related. nginx log gets rotated every 5 minutes or so because of CP requests

      (update 2): it happened again around the time the dhcpd log rotated.

      • GertjanG
        Gertjan @heper
        last edited by Gertjan

        @heper

        Rotation can happen a lot if the files are small => make the log files a lot bigger
        Bzipping is optional. Do you really need it ? I've stopped that.

        You have a MAX (like me) after all, so space isn't really an issue (90 Gbytes or so ?)

        The nginx log mostly contains the captive portal login page visits, these are a couple of lines per visitor per login - and me when I visit the Dashboard / GUI etc.
        You have that many lines ?

        Btw : not related to your issue, I know.

        No "help me" PM's please. Use the forum, the community will thank you.
        Edit : and where are the logs ??

        • H
          heper @Gertjan
          last edited by

          @gertjan
          at around 3pm (15:00) this afternoon i've increased the nginx log to 5MB & increased the dhcp log to 2MB
          this should reduce the number of log rotations.

          the default 500k nginx log would rotate approx every 3 minutes:
          a79b71da-ebc0-4cee-9706-eba17d755a3a-image.png

          contents of nginx log is spammed with clients hotspot detection crap like this:
          63b17a43-ec74-472c-ab7b-8502678c3aef-image.png

          i will disable bzip. but i'm thinking the excessive hotspot_detection logging is a consequence of the connection issues the clients are having at random intervals. i doubt logrotating is the actual cause of the connection issues.

          currently it's 6:45pm.
          1 test-client is still connected to CP:
          7c4565dc-7345-4759-ac8a-b89cb759f461-image.png
          as seen above, there is no traffic flowing, not a single reply on any of the tcp or icmp requests. not even pfsense itself (172.16.20.1) replies to ping

          1. status -> filterreload: situation remains the same.

          2. status -> services -> captiveportal restart: situation remains the same.

          3. services -> CP -> change idle timeout + save: situation remains the same

          4. services -> CP -> change 'per-user-bandwidth-restriction' setting to anything different:
            9bb692f4-f581-4c7a-a3b5-913ae0b17bb3-image.png
            as one can see, tcp & icmp requests start getting replies again

          still no clue how, when, why things stop working after some time. not even sure if there is a fixed interval.

          hopefully someone can give me some debugging pointers

          • stephenw10S
            stephenw10 Netgate Administrator
            last edited by

            Can I assume you do not see either of these issues if per user bandwidth is not enabled at all?

            Steve

            • H
              heper @stephenw10
              last edited by

              @stephenw10
              issue remains even when per-user-bandwidth is disabled.
              it doesn't matter if it's enabled or disabled.

              everything starts working temporarily when i make a change to the setting.
              any change works:

              • disabling works
              • enabling works
              • changing the speeds works

              all for x amount of time until it breaks down again

              • stephenw10S
                stephenw10 Netgate Administrator
                last edited by

                Hmm. And the number of connected clients doesn't make any difference?

                Do existing states remain passing traffic? Just new connections fail?

                • H
                  heper @stephenw10
                  last edited by

                  @stephenw10
                  If the number of clients increases, the issue returns more quickly.

                  Some states seem to remain. At least for a while.
                  Traffic comes to a complete stop.

                  Sometimes it fixes itself after a couple of minutes. Sometimes it stays 'stuck' forever

                  • M
                    marcosm Netgate @heper
                    last edited by

                    FYI this is likely related:
                    https://redmine.pfsense.org/issues/13150#note-16

                    • H
                      heper @marcosm
                      last edited by heper

                      @marcosm
                      That ticket might be related but I'm not using radius attributes for per-user bandwidth limiting.

                      Also the problem occurs when disabling per-user bandwidth limiting altogether.

                      But I do believe limiters are somehow involved.
                      At this point it is more of a gut feeling. I don't know how/where to read/interpret the realtime limiter-data to get to the bottom of this.

                      Will try to gain more insight in the redmine you've posted

                      • stephenw10S
                        stephenw10 Netgate Administrator
                        last edited by

                        It definitely 'feels' like Limiters and all of that code changed in 22.05 with the removal of ipfw.

                        It's probably related to that ticket because it similarly pulls in the expected values when the ruleset is reapplied.

                        Still looking into this...

                        • H
                          heper @stephenw10
                          last edited by heper

                          @stephenw10

                          i've found a fancy new party trick to be able to reproduce it on my system:

                          1. disconnect a user from captiveportal by status->captiveportal
                          2. All traffic for all users stops
                          3. reconnect a different user to captiveportal
                          4. traffic starts flowing again for all devices
                          5. repeat this a couple of times to get the blood of users boiling
                          6. create a screencapture

                          could not attach screencapture gif to this post (>2MB)
                          see here: https://imgur.com/a/GKCesgF

                          • GertjanG
                            Gertjan @heper
                            last edited by

                            @heper

                            I was about to ask about your point 2 : are you sure ?
                            Then I watched your movie .....
                            I'm pretty confident some will now know where to look.

                            • H
                              heper @Gertjan
                              last edited by

                              @gertjan
                              I hope someone will figure this out quickly...

                              If not they'll burn me at the stake

                              • GertjanG
                                Gertjan @heper
                                last edited by

                                @heper

                                You've said that a save on the captive portal settings page made things flow again.

                                I mean, when you change your

                                3. reconnect a different user to captiveportal

                                for

                                3 save the portal settings

                                is the blood temperature going down ?

                                What if I put together a small script that runs every minute and compares the list of connected users with the list from the previous minute? If something changed, the script executes a "captive portal save".

                                Just as a work around, for the time being.
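
The comparison step of such a watchdog could be sketched as below. This is only a minimal sketch of the diff logic in plain PHP: the function name `cp_sessions_changed()` and the example addresses are hypothetical, and the action to take on a change is left as a placeholder, since nothing in this thread confirms which call reliably reapplies the limiters.

```php
<?php
/**
 * Hypothetical helper: report whether the set of connected portal users
 * differs between two snapshots (for example, taken one minute apart).
 * Each snapshot is a plain list of session identifiers (IP or MAC strings).
 */
function cp_sessions_changed(array $previous, array $current): bool
{
    // Order does not matter, so compare sorted, de-duplicated copies.
    $prev = array_values(array_unique($previous));
    $curr = array_values(array_unique($current));
    sort($prev);
    sort($curr);
    return $prev !== $curr;
}

// Example snapshots (made-up addresses).
$lastMinute = ['192.168.2.6', '192.168.2.122'];
$now        = ['192.168.2.122'];

if (cp_sessions_changed($lastMinute, $now)) {
    // Placeholder: this is where the workaround action would be triggered.
    echo "user list changed\n";
}
```

Keeping the comparison separate from the action makes it easy to swap in whatever turns out to restore traffic once that is known.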

                                • H
                                  heper @Gertjan
                                  last edited by heper

                                  @gertjan
                                  just hitting save does not fix the problem.
                                  i've mentioned this in one of the previous posts in this thread:

                                  • status -> filterreload: situation remains the same.

                                  • status -> services -> captiveportal restart: situation remains the same.

                                  • services -> CP -> edit -> save: situation remains the same

                                  • services -> CP -> edit -> change idle timeout + save: situation remains the same

                                  • services -> CP ->edit -> change 'per-user-bandwidth-restriction' setting to anything different (doesnt matter if its enable / disable / change of speed) ==> FIXED

                                  • GertjanG
                                    Gertjan @heper
                                    last edited by

                                    @heper

                                    "Save" with a minor change (like change of speed from 10 to 12 Mbits or 12 to 10) => FIXED.
                                    OK, I'll work something out.

                                    • H
                                      heper @Gertjan
                                      last edited by heper

                                      @gertjan

                                      with the information i've found today, the issue occurs on disconnect.

                                      i'm currently running an experiment:
                                      i wonder if backend code treats manual disconnects the same way as IDLE timeouts.
                                      Because if the same "bug" is triggered by IDLE-timeouts that could explain the randomness i'm experiencing.

                                      So i've currently set the CP idle-timeout to blank
                                      and i'll increase the dhcp leasetime to 8 hours or so to artificially prevent CP-clients from "disconnecting".

                                      will post results when i have them

                                      preliminary results:
                                      last 3 hours from 12:25pm -> now (3:10pm) i haven't noticed any more outages.
                                      so it appears the workaround to disable idle-timeout & increasing dhcp-lease-time has a positive effect.
                                      School's out in less than an hour - so i will continue to monitor the situation on monday.

                                      • stephenw10S
                                        stephenw10 Netgate Administrator
                                        last edited by

                                        Ah, that is a good discovery! Yeah, that has to narrow it down...

                                        • S
                                          SteveITS Galactic Empire @heper
                                          last edited by

                                          @heper said in 22.05 - CP clients have connectivity issues after x amount of time:

                                          everytime when nginx.log gets rotated & bzip'd.

                                          re: Bzip, is your 6100 running ZFS? If so you should turn off log compression:
                                          https://docs.netgate.com/pfsense/en/latest/releases/22-01_2-6-0.html#general
                                          "Log Compression for rotation of System Logs is now disabled by default for new ZFS installations as ZFS performs its own compression.

                                          Tip
                                          The best practice is to disable Log Compression for rotation of System Logs manually for not only existing ZFS installations, but also for any system with slower CPUs. This setting can be changed under Status > System Logs on the Settings tab."

                                          Pre-2.7.2/23.09: Only install packages for your version, or risk breaking it. Select your branch in System/Update/Update Settings.
                                          When upgrading, allow 10-15 minutes to restart, or more depending on packages and device speed.
                                          Upvote 👍 helpful posts!

                                          • GertjanG
                                            Gertjan @heper
                                            last edited by Gertjan

                                            @heper said in 22.05 - CP clients have connectivity issues after x amount of time:

                                            i'm currently running an experiment

                                            Not sure if I actually broke mine ....

                                            Read from bottom to top :

                                            2022-09-10 11:12:31.228623+02:00 	logportalauth 	77440 	Zone: cpzone1 - ACCEPT: 001, 78:e4:00:1f:67:05, 192.168.2.122
                                            2022-09-10 11:12:30.969566+02:00 	logportalauth 	77440 	Zone: cpzone1 - Ruleno : 2008
                                            2022-09-10 11:11:58.724810+02:00 	logportalauth 	54493 	Zone: cpzone1 - ACCEPT: 203, 94:08:53:c0:47:63, 192.168.2.6
                                            2022-09-10 11:11:58.462396+02:00 	logportalauth 	54493 	Zone: cpzone1 - Ruleno : 2008
                                            2022-09-10 11:07:22.560836+02:00 	logportalauth 	3105 	Zone: cpzone1 - ACCEPT: x, ea:1a:04:4f:cc:a1, 192.168.2.6
                                            2022-09-10 11:07:22.380192+02:00 	logportalauth 	3105 	Zone: cpzone1 - Ruleno : 2008
                                            

                                            All portal clients get assigned the same '$pipeno' 2008 ? ?

                                            What I understand :
                                            My pipe numbers :
                                            2000 (2001) For my "Allowed IP Addresses" 192.168.2.2 - AP1 - no speed limits set
                                            2002 (2003) For my "Allowed IP Addresses" 192.168.2.3 - AP2 - no speed limits set
                                            2004 (2005) For my "Allowed IP Addresses" 192.168.2.4 - AP3 - no speed limits set
                                            2006 (2007) For my "Allowed Host Name" - no speed limits set

                                            so portal users get assigned pipe numbers 2008 (+2009), 2010 (+2011), etc.

                                            note : I've added a log line in the function captiveportal_get_next_dn_ruleno() so the returned "pipeno" gets logged.

                                            If a portal user gets deleted and 'his' pipe '2008' (+'2009') gets deleted, what happens to all the other users using the same pipe ?? ( I do have some ideas )

                                            This would also explain the horror video from @heper : all users are clipped at a total of 10 Mbits => because they all use the same pipe ?
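
To make the suspected failure mode concrete, here is a toy model in plain PHP. It does not touch pf or dummynet at all; the class `PipeTable` and its methods are hypothetical and only model the bookkeeping: every session references a pipe number, and deleting a departing user's pipe affects every session still pointing at that number. Whether this is what really happens in 22.05 is not confirmed in this thread.

```php
<?php
/** Toy model (hypothetical, not pfSense code): sessions referencing limiter pipes. */
final class PipeTable
{
    /** @var array<string,int> session id => pipe number */
    private array $sessions = [];
    /** @var array<int,bool> pipe numbers that currently exist */
    private array $pipes = [];

    public function connect(string $session, int $pipeno): void
    {
        $this->sessions[$session] = $pipeno;
        $this->pipes[$pipeno] = true;
    }

    /** Disconnect a user and delete "his" pipe, as described in the post. */
    public function disconnect(string $session): void
    {
        if (!isset($this->sessions[$session])) {
            return; // unknown session, nothing to do
        }
        $pipeno = $this->sessions[$session];
        unset($this->sessions[$session], $this->pipes[$pipeno]);
    }

    /** Sessions whose pipe no longer exists. */
    public function orphaned(): array
    {
        $out = [];
        foreach ($this->sessions as $session => $pipeno) {
            if (!isset($this->pipes[$pipeno])) {
                $out[] = $session;
            }
        }
        return $out;
    }
}

// Observed behaviour: every user is assigned pipe 2008.
$shared = new PipeTable();
$shared->connect('192.168.2.6', 2008);
$shared->connect('192.168.2.122', 2008);
$shared->disconnect('192.168.2.6');
print_r($shared->orphaned());   // the remaining user is left without a pipe

// Expected behaviour: 2008, 2010, ... one pipe pair per user.
$perUser = new PipeTable();
$perUser->connect('192.168.2.6', 2008);
$perUser->connect('192.168.2.122', 2010);
$perUser->disconnect('192.168.2.6');
print_r($perUser->orphaned());  // nobody is affected
```

In the shared case a single disconnect orphans every other session, which would match "all traffic for all users stops" after one user is disconnected.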

                                            Could it be as easy as that, a GUI issue ??

                                            I've looked in my radius radacct (mysql table) where I have all my connected users' activity : way back (using 2.6.0, not 22.05) I can clearly see that every user gets their own "rulenumber", which is the "pipeno".

                                            True, I'm using FreeRadius so maybe I see smoke from another fire.

                                            edit :

                                            A script that dumps the content of the "captive portal connected user database" :

                                            All users have the same:

                                                       [1] => 2008
                                                       [pipeno] => 2008
                                            
                                            #!/usr/local/bin/php -q
                                            <?php
                                            /*
                                                    captiveportal_xxxxxx.php
                                                    No rights reserved.
                                            */

                                            require_once("/etc/inc/util.inc");
                                            require_once("/etc/inc/functions.inc");
                                            require_once("/etc/inc/captiveportal.inc");

                                            /* Dump every entry of the captive portal connected-user database, for all enabled zones */

                                            /* Is portal activated ? */
                                            if (is_array($config['captiveportal'])) {
                                                    /* For every zone, do */
                                                    foreach ($config['captiveportal'] as $cpkey => $cp) {
                                                            /* Sanity check, then : is zone enabled ? */
                                                            if (is_array($cp) && array_key_exists('enable', $cp)) {
                                                                    /* The captive portal functions use the global $cpzone to select the zone */
                                                                    $cpzone = $cpkey;
                                                                    $users = captiveportal_read_db();
                                                                    /* Print each connected user's database row */
                                                                    foreach ($users as $one) {
                                                                            print_r($one);
                                                                    }
                                                            }
                                                    }
                                            }

                                            ?>
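
Rather than reading the dump by eye, the rows could be grouped by pipe number. The following is a minimal sketch in plain PHP, assuming each row is an array with a `pipeno` key as in the `print_r()` output shown above; the function name `count_sessions_per_pipe()` and the sample rows are hypothetical.

```php
<?php
/**
 * Hypothetical helper: count how many session rows use each pipe number
 * and return only the pipes that are shared by more than one session.
 */
function count_sessions_per_pipe(array $rows): array
{
    $counts = [];
    foreach ($rows as $row) {
        if (!isset($row['pipeno'])) {
            continue; // skip rows without a pipe number
        }
        $pipeno = (int) $row['pipeno'];
        $counts[$pipeno] = ($counts[$pipeno] ?? 0) + 1;
    }
    // Keep only pipes referenced by more than one session.
    return array_filter($counts, fn (int $n): bool => $n > 1);
}

// Made-up rows mirroring the shape of the dump.
$rows = [
    [1 => 2008, 'pipeno' => 2008],
    [1 => 2008, 'pipeno' => 2008],
    [1 => 2010, 'pipeno' => 2010],
];
print_r(count_sessions_per_pipe($rows)); // reports pipe 2008 with a count of 2
```

An empty result would mean every session has its own pipe; any entry in the result is a pipe that several users depend on.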
                                            

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.