Netgate Discussion Forum

    Charon becoming unresponsive

    • jimpJ
      jimp Rebel Alliance Developer Netgate @wickeren
      last edited by

      @wickeren said in Charon becoming unresponsive:

      @jimp

      Thank you!

      so 0/0/3 means 3 sockets available, 0 in use, 0 in queue, I guess.

                   -L      Show the size of the various listen queues.  The first
                           count shows the number of unaccepted connections, the
                           second count shows the amount of unaccepted incomplete
                           connections, and the third count is the maximum number of
                           queued connections.
      

      I just now realise the Listen queue overflow messages in my start post actually might be the VICI connections. IIRC the listen queue is defined as ~ 1.5 * number of allowed connections, so the 5 already in queue awaiting acceptance makes sense now.

      Exactly, that's what started my line of questioning above.
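A quick check of the arithmetic discussed above. This is only a sketch: it assumes, as wickeren recalls, that the overflow threshold is about 1.5 times the configured listen backlog, and further assumes the kernel computes it with integer division. The helper name `overflow_threshold` is made up for illustration.

```shell
# Hypothetical helper: approximate queue length above which the kernel
# would report "Listen queue overflow", ASSUMING threshold = 3 * backlog / 2
# with integer division (an assumption, not confirmed in this thread).
overflow_threshold() {
    echo $(( 3 * $1 / 2 ))
}

# Usage: overflow_threshold 3
```

With the maximum of 3 reported by netstat -L, this gives 4. A queue length of 5 is the first value above that, which is consistent with the "5 already in queue" messages if the assumption holds.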

      Remember: Upvote with the 👍 button for any user/post you find to be helpful, informative, or deserving of recognition!

      Need help fast? Netgate Global Support!

      Do not Chat/PM for help!

      • W
        wickeren @jimp
        last edited by wickeren

        @jimp

        I have not been able to reproduce it since applying your fix in ipsec.inc, at least so far. In the meantime I'm using Zabbix to check the ipsec.log timestamp and alert me if logging has stopped.
        Is the upgrade to strongSwan 5.9.3 visible in Redmine, or do I just have to check the nightly builds?
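The log-timestamp check mentioned above can be approximated with a small shell function. This is a rough sketch, not the actual Zabbix item used in the thread: the function name `log_is_stale`, the `/var/log/ipsec.log` path, and the 10-minute threshold are all assumptions to adjust for the system at hand.

```shell
# Hypothetical helper: succeed (exit 0) if the log file is missing or has
# not been modified for more than the given number of minutes.
# $1 = log file, $2 = maximum age in minutes
log_is_stale() {
    # A missing log file is treated as stale so the check still raises an alarm.
    [ -e "$1" ] || return 0
    # find prints the path only if it was last modified more than $2 minutes ago.
    [ -n "$(find "$1" -mmin +"$2" 2>/dev/null)" ]
}

# Usage (path and threshold are assumptions):
#   log_is_stale /var/log/ipsec.log 10 && echo "ipsec.log has gone quiet"
```

A quiet log does not prove charon is hung, so the threshold should be longer than the longest normal gap between log entries on that system.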

        Best regards,

        Julian

        • jimpJ
          jimp Rebel Alliance Developer Netgate
          last edited by

          It is there now. I checked a VM running 2.6.0.a.20210825.0500 and it has strongswan-5.9.3.

          • W
            wickeren @jimp
            last edited by

            @jimp

            Great! Tnx a lot!

            Gonna try tonight and stress it a bit more to see if it's reproducible or not anymore.

            • W
              wickeren @wickeren
              last edited by wickeren

              For some reason it is easily reproducible now, even with strongSwan 5.9.3 on the latest snapshot.
              Fresh reboot, log in, open IPsec status, bring some tunnels up, and it just stops. I have to say it crashes when bringing up the exact same connection as last time. That might be a coincidence, but seeing the other side's logs might be interesting. Still no answer from them, though.
              Sometimes bringing this particular connection up and down is no problem at all.

              While it was in this state I ran the suggested commands:

              sockstat | grep -i vici

              
              root     swanctl    28047 7  stream -> /var/run/charon.vici
              root     swanctl    19770 7  stream -> /var/run/charon.vici
              root     swanctl    19222 7  stream -> /var/run/charon.vici
              root     php-fpm    22436 14 stream -> /var/run/charon.vici
              root     charon     60508 22 stream /var/run/charon.vici
              root     charon     60508 25 stream /var/run/charon.vici
              root     charon     60508 26 stream /var/run/charon.vici
              root     php-fpm    328   14 stream -> /var/run/charon.vici
              ?        ?          ?     ?  stream /var/run/charon.vici
              ?        ?          ?     ?  stream /var/run/charon.vici
              ?        ?          ?     ?  stream /var/run/charon.vici
              
              

              netstat -LAn | grep vici

              
              unix  3/0/3                            /var/run/charon.vici
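For monitoring, the three counts in that line can be pulled apart with awk. The sketch below reads `netstat -LAn` output on stdin and reports whether the vici listen queue has reached its maximum. The function name `vici_queue_state` and the "full"/"ok" labels are illustrative choices, not part of any pfSense or strongSwan tooling.

```shell
# Hypothetical helper: read `netstat -LAn` output on stdin and summarize
# the listen queue of the charon.vici socket.
vici_queue_state() {
    awk '
    /charon\.vici/ {
        for (i = 1; i <= NF; i++) {
            # The queue field looks like "unaccepted/incomplete/max".
            if ($i ~ /^[0-9]+\/[0-9]+\/[0-9]+$/) {
                split($i, q, "/")
                state = (q[1] + 0 >= q[3] + 0) ? "full" : "ok"
                printf "qlen=%d incqlen=%d maxqlen=%d %s\n", q[1], q[2], q[3], state
            }
        }
    }'
}

# Usage: netstat -LAn | vici_queue_state
```

For the line above this reports the queue as full (3 unaccepted connections against a maximum of 3), meaning charon is no longer accepting new vici connections.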
              
              

              I can provide complete IPsec logs if needed, but it stalls at the same point in the P2 initiate as in the last log snippet I provided, and then everything stops until a reboot.

              Is there anything else I could do to collect more info? I just can't believe I'm the only one able to reproduce this, and on multiple boxes at that. They are all KVM VPS boxes, if that matters at all.

              Best regards,

              Julian

              • jimpJ
                jimp Rebel Alliance Developer Netgate
                last edited by jimp

                I've got about 20 systems interconnected with IPsec tunnels that I set up when working on IPsec for 21.09/2.6.0, and thus far I haven't been able to reproduce it on any of them, no matter how much I click around on the status page. They are a mix of Plus and CE, on both hardware and VMs (Proxmox/KVM and ESX).

                Only thing I can think of is maybe something is in the middle of making a request when IPsec is reloaded at boot time, but even when I try to do that manually I can't get it to break here.

                There must be some timing, environmental, or procedural component we haven't figured out yet that is contributing.

                • W
                  wickeren @jimp
                  last edited by

                  @jimp

                  some more info:

                  ipsec status shows this:

                  con15_105{139}: ROUTED, TUNNEL, reqid 139
                  con15_105{139}: xxx.xxx.xxx.0/24|/0 === yyy.yyy.yyy.0/24|/0
                  con15_104{138}: ROUTED, TUNNEL, reqid 138
                  con15_104{138}: xxx.xxx.xxx.0/24|/0 === yyy.yyy.yyy.0/24|/0

                  However, con15 shows as disconnected on the status screen. Am I misunderstanding the output of ipsec status?

                  Trying to connect from the status screen sometimes triggers the situation, especially shortly after boot.

                  • W
                    wickeren @wickeren
                    last edited by

                    On the strongSwan GitHub they suggest:

                    Configure logging, install debug symbols (or don't strip the binaries), and attach a debugger when it happens. Then provide stacktraces for all threads.
                    Then we can say something.

                    I have to say that I haven't done this before, let alone on FreeBSD. I think I need to run pkg install pfSense-kernel-debug for debug symbols, but I'm unsure what to do next to get the stack traces.

                    • jimpJ
                      jimp Rebel Alliance Developer Netgate
                      last edited by

                      Debugging is probably a bit beyond what you'd be able to do since it would generate tons and tons of data and it would be nearly impossible to sift through it all to tell what the interesting bits are.

                      You could try truss -fp xxxxx where xxxxx is the PID of the charon process to see what happens at the time.

                      Or truss -f swanctl <blah> when attempting a manual connect/disconnect to see if it shows anything useful.

                      I kind of doubt swanctl would be helpful in this case since it's probably just blocked waiting on a request to vici to do something and never getting a response so it piles up.

                      • W
                        wickeren @jimp
                        last edited by

                        @jimp

                        Tnx again!

                        It looks like there are 2 charon processes running:

                        pgrep -f charon
                        58278
                        58059
                        

                        I wonder if that is correct. The first PID gives output in truss; the second seems empty.
                        I will write the output to a file and see if it reveals anything useful.

                        • jimpJ
                          jimp Rebel Alliance Developer Netgate
                          last edited by

                          That could definitely be a problem if there are two running. I'm not sure how that might happen, though.

                          Is it actually two and not just two threads or a worker of some kind?

                          This is normal:

                          root    35995   0.0  0.2  10752  1696  -  Is   15:38      0:00.00 daemon: /usr/local/libexec/ipsec/charon[36322] (daemon)
                          root    36322   0.0  2.0  51140 19356  -  I    15:38      1:07.25 /usr/local/libexec/ipsec/charon --use-syslog
                          

                          But if you see multiples of either of those then it could be the source of the problem.

                          There could be some kind of an odd timing issue where it's getting started twice, a race condition that depends on other environmental factors that may vary.

                          Might need to comb through the logs for both of those PIDs and then check other logs for entries around the time when the second PID started. See if there was anything logged in the system log, gateway log, etc.
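To tell the normal pair of processes apart from a real duplicate, the `ps` output can be counted by type. The following is a minimal sketch that assumes the process listing looks like the two lines shown above. The function name `charon_process_summary` is hypothetical.

```shell
# Hypothetical helper: read `ps` output on stdin and count the daemon(8)
# supervisor lines separately from the charon processes themselves.
charon_process_summary() {
    awk '
    /\/usr\/local\/libexec\/ipsec\/charon/ {
        if ($0 ~ /daemon: /) supervisors++
        else workers++
    }
    END { printf "supervisors=%d workers=%d\n", supervisors, workers }'
}

# Usage: ps auxww | charon_process_summary
```

On a healthy system matching the listing above, this should print `supervisors=1 workers=1`. A higher count for either would point to the double-start scenario described here.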

                          • L
                            lungdart @jimp
                            last edited by

                            I'm having the same issue! It took a lot of googling around to find this.

                            I'm running pfSense 22.01-RELEASE (arm) on a Netgate 3100. The IPsec tunnel is an AWS site-to-site VPN connection. I only noticed this issue after upgrading the device following the recent CVE for authenticated RCE and CSRF.

                            IPsec logs randomly stop, and the status page shows the tunnels as disconnected. Connecting them does nothing, and restarting the service does nothing. When I SSH into it, the charon command is running but the second forked daemon isn't. I have no crash dumps, and the vici sockets are all used up. Any command run against the socket fails with connection refused.

                            To recover from this state, I have to kill -9 the charon command, restart the IPsec service, and manually connect the tunnels.

                            I also can't find a pattern as to when this happens, but it does happen multiple times a week.

                            Let me know if there's any additional information I can collect for you guys about this issue, as it's quite annoying.

                            • H
                              hsb @lungdart
                              last edited by

                              I am having the same issue on pfSense Plus hosted on AWS after updating from 2.2.4 to 22.01. Every now and then all VPN tunnels stop working and the IPsec service cannot be restarted; I have to reboot the firewall to fix the issue.
                              I have 50 VPN tunnels running with various sites.

                              Any help would be appreciated.

                              • E
                                envyron.mfernandes
                                last edited by

                                Same thing here. We have about 70 tunnels, and since upgrading to 2.6.0/22.01 all IPsec tunnels go down every day.

                                We need to kill -9 charon (and then edit a tunnel to apply settings and start the tunnels) or reboot the AWS EC2 instance (pfSense 22.01).

                                We have another physical box (pfSense 2.6.0) with the same problem:

                                Apr 11 11:41:18 pfsense-bloco14-ha01 kernel: sonewconn: pcb 0xfffff80181d0fa00: Listen queue overflow: 5 already in queue awaiting acceptance (4 occurrences)
                                Apr 11 11:43:44 pfsense-bloco14-ha01 kernel: sonewconn: pcb 0xfffff80181d0fa00: Listen queue overflow: 5 already in queue awaiting acceptance (22 occurrences)
                                Apr 11 12:14:01 pfsense-bloco14-ha01 kernel: sonewconn: pcb 0xfffff80181d33700: Listen queue overflow: 5 already in queue awaiting acceptance (2 occurrences)
                                Apr 11 12:15:01 pfsense-bloco14-ha01 kernel: sonewconn: pcb 0xfffff80181d33700: Listen queue overflow: 5 already in queue awaiting acceptance (11 occurrences)
                                Apr 11 12:16:06 pfsense-bloco14-ha01 kernel: sonewconn: pcb 0xfffff80181d33700: Listen queue overflow: 5 already in queue awaiting acceptance (12 occurrences)
                                Apr 11 12:17:09 pfsense-bloco14-ha01 kernel: sonewconn: pcb 0xfffff80181d33700: Listen queue overflow: 5 already in queue awaiting acceptance (11 occurrences)
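Log lines like these can be totalled per socket to see how often each listen queue overflowed. This is a sketch under assumptions: the function name `summarize_overflows` is made up, and the log location in the usage comment is a guess that depends on the pfSense version.

```shell
# Hypothetical helper: read kernel log lines on stdin and print the total
# number of "Listen queue overflow" occurrences per pcb address.
summarize_overflows() {
    awk '
    /Listen queue overflow/ {
        pcb = ""; n = 1            # a line without "(N occurrences)" counts once
        for (i = 1; i <= NF; i++) {
            if ($i == "pcb") { pcb = $(i + 1); sub(/:$/, "", pcb) }
            if ($i == "occurrences)") { n = $(i - 1); sub(/^\(/, "", n) }
        }
        total[pcb] += n
    }
    END { for (p in total) printf "%s %d\n", p, total[p] }' | sort
}

# Usage (log path is an assumption):
#   grep sonewconn /var/log/system.log | summarize_overflows
```

For the six lines above this yields 26 occurrences for pcb 0xfffff80181d0fa00 and 36 for pcb 0xfffff80181d33700.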

                                Any news so far?

                                • J
                                  jtheisen
                                  last edited by

                                  Hi, another patient here.
                                  17 P1s with a lot more Child SAs, and at irregular intervals they all just stop working, even without any interaction at all with the WebGUI etc.
                                  Is this something that was/will be addressed in 2.7.0?

                                  • jimpJ
                                    jimp Rebel Alliance Developer Netgate
                                    last edited by

                                    Yes, it's been fixed in current development snapshots of CE 2.7.0 already, and in the most recent release of pfSense Plus software.

                                    Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.