Charon becoming unresponsive
-
I see strongSwan 5.9.3 in FreeBSD ports but we haven't pulled it in yet. It is possible that bug is related though that may not be the same situation you are hitting.
Is there anything odd about the negotiation in the remote logs when that happened?
I don't see anything there that stands out as being a problem, but I know the
swanctl
commands to initiate and terminate usually end up staying active until the negotiation is complete, which may hold onto a VICI slot longer than expected.The status page should probably run it with
--force
or a-timeout
value now that I'm looking at the code again. -
-
Is there anything odd about the negotiation in the remote logs when that happened?
Not that Iām aware of, but I have to ask if they are willing to provide any logging if they have it, as I donāt manage the remote end. But as I have seen this happening on multiple PFsense boxes already with all kinds of connection with different brands on the other side, I doubt if this is coming from the other side or the logging from the other side will make it more clear.
It's worth a try though.I don't see anything there that stands out as being a problem, but I know the
swanctl
commands to initiate and terminate usually end up staying active until the negotiation is complete, which may hold onto a VICI slot longer than expected.The status page should probably run it with
--force
or a-timeout
value now that I'm looking at the code again.If that would prevent this situation from happening it would be great. Still wonder in wat state the service is. It is still running (unfortunately, otherwise service watchdog would have catched it) but is not responsive in any way. Do you think a --force or -timeout would prevent this state?
Anything other for me to do now? More logging or higher log levels wonāt be at any help, I think?
-
The new version of strongSwan will be in builds maybe as early as tomorrow, along with the code change I just put in for the initiate timeout and forced terminate. So I'd keep tracking snapshots this week and see if you can still replicate it in a couple days.
You could try increasing the log level on the configuration section but it's still not clear what, if anything, is happening there that could be adding to the problem.
-
Can I track progress on Redmine? Or is it not yet in there?
I see. Tnx a lot for your effort!Best regards,
Julian
-
Is there a way to see how many VICI sockets are currently in use? Limit is 3, right?
-
You can kind of get an idea by looking at:
: sockstat | grep -i vici root swanctl 60061 7 stream -> /var/run/charon.vici
There will be some other entries from
charon
of its own but those are for serving the info not requesting.You should also be able to check for a backlog of requests by looking at this:
: netstat -LAn | grep vici unix 0/0/3 /var/run/charon.vici
-
Thank you!
so 0/0/3 means 3 sockets available, 0 in use, 0 in queue, I guess.
I just now realise the Listen queue overflow messages in my start post actually might be the VICI connections. IIRC the listen queue is defined as ~ 1.5 * number of allowed connections, so the 5 already in queue awaiting acceptance makes sense now.
-
@wickeren said in Charon becoming unresponsive:
Thank you!
so 0/0/3 means 3 sockets available, 0 in use, 0 in queue, I guess.
-L Show the size of the various listen queues. The first count shows the number of unaccepted connections, the second count shows the amount of unaccepted incomplete connections, and the third count is the maximum number of queued connections.
I just now realise the Listen queue overflow messages in my start post actually might be the VICI connections. IIRC the listen queue is defined as ~ 1.5 * number of allowed connections, so the 5 already in queue awaiting acceptance makes sense now.
Exactly, that's what started my line of questioning above.
-
Have not been able to reproduce after using your fix in ipsec.inc, at least so far. In the meantime I'm using zabbix now to check the ipsec.log timestamp to see if logging has stopped and alarm me.
Is the upgrade to Strongswan 5.9.3 visible in Redmine or do I just have to check he nightly builds?Best regards,
Julian
-
It is there now. I checked a vm running
2.6.0.a.20210825.0500
and it hasstrongswan-5.9.3
-
Great! Tnx a lot!
Gonna try tonight and stress it a bit more to see if it's reproducible or not anymore.
-
For some reason easy reproducible now, even with Strongswan 5.9.3 on latest snapshot.
Fresh reboot, login, Ipsec status, bring some tunnels up and it just stops. Have to say it crashes on bringing up the exact same connection as last time. Might be a coincidence, but seeing others sides logs might be interesting. Still no answer from there though.
Sometimes bringing up an down this particular connection is no problem at all.When it was this state I have run the advised commands:
sockstat | grep -i vici
root Ā Ā swanctl Ā Ā 28047 7 Ā stream -> /var/run/charon.vici root Ā Ā swanctl Ā Ā 19770 7 Ā stream -> /var/run/charon.vici root Ā Ā swanctl Ā Ā 19222 7 Ā stream -> /var/run/charon.vici root Ā Ā php-fpm Ā Ā 22436 14 stream -> /var/run/charon.vici root Ā Ā charon Ā Ā 60508 22 stream /var/run/charon.vici root Ā Ā charon Ā Ā 60508 25 stream /var/run/charon.vici root Ā Ā charon Ā Ā 60508 26 stream /var/run/charon.vici root Ā Ā php-fpm Ā Ā 328 Ā 14 stream -> /var/run/charon.vici ? Ā Ā Ā Ā ? Ā Ā Ā Ā Ā ? Ā Ā ? Ā stream /var/run/charon.vici ? Ā Ā Ā Ā ? Ā Ā Ā Ā Ā ? Ā Ā ? Ā stream /var/run/charon.vici ? Ā Ā Ā Ā ? Ā Ā Ā Ā Ā ? Ā Ā ? Ā stream /var/run/charon.vici
netstat -LAn | grep vici
unix Ā 3/0/3 Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā /var/run/charon.vici
Can provide complete IPSEC logs if needed, but it stalls at the same point in P2 initiate as last provided log snippet and then it all ends until rebooting.
Any other thing I could do to collect some more info? I just canāt believe to be the only one being able to reproduce this, on even multiple boxes. It all are KVM VPS boxes if that might matter at all.
Best regards,
Julian
-
I've got about 20 systems interconnected with IPsec tunnels that I setup when working on IPsec for 21.09/2.6.0 and thus far I haven't been able to reproduce it on any of them, no matter how much I click around on the status page. They are a mix of Plus and CE both on hardware and VMs (Proxmox/KVM and ESX)
Only thing I can think of is maybe something is in the middle of making a request when IPsec is reloaded at boot time, but even when I try to do that manually I can't get it to break here.
There must be some timing, environmental, or procedural component we haven't figured out yet that is contributing.
-
some more info:
ipsec status shows this:
con15_105{139}: ROUTED, TUNNEL, reqid 139
con15_105{139}: xxx.xxx.xxx.0/24|/0 === yyy.yyy.yyy.0/24|/0
con15_104{138}: ROUTED, TUNNEL, reqid 138
con15_104{138}: xxx.xxx.xxx.0/24|/0 === yyy.yyy.yyy.0/24|/0However, con15 show disconnected on the status screen. Do I misunderstand the info from ipsec status?
Trying to connect from status screen sometimes triggers the situation, especially shorly after boot.
-
On Strongswan github they suggest:
Configure logging, install debug symbols (or don't strip the binaries), and attach a debugger when it happens. Then provide stacktraces for all threads.
Then we can say something.I have to say that I havenāt done this before, let alone on freebsd. I think I need to run pkg install pfSense-kernel-debug for debug symbols but Iām unsure what to do next to get the stacktraces.
-
Debugging is probably a bit beyond what you'd be able to do since it would generate tons and tons of data and it would be nearly impossible to sift through it all to tell what the interesting bits are.
You could try
truss -fp xxxxx
wherexxxxx
is the PID of the charon process to see what happens at the time.Or
truss -f swanctl <blah>
when attempting a manual connect/disconnect to see if it shows anything useful.I kind of doubt
swanctl
would be helpful in this case since it's probably just blocked waiting on a request to vici to do something and never getting a response so it piles up. -
Tnx again!
It looks like there are 2 charon processes running:
pgrep -f charon 58278 58059
Wonder if that is correct. First PID gives output in truss, second seems empty.
Will write ouput to a file and see if it reveals anything usefull. -
That could definitely be a problem if there are two running. I'm not sure how that might happen, though.
Is it actually two and not just two threads or a worker of some kind?
This is normal:
root 35995 0.0 0.2 10752 1696 - Is 15:38 0:00.00 daemon: /usr/local/libexec/ipsec/charon[36322] (daemon) root 36322 0.0 2.0 51140 19356 - I 15:38 1:07.25 /usr/local/libexec/ipsec/charon --use-syslog
But if you see multiples of either of those then it could be the source of the problem.
There could be some kind of an odd timing issue where it's getting started twice, a race condition that depends on other environmental factors that may vary.
Might need to comb through the logs for both of those PIDs and then check other logs for entries around the time when the second PID started. See if there was anything logged in the system log, gateway log, etc.
-
I'm having the same issue! It took a lot of googling around to find this.
I'm running PFSense 22.01-RELEASE (arm) on a Netgate 3100. The IPSec tunnel is an AWS site-to-site VPN connection. I've only noticed this issue after upgrading the device after the recent CVE for authenticated RCE and CSRF.
IPsec logs will randomly stop, and the status page shows the tunnels as disconnected. Connecting them does nothing, restarting the service does nothing. When I SSH into it, The charon command is running but the second forked daemon isn't. I have no crash dumps, and the vici sockets are all used up. Running any commands against the socket fail as I get connections refused.
To recover from this state. I have to kill -9 the charon command, and restart the IPSec service and manually connect the tunnel connections.
I also can't find a pattern as to when this happens, but it does happen multiple times a week.
Let me know if there's any additional information I can collect for you guys about this issue, as it's quite annoying.