VTI to AWS becomes unstable after some time (weeks or months), requiring restart.
Hi all --
I've now run across the same behavior on a couple of different pfSense implementations, with AWS tunnels to different regions/VPCs.
The setup is standard VTI to AWS: two tunnels up simultaneously and BGP via OpenBGPD (because FRR still doesn't seem to play nice with VTI). After a fairly long period of time (weeks to months), the tunnels start to show heavy packet loss -- 10% to 40% or more per gateway monitoring (and confirmed by real-world observation) -- and only seem to recover if I fail over to the secondary firewall in an HA pair and/or reboot the appliance.
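For context, the BGP side is nothing exotic -- roughly the following in bgpd.conf terms (the ASNs, router ID, networks, and 169.254.x.x inside-tunnel addresses are all placeholders; the real values come from the AWS-generated VPN config):

    AS 65000
    router-id 10.0.0.1
    network 10.0.0.0/16

    # one neighbor per VTI, using the AWS inside-tunnel addresses
    neighbor 169.254.10.1 {
        remote-as 64512
        descr "aws-tun-1"
        holdtime 30
    }
    neighbor 169.254.11.1 {
        remote-as 64512
        descr "aws-tun-2"
        holdtime 30
    }

    # no special filtering for the purposes of this sketch
    allow from any
    allow to any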
This is happening on a couple of pfSense CE VMs we run on-prem as well as on an Azure VM. I'm still in the information-gathering stage, as I've only just seen the behavior on a second implementation for the first time. Just wondering if this is known; I searched and didn't find anything that looked like this specific issue.
Crypto is IKEv2 with AES256/SHA256/DH14 for phase 1; the VTI phase 2 is also AES256/SHA256/DH14.
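For anyone mapping that to strongSwan proposal strings, DH group 14 is modp2048, so the generated config amounts to something like this (illustrative only -- pfSense writes this out itself from the GUI settings, and the connection name is made up):

    conn aws-vti-1
        keyexchange=ikev2
        ike=aes256-sha256-modp2048!
        esp=aes256-sha256-modp2048!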
I'm not aware of anything that matches that exactly. There was someone who posted not long ago that they observed what appeared to be a memory leak, but I don't know that they definitively correlated it to IPsec.
If it were something in the IPsec settings, it would come up within hours or days, as IKE/Child SA entries get renewed or rekeyed. If it takes weeks or months to appear, it's unlikely to be anything in your configuration.
Something like that might be solved by a newer version of strongSwan, or, if it's an OS bug, it may be fixed in a more recent version. If possible, give the 2.4.5 snapshots a try and see if they behave better. 2.4.5 is very close to release, though, so if you wait a couple of weeks you can just upgrade to the release and keep an eye on it from there.
Thanks, Jim. I'll just wait for the 2.4.5 release and see if anything changes. It's worth noting that I only see this with AWS tunnels -- I have other VTI tunnels on the same pfSense instances that don't misbehave at all.
So, I have an update: this still happens on 2.4.5, across multiple pfSense implementations. I discovered a charon process using a significant amount of CPU time, and killing it fixed the problem (presumably until the next time it happens). This particular time it occurred after three weeks of uptime. I have logs if anyone is interested; the telltale entry looks like this:
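For anyone hitting the same thing, this is roughly how I spotted it and recovered without a reboot (FreeBSD top in batch mode; the restart command assumes a stock pfSense shell):

    # look for charon threads burning CPU
    top -baHS | grep charon

    # restart the IPsec stack cleanly rather than kill -9 the process
    /usr/local/sbin/pfSsh.php playback svc restart ipsec

Killing the charon PID directly also works, but restarting the service lets pfSense bring the tunnels back up on its own.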
<30>May 3 13:05:41 charon: 11[KNL] <con1000|3> querying policy 0.0.0.0/0|/0 === 0.0.0.0/0|/0 in failed, not found
Messages like this repeated over and over at an alarming rate. They still show up when the tunnels are working well, but far less frequently.
The count dropped off a cliff when I killed the charon process.
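In case it's useful to anyone else watching for this, here's roughly how I'm counting the messages (2.4.x still uses clog circular logs, so plain grep on the file won't work; the awk field positions may need adjusting depending on how your log lines are prefixed):

    # total occurrences in the current circular log
    clog /var/log/ipsec.log | grep -c 'querying policy'

    # occurrences bucketed per minute
    clog /var/log/ipsec.log | grep 'querying policy' \
      | awk '{print $1, $2, substr($3, 1, 5)}' | uniq -c | tail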