VTI to AWS becomes unstable after some time (weeks or months), requiring restart.
Hi all --
I've now run across the same behavior on a couple of different pfSense implementations, with AWS tunnels to different regions/VPCs.
The setup is standard VTI to AWS: two tunnels up simultaneously and BGP via OpenBGPD (because FRR still doesn't seem to play nice with VTI). After a fairly long period of time (weeks to months), the tunnels start to show heavy packet loss -- 10% to 40% or more per gateway monitoring (and confirmed by real-world observation) -- and only seem to recover if I fail over to the secondary firewall in an HA pair and/or reboot the appliance.
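For context, the BGP side is nothing exotic -- roughly the following in bgpd.conf terms (the ASNs, router ID, networks, and 169.254.x.x inside-tunnel addresses are all placeholders; the real values come from the AWS-generated VPN config):

    AS 65000
    router-id 10.0.0.1
    network 10.0.0.0/16

    # one neighbor per VTI, using the AWS inside-tunnel addresses
    neighbor 169.254.10.1 {
        remote-as 64512
        descr "aws-tun-1"
        holdtime 30
    }
    neighbor 169.254.11.1 {
        remote-as 64512
        descr "aws-tun-2"
        holdtime 30
    }

    # no special filtering for the purposes of this sketch
    allow from any
    allow to any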
This is happening on a couple of pfSense CE VMs we run on-prem as well as on an Azure VM. I'm still in the information-gathering stage, as I've only just seen the behavior on a second implementation for the first time. Just wondering if this is known; I searched and didn't find anything that looked like this specific issue.
Crypto is IKEv2 with AES256/SHA256/DH14 for phase 1; the VTI phase 2 is also AES256/SHA256/DH14.
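For anyone mapping that to strongSwan proposal strings, DH group 14 is modp2048, so the generated config amounts to something like this (illustrative only -- pfSense writes this out itself from the GUI settings, and the connection name is made up):

    conn aws-vti-1
        keyexchange=ikev2
        ike=aes256-sha256-modp2048!
        esp=aes256-sha256-modp2048!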
I'm not aware of anything that matches that exactly. There was someone who posted not long ago that they observed what appeared to be a memory leak, but I don't know that they definitively correlated it to IPsec.
If it were something in the IPsec settings, it would come up within hours or days, as IKE/Child SA entries get renewed or rekeyed. If it takes weeks or months to appear, it's unlikely to be anything in your configuration.
Something like that might be solved by a newer version of strongSwan, or, if it's an OS bug, it may be fixed in a more recent version. If possible, give the 2.4.5 snapshots a try and see if they behave better. 2.4.5 is very close to release, though, so if you wait a couple of weeks you can just upgrade to the release and keep an eye on it from there.
Thanks, Jim. I'll just wait for the 2.4.5 release and see if anything changes. It's worth noting that I only see this with AWS tunnels -- I have other VTI tunnels on the same pfSense instances that don't misbehave at all.
So, I have an update: this still happens on 2.4.5, across multiple pfSense implementations. I discovered a charon process using a significant amount of CPU time, and killing it fixed the problem (presumably until the next time it happens). This particular time it occurred after three weeks of uptime. I have logs if anyone is interested; the telltale entry looks like this:
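For anyone hitting the same thing, this is roughly how I spotted it and recovered without a reboot (FreeBSD top in batch mode; the restart command assumes a stock pfSense shell):

    # look for charon threads burning CPU
    top -baHS | grep charon

    # restart the IPsec stack cleanly rather than kill -9 the process
    /usr/local/sbin/pfSsh.php playback svc restart ipsec

Killing the charon PID directly also works, but restarting the service lets pfSense bring the tunnels back up on its own.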
<30>May 3 13:05:41 charon: 11[KNL] <con1000|3> querying policy 0.0.0.0/0|/0 === 0.0.0.0/0|/0 in failed, not found
Messages like this repeated over and over at an alarming rate. They still show up when the tunnels are working well, but far less frequently.
The count dropped off a cliff when I killed the charon process.
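In case it's useful to anyone else watching for this, here's roughly how I'm counting the messages (2.4.x still uses clog circular logs, so plain grep on the file won't work; the awk field positions may need adjusting depending on how your log lines are prefixed):

    # total occurrences in the current circular log
    clog /var/log/ipsec.log | grep -c 'querying policy'

    # occurrences bucketed per minute
    clog /var/log/ipsec.log | grep 'querying policy' \
      | awk '{print $1, $2, substr($3, 1, 5)}' | uniq -c | tail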