Changes to IPsec tunnels leads to routing instability
-
Wondering if anyone has come across this. I have filed it as a Redmine issue: https://redmine.pfsense.org/issues/14483#change-68058
This potential issue is enough to make me explore other options.
I have a hub with 4x IPsec VTI tunnels. Each tunnel is running eBGP. The problem I discovered is that when I make any change to a single tunnel (it doesn't matter which one, and it doesn't matter whether it is a description change or a Phase 1 parameter change), all 4x tunnels briefly lose routing once I apply the change. BGP flaps. From my experience with other platforms, this should not happen at all.
It's so bad that if I bring up another tunnel and apply changes, routing breaks for all tunnels briefly and then comes back up. This is really unreasonable.
At first I thought this was a gateway monitoring action, but those have been disabled.
Then I thought it might be a package issue, so I disabled pfBlocker [I've had issues with this package impacting the reliability of the system in the past].
Today, one of the things I noticed when making an interface change is the following in the logs:
-
@michmoor
I suppose you use the FRR package. I noticed this behavior some years ago. It was never solved, and it happens with OSPF too, which makes the dynamic routing packages on pfSense really worthless.
It's not clear to me whether the problem is on the FRR routing software side (that is beta software) or just in the way pfSense implements routing table changes. We will see whether your Redmine ticket gets any attention, but I don't think it will.
Read this and some other topics: https://forum.netgate.com/topic/145653/ffr-restart-on-configuration-changes?_=1687120051977
-
@pete35 This is VERY frustrating at this point. The Redmine was closed, and you can tell no thought was given to it other than the following "fix":
"This is part of the reason why the option Ignore IPsec Restart in FRR exists."
That option is enabled for me...
The ENTIRE POINT of opening the Redmine is that there is something broken in the way FRR or IPsec is handled within pfSense.
I just did a description change on one of my tunnels. As you can see, the routing peers on all my tunnels flapped.
As I also mentioned in the ticket, I informed Netgate that I am trying this on another FreeBSD system (*sense), which I will not name, and this issue does not exist there. This is specific to pfSense.
To reject the ticket without even mentioning whether you were able to replicate it is really bad form. I've worked with Marcos (yes, calling him out by name) and he's professional, so I'm not sure why the ticket was dismissed in this way.
@stephenw10 can you assist here? Can you see if you are able to replicate this issue in a lab? If possible, re-open the Redmine.

sh ip bgp summary

IPv4 Unicast Summary:
BGP router identifier 192.168.50.254, local AS number 65001 vrf-id 0
BGP table version 981
RIB entries 50, using 9600 bytes of memory
Peers 4, using 57 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor        V     AS  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd  PfxSnt
10.6.106.6      4  31898     3286     3426       0    0     0  00:00:36             2       6
10.6.106.10     4  31898     3218     3334       0    0     0  00:00:37             2       6
10.6.106.2      4  65520    14803    15170       0    0     0  00:00:50             2       6
172.28.0.5      4  65002    14665    14676       0    0     0  00:00:48             1       2
What happens when I make a change in IPsec:
Setting enabled
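For anyone who wants to spot this kind of flap without eyeballing the output, here is a minimal Python sketch that scans `sh ip bgp summary` text for peers whose Up/Down timer is very short. It is not part of pfSense or FRR; the function names and the 120-second threshold are assumptions made up for illustration, and it only understands the `hh:mm:ss` uptime format seen in the output above (anything else is treated as a long-lived session and skipped).

```python
import re

# Peer rows copied from the summary output posted above.
SAMPLE = """\
Neighbor        V     AS  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd  PfxSnt
10.6.106.6      4  31898     3286     3426       0    0     0  00:00:36             2       6
10.6.106.10     4  31898     3218     3334       0    0     0  00:00:37             2       6
10.6.106.2      4  65520    14803    15170       0    0     0  00:00:50             2       6
172.28.0.5      4  65002    14665    14676       0    0     0  00:00:48             1       2
"""

IPV4 = re.compile(r"\d{1,3}(\.\d{1,3}){3}")
HMS = re.compile(r"(\d+):(\d{2}):(\d{2})")


def uptime_seconds(text):
    """Convert an hh:mm:ss Up/Down value to seconds.

    Returns None for any other format (assumption: those are sessions
    that have been up long enough not to matter here).
    """
    match = HMS.fullmatch(text)
    if match is None:
        return None
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def recently_reset_peers(summary, threshold=120):
    """Return (neighbor, uptime_in_seconds) for peers up less than `threshold` seconds."""
    peers = []
    for line in summary.splitlines():
        fields = line.split()
        # Peer rows start with an IPv4 address; Up/Down is the 9th column.
        if len(fields) < 10 or not IPV4.fullmatch(fields[0]):
            continue
        uptime = uptime_seconds(fields[8])
        if uptime is not None and uptime < threshold:
            peers.append((fields[0], uptime))
    return peers


if __name__ == "__main__":
    for neighbor, uptime in recently_reset_peers(SAMPLE):
        print(f"{neighbor} re-established {uptime}s ago")
```

Run against the output above, all four neighbors fall under the threshold, which matches what the post describes: every session reset at roughly the same moment after a change to a single tunnel.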
-
@pete35 said in Changes to IPsec tunnels leads to routing instability:
or just the way Pfsense implements routing table changes
I can confirm that on a lab OPNsense machine this problem is not present. There is something in the way pfSense implements FRR or IPsec.
-
The way the tickets for this issue were handled was to be expected. There are more open issues with dynamic routing. If you can go with OPNsense, just go ahead.
-
@pete35 FRR has been stable for me throughout my deployments, both personal (home lab) and professional as an MSP. Using BGP on the edge is fine.
I have noticed that certain configurations do not work well, such as IPsec with routing. This is a critical deployment where swapping in another platform is a headache but not impossible.
I'm honestly still shocked that the Redmine got rejected, but it then got reopened with an acknowledgment that this could be replicated.
-
@michmoor
Yes, there is hope. But on the other side, if you look at the open bug tickets on Redmine, there are about 20 that are older than 3 years. You need some patience.
jimp mentioned not changing anything during work hours ... a practical solution but nevertheless very unsatisfactory. This stops me from implementing dynamic routing with pfSense.
-
@pete35 I can't believe the solution is to not make changes during business hours, which is so confusing.
In the Redmine I submitted, the proposed solution makes it seem as though a simple checkbox would solve all my issues, but that checkbox doesn't do anything from what I can tell. As a test I reverted my test system back to 22.05 and saw the same issues. I'm still confused as to what the checkbox does in FRR.
This is about money for me at the end of the day. This is a hub and spoke topology where I am doing a rip and replace. I spec'd out this project for an 8200 at the hub. Spokes get migrated over the next several months. This was just a POC (proof of concept) I was building, but now I have to go back to the client and spec something completely different, which is annoying and embarrassing.
I wish I had known that FRR in an enterprise setting does not work. The blame falls completely on Netgate here. The package is broken with no solution, but it's included. Why? No bias here: if another vendor pulled this stunt I would have the same harsh words for them as well.
I can't even have a very basic IPsec/BGP tunnel running.
I was looking at TNSR for one job, but I have a big confidence issue with this product right now. There is no telling what works and what doesn't without relying on good people like @pete35 to provide a forum link. A forum link....
A firewall needs to route. This is as basic as it gets. As an implementor, I need to be able to trust the knobs that are presented to me in the GUI. This is a reevaluation point for me and it's really unfortunate.
-
@michmoor
It should be possible to load and run OPNsense on a Netgate 8200. If that works for you, your POC might work too.
If it doesn't work, you can always reload pfSense onto the 8200 and sell the unit.
For routing, TNSR may be possible, but as far as I can see it comes with much higher yearly costs.
-
@pete35 Update on this.
I secured this contract. We're using a pair of Juniper SRX 380s, as we got 10Gbps dual DIA circuits.
I am posting lessons learned for posterity's sake and as a cautionary tale for others who search for something similar to this.
-
pfSense cannot perform advanced routing in a stable way. FRR needing to be reloaded for changes is a problem that I do not blame pfSense for. That's just the way the package currently functions, but it should still be taken into account. I have over 15 sites in a hub and spoke setup. If I update FRR, I'm breaking connectivity for all sites. When won't I have to make a route map change? Add a new BGP neighbor? There is no maintenance window in the world in which a company would approve a global outage. There are workarounds for this, I suppose, but they are not worth exploring.
-
As I outlined in my Redmine, there is an issue with IPsec that impacts FRR in a negative way. The problem isn't with FRR.
If there is a need to do routing over IPsec (obviously utilizing VTIs), then pick another firewall. Imagine you have a datacenter terminating over 50 IPsec tunnels. All you do is update the IPsec configuration, or even onboard another site, and click apply. You just broke routing within the enterprise. That's absolutely insane and scary. This is something that can be replicated by TAC, per the Redmine. I can't in good conscience recommend deploying pfSense in that situation.
I got extremely lucky in that my client, having already paid thousands of dollars for the 6100s, was willing to make the sacrifice of getting the Juniper head-end SRXs to manage all of this.
I really do advise anyone reading this to consider something else if your solution requires dynamic routing with IPsec. Beware.
-
Lastly, there are lots of things that pfSense gets right. I will continue to deploy it in much less advanced scenarios, but I cannot use it going forward on topologies that require high availability with routing. The software just can't do it. This was indeed an eye-opener for me, but we all learn the hard way.
-