Planning IPSEC changes - dynamic routing questions
-
Quick Summary: I need feedback on what will work to switch between multiple WAN providers at both ends of tunnels between 2 sites with extremely short to no downtime. Dynamic DNS is not fast enough.
The details:
We are a medium-sized business that currently has a datacenter (DC) with several other locations in a hub-and-spoke configuration, using static IPSEC tunnels to connect each site. All sites use pfSense. Each site has a tunnel back to the DC, and each site connects with at least one other site. Routing between sites is static, so we can connect to any site from the DC and from any other site it is connected to. We can NOT currently route traffic from one site to a destination site through an interconnecting site. Some sites have multiple WAN providers, and some have only one, depending on availability. We use a dynamic DNS name to automatically switch tunnels between the DC and one site when a provider goes down for that site. It mostly works, but switching is slow.
We are now going to have at least one remote facility that operates 24x7 and needs to maintain connectivity to the DC at all times. The main goal is a high availability (HA) setup with 2 routers at both ends of a tunnel that maintains connectivity with minimal downtime during transitions if any single point drops. Going down for 5-10 minutes while DNS updates is too long. I'm looking for feedback on which dynamic routing solution you would recommend changing to. I've tested a couple of possible solutions (VTI and tinc), but it was on an earlier version of pfSense (2.4.5 or 2.4.6, I think). I'm hoping one of those has improved, or something else is possible.
An example scenario: we have the DC with WAN providers A and B, and we have a site with its own WAN providers A and B. With VTI, for example, we could have DC WAN A create a tunnel with site WAN A, and DC WAN B create a tunnel with site WAN B. If the DC's WAN B goes down at the same time that the site's WAN A goes down, my understanding is that both tunnels would fail at the same time, even though both sites still have a working WAN connection.
Is there a configuration or tool that I'm overlooking that will connect any WAN at either location, so that the tunnel is maintained or recreated with minimal downtime even if each site loses one of its WAN providers, as in the example above?
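To make the example scenario concrete, here is a minimal sketch that enumerates which tunnels survive when each end loses one WAN. It compares the paired layout described above (A to A, B to B) with a full mesh in which every DC WAN has a tunnel to every site WAN. The labels and helper names are hypothetical, and the sketch only models reachability of the tunnel endpoints; it says nothing about how pfSense would build or route over those tunnels.

```python
from itertools import product

# Hypothetical labels: each end has two WAN links, "A" and "B".
WANS = ("A", "B")

# Paired layout from the example: DC A <-> site A, DC B <-> site B.
PAIRED = [("A", "A"), ("B", "B")]
# Full mesh (an assumption for comparison): every DC WAN to every site WAN.
FULL_MESH = list(product(WANS, WANS))


def surviving(tunnels, dc_down, site_down):
    """Return the (dc_wan, site_wan) tunnels whose two endpoints are both up."""
    return [(dc, site) for dc, site in tunnels
            if dc not in dc_down and site not in site_down]


def single_failure_cases():
    """Every case in which each end loses exactly one WAN."""
    return [({dc}, {site}) for dc, site in product(WANS, WANS)]


if __name__ == "__main__":
    for dc_down, site_down in single_failure_cases():
        paired = surviving(PAIRED, dc_down, site_down)
        mesh = surviving(FULL_MESH, dc_down, site_down)
        print(f"DC loses {sorted(dc_down)}, site loses {sorted(site_down)}: "
              f"paired={paired} mesh={mesh}")
```

In this model the paired layout has no surviving tunnel whenever the two ends lose different providers, while the full mesh always keeps exactly one tunnel up. A routing protocol would still have to move traffic onto that remaining tunnel.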
-
@thale
We have a pair of pfSense devices (v2.5.2) in a CARP setup at our corporate network with dual internet lines.
We have 4 IPsec VPN tunnels to a cloud vendor which were set up using Routed (VTI) mode, 2 tunnels on each corp-side internet line.
We run FRR BGP, which seamlessly fails traffic over between the tunnels.
This seems to work very well and we have had zero issues with that setup so far.
-
@to2020 Thanks for the reply. How long does your BGP implementation take to failover in the event of an outage affecting your primary tunnel?
-
@thale
I have seen a single ping packet time out during route convergence. I have not tested this specifically, but real-time sessions (voice/video streams) will possibly get dropped.
-
@to2020 Do you change the BGP timers or other settings to get that sort of convergence time? BGP is not usually quick to converge. I've been doing some testing in a lab (which is unfortunately limited to only 3 routers, so I can't really test CARP), and when I fail the WAN interface hosting the currently routed VTI tunnel, it takes something like 2.5 minutes to start routing over a different tunnel. I haven't played with the timers yet, but I guess I assumed it wouldn't converge quickly enough to meet my boss's expectations.
I'm trying to get OSPF working now, since it converges faster. It may just be me, but in a couple of tests over the years I've always had trouble with OSPF, especially with getting neighbors to start talking. So if we could get BGP to work quickly enough, I'd love to use that instead.
-
@thale
Sorry for the late reply. I plan to perform some failover testing this coming Sat. I'll post back with my findings after that.
-
@Thale
I performed some tests during the weekend.
On the corp side, I had a Windows 10 machine connected with continuous ping to a device in my cloud vendor network.
When I pulled the network cable of one of my ISP routers, I lost 6-7 ping packets before traffic was flowing over one of the tunnels on my secondary ISP.
I also tested with hitting Save at the same time on a Word document on the cloud file server. While Word appeared hung during the same period, it recovered and saved the document once traffic started flowing over the other tunnel.
On the corp side pfSense under BGP > Neighbors, each Neighbor is configured with the following Timers under Basic Options:
10sec Keep Alive Interval
30sec Hold Time
30sec Connect Timer
Regards,
Thomas
-
@to2020 Thanks! It's good to get some numbers around what other people are seeing, and what your interval & timer settings are at when you see those numbers. I appreciate your help and for sharing your results!
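For a rough sense of how those timers relate to failover time, here is a small sketch of the hold-timer arithmetic. It assumes the peer goes silent without any faster signal (no link-down event, no IPsec dead peer detection) and ignores timer jitter, so treat it as a back-of-the-envelope bound rather than a prediction. The function name is hypothetical, and the 60/180 values are the commonly cited BGP defaults, which I'm assuming rather than reading from my lab config.

```python
def detection_window(keepalive_s, hold_s):
    """Return (shortest, longest) seconds before a silent peer is declared down.

    The hold timer restarts each time a KEEPALIVE or UPDATE arrives, so the
    session drops hold_s seconds after the last message received. A failure
    right after a message costs the full hold time; a failure just before the
    next keepalive was due costs roughly hold_s - keepalive_s.
    """
    if keepalive_s <= 0 or hold_s < keepalive_s:
        raise ValueError("expected 0 < keepalive_s <= hold_s")
    return hold_s - keepalive_s, hold_s


if __name__ == "__main__":
    # Timers posted above: 10 s keepalive, 30 s hold.
    print("10/30 :", detection_window(10, 30))
    # Assumed defaults: 60 s keepalive, 180 s hold.
    print("60/180:", detection_window(60, 180))
```

With 10/30 the window is 20 to 30 seconds; with the assumed 60/180 defaults it is 120 to 180 seconds, which would be in line with the roughly 2.5 minutes I saw in the lab.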
-
@thale
You are most welcome and good luck with your implementation.
Happy to provide further feedback on my setup if you have specific questions.