Pot. Bug: OSPF routes via OVPN lost or not refreshed in routing table
-
Hello,
Before creating a Redmine ticket, I thought I'd bring the topic here for discussion in case we forgot to test something:
Situation:
The situation is a relatively simple setup of three locations of a customer. Said customer has a main location as well as two branch locations. Main location is running two WAN uplinks, branches only have a single connection.
All locations are connected to all others (triangle), and the two branches each have a connection to both WANs on the main site, one with a higher cost to make it unattractive for normal use. So normally the process should be: three tunnels per branch are up (one to the other branch, two to WAN1/2 of the main site). That's the case. OVPN is using simple shared-key tunnels with a /30 subnet and has no static routing configured so OSPF can do its job. Also, no OVPN interface is assigned via "assign interfaces". Simple & straightforward. All locations are running on 2.5.2-latest CE and have the newest FRR package installed.
Now the FRR/OSPF setup is pretty simple. All locations have their name, router-id and area 0. Other than having a password set up, all locations have "redistribute connected" active in the OSPF settings, as otherwise the local dial-in VPN networks wouldn't get picked up correctly. As the local networks should all get distributed with that setting, the only three interface configurations in OSPF are the three OpenVPN connections. All are set up with area 0, cost 10 (only the second WAN connection to the main site is running cost 20) and network point-to-point (even if it is recognized correctly, we tried setting it in case that changes anything).
Sample of the two locations:
    ##################### DO NOT EDIT THIS FILE! ######################
    ###################################################################
    # This file was created by an automatic configuration generator.  #
    # The contents of this file will be overwritten without warning!  #
    ###################################################################
    !
    frr defaults traditional
    hostname sch
    password xxx
    log syslog
    service integrated-vtysh-config
    !
    ip router-id 192.168.40.254
    !
    ip route x.y.z.a/32 Null0
    ip route a.b.c.d/30 Null0
    !
    interface ovpnc2
     description "ospfd: SCHW -> AUG"
     ip ospf network point-to-point
     ip ospf cost 10
     ip ospf area 0
    interface ovpnc4
     description "ospfd: SCHW -> MUEWAN2"
     ip ospf network point-to-point
     ip ospf cost 20
     ip ospf area 0
    interface ovpnc1
     description "ospfd: SCHW -> MUEWAN1"
     ip ospf network point-to-point
     ip ospf cost 10
     ip ospf area 0
    !
    router ospf
     ospf router-id 192.168.40.254
     log-adjacency-changes detail
     redistribute connected
    !
    ip prefix-list ACCEPTFILTER deny 10.12.3.0/30
    ip prefix-list ACCEPTFILTER deny 10.12.3.2/32
    ip prefix-list ACCEPTFILTER deny 10.12.2.4/30
    ip prefix-list ACCEPTFILTER deny 10.12.2.6/32
    ip prefix-list ACCEPTFILTER deny 10.12.2.0/30
    ip prefix-list ACCEPTFILTER deny 10.12.2.2/32
    ip prefix-list ACCEPTFILTER seq 10 permit any
    !
    route-map ACCEPTFILTER permit 10
     match ip address prefix-list ACCEPTFILTER
    !
    ip protocol bgp route-map ACCEPTFILTER
    !
    ip protocol ospf route-map ACCEPTFILTER
    !
    ipv6 protocol bgp route-map ACCEPTFILTER
    !
    ipv6 protocol ospf6 route-map ACCEPTFILTER
    !
    line vty
    !
    end

    ##################### DO NOT EDIT THIS FILE! ######################
    ###################################################################
    # This file was created by an automatic configuration generator.  #
    # The contents of this file will be overwritten without warning!  #
    ###################################################################
    !
    frr defaults traditional
    hostname aug
    password xxx
    log syslog
    service integrated-vtysh-config
    !
    ip router-id 192.168.1.1
    !
    ip route a.b.c.d/30 Null0
    ip route x.y.z.a/32 Null0
    !
    interface ovpnc2
     description "ospfd: AUG -> MUEWAN1"
     ip ospf cost 10
     ip ospf area 0
    interface ovpnc4
     description "ospfd: AUG -> MUEWAN2"
     ip ospf cost 20
     ip ospf area 0
    interface ovpns1
     description "ospfd: AUG -> SCHW"
     ip ospf network point-to-point
     ip ospf cost 10
     ip ospf area 0
    !
    router ospf
     ospf router-id 192.168.1.1
     log-adjacency-changes detail
     redistribute connected
    !
    ip prefix-list ACCEPTFILTER deny 10.12.1.0/30
    ip prefix-list ACCEPTFILTER deny 10.12.1.2/32
    ip prefix-list ACCEPTFILTER deny 10.12.1.4/30
    ip prefix-list ACCEPTFILTER deny 10.12.1.6/32
    ip prefix-list ACCEPTFILTER deny 10.12.3.0/30
    ip prefix-list ACCEPTFILTER deny 10.12.3.1/32
    ip prefix-list ACCEPTFILTER seq 10 permit any
    !
    route-map ACCEPTFILTER permit 10
     match ip address prefix-list ACCEPTFILTER
    !
    ip protocol bgp route-map ACCEPTFILTER
    !
    ip protocol ospf route-map ACCEPTFILTER
    !
    ipv6 protocol bgp route-map ACCEPTFILTER
    !
    ipv6 protocol ospf6 route-map ACCEPTFILTER
    !
    line vty
    !
    end
So should be pretty simple, shouldn't it?
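To make the intended cost logic of the triangle explicit, here is a minimal Python sketch that sums per-link costs and picks the cheapest path between the two branches. The link labels, the `MUE` node name for the main site, and the assumption that the main site uses the same costs on its side of each tunnel are illustrative and not taken from the posted configs. It also ignores how OSPF treats redistributed (external) routes, so it only models the tunnel costs.

```python
import heapq

# Hypothetical link table: (router_a, router_b, cost, label).
# Costs follow the thread (10 on primary tunnels, 20 on the tunnels to the
# main site's WAN2) and are ASSUMED to be symmetric on both ends.
LINKS = [
    ("SCHW", "AUG", 10, "SCHW-AUG"),
    ("SCHW", "MUE", 10, "SCHW-MUEWAN1"),
    ("SCHW", "MUE", 20, "SCHW-MUEWAN2"),
    ("AUG", "MUE", 10, "AUG-MUEWAN1"),
    ("AUG", "MUE", 20, "AUG-MUEWAN2"),
]


def cheapest_path(src, dst, down=()):
    """Return (total_cost, [link labels]) for the cheapest path, or None."""
    # Build an adjacency list from every link that is not marked as down.
    graph = {}
    for a, b, cost, label in LINKS:
        if label in down:
            continue
        graph.setdefault(a, []).append((b, cost, label))
        graph.setdefault(b, []).append((a, cost, label))

    # Plain Dijkstra: always expand the cheapest known candidate first.
    queue = [(0, src, [])]
    done = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for nxt, link_cost, label in graph.get(node, []):
            if nxt not in done:
                heapq.heappush(queue, (cost + link_cost, nxt, path + [label]))
    return None


if __name__ == "__main__":
    print(cheapest_path("SCHW", "AUG"))
    print(cheapest_path("SCHW", "AUG", down={"SCHW-AUG"}))
```

With all links up the direct tunnel wins at cost 10; with the direct tunnel marked down, the path via the main site's WAN1 costs 10 + 10 = 20, which is the failover behaviour the setup is meant to produce.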
Issue:
#1) The problem started with testing the scenario after the customer had a failed uplink on one of the WAN uplinks at the main site. The routes got shoved over to the second WAN link all right, but after the failed link came back, the routes didn't switch back to the now functional main VPN connection.
#2) OK, that may have been a problem because of the two uplinks, so we tested with the two branch offices. If their direct connection fails, the link both have to the main office should keep them connected via that one. And after stopping OVPN of that tunnel on Branch#1 (e.g. SCHW) you see the routing change from the failed OVPN link to the one to the main office. OSPF shows that route with a higher cost (20, as it's 10 to the main site and 10 to the other branch). If one then restarts the stopped OVPN link, three things CAN happen:
- In rare cases, everything works as expected: you see the VPN link come up, shortly after you can see the routes popping back to the back-again VPN interface in the Zebra and OSPF status, and in Diagnostics/Routes you can see the route for the LAN of Branch#2 shifting from the main-site OVPN link back to the direct OVPN link interface name. YAY. Except it's only in rare cases.
- In a few cases... nothing happens. The routes in Diagnostics/Routes stay the way they are and still route everything via the main office (you get ping responses with roughly double the latency they normally have, which is to be expected). And they never shift back. OSPF/Zebra recognizes the link as back up, and the OSPF as well as the Zebra routes show that routing via the direct VPN is better and fine and working! But somehow OSPF/Zebra doesn't get it down to the system-level routing table, and as that still has the old route, nothing changes.
- In quite a few cases, almost commonly, something in between happens. The routes don't stay as they are but are instead removed, and nothing is added back again. Zebra and OSPF routes still show everything you would expect them to show and everything in there reads fine and good, but the system routing table never changes.
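The three outcomes above boil down to comparing what Zebra has selected with what the system routing table actually contains. A minimal Python sketch of that comparison follows; the function name, the `ok`/`stale`/`missing` labels and the sample tables are hypothetical, and gathering the two tables from a live system is deliberately left out.

```python
def classify_routes(zebra_selected, kernel_table):
    """Compare Zebra's selected routes with the system routing table.

    Both arguments map a prefix to its outgoing interface name.
    Returns a dict mapping each Zebra prefix to "ok", "stale" or "missing".
    """
    result = {}
    for prefix, wanted_if in zebra_selected.items():
        actual_if = kernel_table.get(prefix)
        if actual_if is None:
            # Route was deleted from the system table and never re-added.
            result[prefix] = "missing"
        elif actual_if != wanted_if:
            # System table still points at the old (failover) interface.
            result[prefix] = "stale"
        else:
            result[prefix] = "ok"
    return result


if __name__ == "__main__":
    # Illustrative data only: Zebra prefers the direct tunnel (ovpnc2) again,
    # while the system table kept one route on ovpnc1 and lost the other.
    zebra = {"192.168.1.0/24": "ovpnc2", "192.168.10.0/24": "ovpnc2"}
    kernel = {"192.168.1.0/24": "ovpnc1"}
    print(classify_routes(zebra, kernel))
```

In these terms, the rare good case is all `ok`, the second case is `stale`, and the most common case is `missing`.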
For better understanding, here are some screenshots of the process:
OSPF configured with "Connected Networks":
OSPF Interface configuration of site Branch#1 (SCHW). As described, two tunnels to Main Site, one to the other Branch.
Interface config is pretty straightforward, like @jimp had it in his video of OSPF via OpenVPN:
Zebra and OSPF route entries when everything is up and running before a failure. The networks 192.168.1.0/24 and .10.0/24 are located in Branch2 and are now reachable via direct VPN connection:
The System Routing Table agrees:
so now we stop the direct VPN connection between SCHW and AUG (Branch1/2):
failover in Zebra and OSPF routes is visible, both networks are now routed through the main office:
That also gets down to system routing table:
so we restart the OVPN link to Branch again and see what happens:
Zebra and OSPF actually recognize the link ovpnc2 coming up again, but strangely have all VPN routes now listed as "inactive" in Zebra, but OSPF routes shows them "fine" with again the cost of 10 instead of 20:
But: system routing table has both networks "gone" (as in case #3 above). The networks have simply been deleted the moment the link came up but didn't get replaced with the new connection via ovpnc2 again:
--
As described, sometimes the routes "hang" in there and are still listed as ovpnc1 and seem to be kind of working (e.g. get a ping response) but more commonly the routes simply get deleted. The deletion always goes hand-in-hand with the interface coming up again.
So it seems the interface-up trigger in pfSense "cleans up" the routes, but something with FRR gets handled wrongly or in an untimely manner (often, nearly always), perhaps a timing problem or race condition, so that FRR's Zebra and/or OSPF get stuck reporting everything as correct but don't actually make any changes to the system's routing table at all.
The only workaround is to restart both the Zebra and OSPF portions of FRR; then all routes vanish for a moment and come back the way they should be (like in screenshots #4/5) with correct metric, weight, interface etc. Bringing a link down (physically pulling the cable) or stopping the OVPN process has the same result. As soon as the interface or the VPN link gateway is brought up again, Zebra & OSPF find it, but are somehow unable to make the necessary changes to the underlying system routing table.
This is now the second time I have seen the problem of "vanishing routes" with FRR (one or more routes get dropped and not reinstated, although Zebra/OSPF lists them correctly), so I'm coming here to check with you guys whether that's a bug, a dependency/race condition/timing problem or a configuration thing. As the two customers where I have encountered that specific problem had vastly different system configurations, I'm currently leaning towards it being a problem on the FRR or pfSense side of things instead of a simple config mistake.
Also, both occurrences only happened after the customers updated to 2.5.2; before that, the systems were running fine and without bigger problems. So perhaps the problem is somehow related to things that changed between 2.4.5 and 2.5.2, as well as in the FRR packages (I think it was 0.6.x to now 1.x), that may trigger it.
I'm open to questions, suggestions or anything else that helps us solve this. And if I read right, it's not only us, as I also found a few topics in the forums about FRR behaving strangely with OSPF in the 2.5.2/1.x package version.
Thanks in advance!
\jens -
-
@jegr I've also had problems with OSPF/OpenVPN using 2.5.x. Previously with 2.4.5_p1 and below, everything worked perfectly. I've seen other posts about the same issue with IPSec so I think it's more to do with FRR, not really anything to do with OpenVPN.
What I've noticed is I seem to sometimes lose routes connected via the VPN from the system routing table (still present in Zebra, but listed as inactive) after a WAN outage - specifically, it seems to occur when the WAN stops passing traffic, but the WAN interface on the pfSense remains up. I am able to get the routes back into the system routing table and get things going again by restarting Zebra.
I'd go as far as to say I can reproduce this condition (routes present, but inactive in Zebra, but not in the system routing table) 100% of the time by doing the following:
1.) Inserting a switch between my wan interface on the pfSense and my wan connection - this will keep the interface on the pfSense UP even if the WAN connection is actually down.
2.) Cold boot or reboot the pfSense with the WAN interface connected to the switch, but the actual WAN connection disconnected from the switch.
3.) After boot up completes, plug in the WAN connection to the switch.
At this point, I'd expect the tunnel to come up (it does), OSPF to converge, and the routes to eventually be placed into the system routing table. But this doesn't happen until I restart Zebra.
I will retest again next week, but now you have me wondering if I were to then disconnect the WAN interface of the pfSense from the switch and then plug it back in, if this would also get things going again.
In either case, it seems there's a condition where something is not triggered appropriately. Just losing the ability to pass the traffic and then regaining it can leave things in a broken state as far as VPN connected routes not being placed back into the system routing table.
If there's something I can share from logs that would be helpful to determining what's going on, I'd love to get to the bottom of this one as well.
-
@jegr Have you had any more luck with this? I've noticed some weird issues lately using OSPF over redundant OpenVPN tunnels. After digging into it some more, I'm seeing pretty much the same thing you are seeing. There is no failover after a tunnel drops, even though OSPF shows the correct route! The route is just showing in Zebra as "inactive." So frustrating!
-
@wblanton I don't believe this issue is fixed. My best suggestion would be to try your configuration while leaving the OSPF interface option for "Accept Filter - Prevent routes for this interface subnet or IP address from being distributed by OSPF (Suggested for Multi-WAN environments)" UNCHECKED for all OSPF interfaces. Logic and advice from support/Jim Pingle hangouts had me checking this on both my tunnel interfaces and my internal LAN interfaces, but in doing so, I was getting the routes periodically set to Inactive in Zebra and they were not placed into the system routing table for use so it was not working well at all on any version post 2.4.5_p1 (2.5.0, 2.5.1, 2.5.2 or 2.6).
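For readers unsure what that option does: it appears to correspond to the `ACCEPTFILTER` prefix list in the generated configuration posted earlier in the thread. Below is a minimal Python sketch of how such a list would evaluate a route, assuming the entries are checked in the order shown, that an entry without `ge`/`le` matches only the exact prefix, and that a list ends with an implicit deny. The rule table is copied from the SCHW config; the function name is hypothetical.

```python
import ipaddress

# Entries copied from the SCHW config in the first post; evaluation order is
# ASSUMED to be the order shown there.
ACCEPTFILTER = [
    ("deny", "10.12.3.0/30"),
    ("deny", "10.12.3.2/32"),
    ("deny", "10.12.2.4/30"),
    ("deny", "10.12.2.6/32"),
    ("deny", "10.12.2.0/30"),
    ("deny", "10.12.2.2/32"),
    ("permit", "any"),
]


def accept_filter_allows(prefix, rules=ACCEPTFILTER):
    """Return True if the first matching rule permits the given prefix."""
    net = ipaddress.ip_network(prefix)
    for action, pattern in rules:
        # Without ge/le, an entry matches only the exact prefix and length.
        if pattern == "any" or ipaddress.ip_network(pattern) == net:
            return action == "permit"
    return False  # implicit deny when nothing matches


if __name__ == "__main__":
    for candidate in ("10.12.3.0/30", "192.168.1.0/24"):
        print(candidate, accept_filter_allows(candidate))
```

Under these assumptions only the tunnel subnets and tunnel endpoint addresses are rejected, while remote LANs such as 192.168.1.0/24 fall through to `permit any`.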
In my opinion, this is an FRR/OSPF issue and nothing to do with OpenVPN or IPSEC VTI as I've had the behavior occur using either. Had a big support case open with lots of back and forth and never really got it resolved. They had me trying development releases, switching from OpenVPN to IPSec, etc. Kind of seemed like I was dealing with level 1 and they were reluctant to get anyone with real smarts involved. Annoying.
I'm following this bug, but no action in a while on it. https://redmine.pfsense.org/issues/11836
-
@mdomnis I have since upgraded to 22.01 with FRR version 1.1.1_6. In my preliminary testing, the routes seem to be working closer to what is expected. I still have a weird issue where sometimes the neighbors don't like to peer fully and I have to force restart FRR, but from some quick tests, it looks like at least the route is being added to the table correctly. For now at least.