Found 6 issues with FRR/OSPF in pfSense 2.5.1
-
#1. SPF algorithm firing causes my OSPF "redistribute connected" routes to flush, taking up to ten seconds to repopulate. That shouldn't happen. High severity.
#2. OSPF protocol filtering (FRR GUI - Global Settings / Route Handling) causes FRR to do strange things** making all my OSPF routes invalid, and often crashes FRR itself, where the only remedy is a firewall reload, as not even the shiny new red "Force Service Restart" GUI option can recover the situation. High severity.
**FRR "show running-config" shows the prefix constantly changing seq numbers every second in the ACCEPTFILTER prefix-list. It's weird to say the least, and would seem a pfSense coding bug.
#3. ACLs no longer have an implicit deny at the end. This is different behaviour to pfSense 2.4.5p1. I noticed this when I was doing redist connected with a distribute-list (which called a "Zebra" ACL): everything gets redistributed in 2.5.1 (and in 2.5.0). That didn't happen in 2.4.5p1. ACLs should always have an implicit deny at the end rather than an implicit allow. Probably an FRR upstream issue. Potential problem for upgraders. Can be mitigated with explicit "deny any" entries, but that shouldn't be necessary with correct behaviour.
#4. OpenVPN links re-establishing can cause "onlink" routes to become inactive, regardless of whether "Ignore IPsec restart events" is ticked. Not sure why. High severity. Might be related to other issues detailed here.
#5. Making changes via the GUI in any section causes the ACCEPTFILTER prefix-list entries to be duplicated in the FRR config ("show running-config"). They have unique seq numbers, but the same entries are repeated over and over again. Not sure if it's service-impacting, but sloppy coding nonetheless.
#6. Interface descriptions are cumulative, in that an interface lower down in the config contains the strings of all interface descriptions above it, plus its own description. This "repeating" behaviour is similar to point #5. More sloppy coding it would seem.
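For issue #5, one way to check whether entries are being duplicated is to group the prefix-list lines of the running config by everything except their seq number. The following is only a rough sketch: the sample entries are made up for illustration (the real ACCEPTFILTER contents aren't shown in this thread), and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches FRR running-config lines such as:
#   ip prefix-list ACCEPTFILTER seq 10 deny 10.24.194.0/24
ENTRY = re.compile(r"^ip prefix-list (\S+) seq (\d+) (permit|deny) (.+)$")

def duplicate_prefix_list_entries(running_config):
    """Return {(list, action, rule): [seq, ...]} for rules that appear more than once."""
    seen = defaultdict(list)
    for line in running_config.splitlines():
        m = ENTRY.match(line.strip())
        if m:
            name, seq, action, rule = m.groups()
            seen[(name, action, rule)].append(int(seq))
    return {key: seqs for key, seqs in seen.items() if len(seqs) > 1}

# Hypothetical sample config text, NOT copied from a real pfSense box.
sample = """\
ip prefix-list ACCEPTFILTER seq 10 deny 10.24.194.0/24
ip prefix-list ACCEPTFILTER seq 20 permit any
ip prefix-list ACCEPTFILTER seq 30 deny 10.24.194.0/24
ip prefix-list ACCEPTFILTER seq 40 permit any
"""

dups = duplicate_prefix_list_entries(sample)
for (name, action, rule), seqs in dups.items():
    print(f"{name}: '{action} {rule}' repeated at seq {seqs}")
```

In practice the text would come from saving the output of "show running-config" to a file. Running the check before and after a GUI change would show whether the list keeps growing.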
I'm about to clone my GNS3 POC lab swapping out with other FRR devices to get an idea of what might be caused by pfSense, and what might be upstream FRR issues. This will take some time. Perhaps all weekend at best, otherwise it'll be next week.
If this thread gets too big, then it might be best to start individual topics on the issues. See how it goes. To the Netgate folks: I've been working on OSPF full-on since the days of dial-up, with 25+ years in ISP and enterprise networking roles. That context sometimes helps... maybe.
-
@gcon Has anyone from Netgate seen this yet? Anyone? Well, I guess the last two days have been a weekend. Hopefully someone at Netgate is reading this come Monday morning.
For #1 - Netgate - Imagine two firewalls where they have primary and backup paths to each other (which I do). This is a connected redistributed route on firewall B, being sent to firewall A.
host.domain# show ip route ospf | include 10.24.194.0/24
O>* 10.24.194.0/24 [110/20] via 10.255.195.2, ovpns2 onlink, weight 1, 00:01:04
Now with this in OPNsense, if I drop the backup link (ovpns3), either from the server end or the client end, that route stays pinned up and solid - as it should.
But if I do the same thing with pfSense, it drops the route on the primary link as well and causes about ten seconds of outage. This is important, as we have VPN users coming in on the primary link from that source subnet, and the backup link is a 4G link that goes offline from time to time - as wireless networks sometimes do. With pfSense that causes an outage. With OPNsense it doesn't.
I have been waiting for pfSense to fix this since 2.4.5. Wasn't fixed in 2.4.5p1. Wasn't fixed in 2.5.0. Still isn't fixed in 2.5.1. Guys - this is a critical bug!
Issues #2, #4 and #5 are all related to the ACCEPTFILTER instability / screw-ups. Fix that and you'll fix those three issues.
Issue #3 - I'm not sure, but that might be an FRR issue. I can't replicate it in OPNsense as they just have prefix lists rather than ACLs, so you're on your own there. It might be the cause of a lot of upgrader grief though, so you should make note of it for upgraders, or you'll just lose more customers to upgrade frustrations.
Issue #6 - again I cannot replicate in OPNsense as they don't populate their FRR configs with interface descriptions. I'd guess that it's your PHP code but I haven't looked into it (hence why it's a guess). Minor issue but not a good sign of code quality.
Now that OPNsense does all that I need it to do, I'm finally in the position where I can choose, so if anyone from Netgate is interested in keeping this customer (and many others who silently up and leave) then show some willingness to sort this FRR routing mess out. Otherwise, as is the benefit of a free market economy, I'll up and leave for proven better alternatives. Thank you.
-
Hi,
why do you not open a redmine bug report?
Bug Report
Have a good week,
fireodo
-
@gcon said in Found 6 issues with FRR/OSPF in pfSense 2.5.1:
#6. Interface descriptions are cumulative, in that an interface lower down in the config contains the strings of all interface descriptions above it, plus its own description. This "repeating" behaviour is similar to point #5. More sloppy coding it would seem.
fix is ready: https://redmine.pfsense.org/issues/11768
Please create bugreports for other issues,
see https://docs.netgate.com/pfsense/en/latest/development/bug-reports.html
-
@viktor_g I have logged the issue of OSPF FRR redistributed connected routes going missing to Redmine as https://redmine.pfsense.org/issues/11835
At this stage I believe there are two other issues - the ACCEPTFILTER issue, and change of default behaviour in ACLs. I will do more lab testing on those with the intention of logging those as well.
Thanks
-
@viktor_g said in Found 6 issues with FRR/OSPF in pfSense 2.5.1:
fix is ready: https://redmine.pfsense.org/issues/11768
The Link in the Fix does not work!
-
@fireodo said in Found 6 issues with FRR/OSPF in pfSense 2.5.1:
@viktor_g said in Found 6 issues with FRR/OSPF in pfSense 2.5.1:
fix is ready: https://redmine.pfsense.org/issues/11768
The Link in the Fix does not work!
Try this patch: 80.diff
see https://docs.netgate.com/pfsense/en/latest/development/system-patches.html
-
@viktor_g Hi Victor. I have also logged the ACCEPTFILTER prefix-list issue here:
https://redmine.pfsense.org/issues/11836
I will now investigate more thoroughly the default behaviour of access control lists, which in my previous lab testing seemed to have switched from an implicit deny at the end to an implicit accept at the end. If true (which seemed to be the case), that would be another massive source of upgrade headaches.
-
@viktor_g said in Found 6 issues with FRR/OSPF in pfSense 2.5.1:
@fireodo said in Found 6 issues with FRR/OSPF in pfSense 2.5.1:
@viktor_g said in Found 6 issues with FRR/OSPF in pfSense 2.5.1:
fix is ready: https://redmine.pfsense.org/issues/11768
The Link in the Fix does not work!
Try this patch: 80.diff
see https://docs.netgate.com/pfsense/en/latest/development/system-patches.html
Thank you!
Kind regards,
fireodo
-
@fireodo Phew! Last bug logged for my "issue #3" in this thread https://redmine.pfsense.org/issues/11841. I hammered away confirming that access lists now behave way differently in 2.5.x, defaulting to an implicit "permit any" rather than implicit "deny any".
This has huge ramifications for upgraders. It hit me like the proverbial tonne of bricks. I'd say it's another reason why people have griped to me about routing issues in 2.5.x.
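To make the impact concrete, here is a small model of first-match list evaluation. It is only a sketch under simplifying assumptions: it is not FRR's code, the matching rule (a prefix matches an entry if it falls inside the entry's network) is a simplification, and every subnet other than 10.24.194.0/24 is made up for illustration.

```python
from ipaddress import ip_network

def acl_permits(prefix, entries, implicit_action):
    """First-match evaluation. `implicit_action` is what happens when nothing matches."""
    candidate = ip_network(prefix)
    for action, match in entries:
        if match == "any" or candidate.subnet_of(ip_network(match)):
            return action == "permit"
    return implicit_action == "permit"

# Connected networks on a hypothetical firewall.
connected = ["10.24.194.0/24", "192.168.50.0/24", "172.16.9.0/24"]

# The ACL is meant to let only one connected network into OSPF.
acl = [("permit", "10.24.194.0/24")]

def redistributed(entries, implicit_action):
    return [p for p in connected if acl_permits(p, entries, implicit_action)]

with_implicit_deny = redistributed(acl, "deny")        # behaviour seen on 2.4.5p1
with_implicit_permit = redistributed(acl, "permit")    # behaviour seen on 2.5.x
with_explicit_deny = redistributed(acl + [("deny", "any")], "permit")  # mitigation

print(with_implicit_deny)    # ['10.24.194.0/24']
print(with_implicit_permit)  # ['10.24.194.0/24', '192.168.50.0/24', '172.16.9.0/24']
print(with_explicit_deny)    # ['10.24.194.0/24']
```

With an explicit "deny any" as the last entry, the result no longer depends on which implicit action the platform applies, which is why it works as a mitigation on both versions.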
-
@gcon said in Found 6 issues with FRR/OSPF in pfSense 2.5.1:
I will do more lab testing on those with the intention of logging those as well.
I have this issue with a client. Running FRR OSPF and peering over IPsec VTI. Some routes stop working for no reason. They are in the FRR daemon but do not populate the route table under Diagnostics > Routes. Resetting the OSPF daemon fixes the issue. I also checked "Ignore IPsec restart events", to no avail.
Did you ever figure this out?
-
@hempfieldtech
I logged about 3 or 4 FRR-related issues. I saw that the ACCEPTFILTER bug already had a bug entry, as I didn't know at the time that the packages used a separate bug tracking system.
For your issue, do the routes ever come back on their own if left long enough, or is the only fix resetting ospf/FRR? For my connected redistributed routes disappearing - they come back on their own, but they should never have dropped to begin with. Sounds like you might be dealing with a different issue.
If you have a lab setup (GNS3 perhaps) you could try replicating it there and try substituting a VyOS or OPNsense device to see if it is happening there as well. Or even just a generic FreeBSD (or Linux) setup with FRR installed. Since I'm not seeing any urgency on the issues I logged, I have moved to OPNsense already, as the routing issues I faced with pfSense are all fixed there. I will keep tracking these issues in pfSense occasionally to see if/when they are addressed by Netgate engineers.
-
@hempfieldtech
Did you discover any more info?
How frequently did you encounter this?
I think I saw this today on a site (Italy).
Routes were present in the OSPF table.
One was missing under Diagnostic / Routes for at least 10 hours before restarting FRR brought it back.
Restarting FRR Ospfd didn't bring the route in.
Restarting FRR Core Zebra did.
The spoke off of London was present.
But the route to London (hub) itself was missing.
2.5.2-RELEASE
Package version frr net 1.1.0_15
About 3 months since last config change
-
@ay Hi there. It's been 8 months since I did my deep-dive into dynamic routing with pfSense. I do recommend getting a GNS3 lab together though if you can do that - it's great for testing. I'll get back into pfSense testing probably this month when the new version comes out.
-
I set up a cron job to restart OSPFD on a schedule every morning. It was the only way to overcome the route issue, although I have not investigated further since. This is actually a smooth way to do it and it doesn't cause much of a blip on traffic.
-Favyan
-
@gcon I'm experiencing #4 as well and I can reproduce it consistently. I paid for support and opened a case with Netgate TAC. After looking at things for a couple of days, turning on extended logging, and having me reproduce it multiple times, I get, "Can you try upgrading to the 2.6 release candidate?" I mean, I can, and I will next week since I set up a whole test site to try to work with them on this, but it doesn't seem like they have any idea, and based on the lack of response on your Redmine bug reports, I'm not confident anything will have changed. We've used pfSense for years and been pretty happy, but they don't seem to be treating critical FRR issues with any urgency, as this issue started with 2.5.x over 8 months ago.
-
@mdomnis Unfortunately, I am able to reproduce the undesirable behavior (similar to your #4 it would seem) on 2.6 RC as well. :( Waiting for TAC to give me something else to try or have someone dig in further.
-
Summarising the initial 6 things I raised:
#1. SPF algorithm firing causes OSPF "redistribute connected" routes to flush.
This was raised in #11835.
I can see that no one has worked on this critical bug. I have tested and this is still an issue (!!)
#2. OSPF protocol filtering (FRR GUI - Global Settings / Route Handling) causes FRR to do strange things (and make OSPF routes invalid / crash FRR etc)
I avoid the "FRR GUI - Global Settings / Route Handling" way of filtering, as I found it too unstable, so I haven't tested it since finding it a problem. I have done the filtering elsewhere, on my Mikrotik routers, instead.
I did raise 11836 for a related issue, and some things improved there, but I'm not sure if this actual issue is fixed or not. Since I don't use the "route handling" features I stopped looking at this issue.
#3. ACLs no longer have an implicit deny at the end.
I did raise 11841, but I am not looking at this issue any more: I found that prefix-lists weren't affected, so I swapped over from access-lists (ACLs) to prefix-lists for my needs (redistributing specific connected routes into OSPF).
#4. OpenVPN links re-establishing can cause "onlink" routes to become inactive
@mdomnis How did you end up going with this? I didn't actually raise a ticket for this but you've been working with pfSense on it I see. I'm not seeing it in pfSense 2.6, but my test lab might be different to when I had it last. Solved?
Issues #5 and #6 - ACCEPTFILTER prefix-list entries duplicated, and interface descriptions cumulative
These got fixed - am not seeing these issues in pfSense 2.6. They would have been pretty trivial to sort out.
======
So in short, #5 and #6 are fixed. #4 seems to be fixed (to be confirmed). #2 and #3 I have worked around (I have avoided those features, so I'm not affected).
The only thing that I am affected by right now (and cannot avoid) is issue #1. And it's still really bad. Here's one of my connected routes dropping the moment a backup link comes back up:
O>* 10.24.194.0/24 [110/20] via 10.255.195.2, ovpns2 onlink, weight 1, 01:07:33
pfsense01.it.somecompany.com.au# show ip route | include 10.24.1
O>* 10.24.194.0/24 [110/20] via 10.255.195.2, ovpns2 onlink, weight 1, 01:07:35
pfsense01.it.somecompany.com.au# show ip route | include 10.24.1
O>* 10.24.194.0/24 [110/20] via 10.255.195.2, ovpns2 onlink, weight 1, 01:07:37
pfsense01.it.somecompany.com.au# show ip route | include 10.24.1
O>* 10.24.194.0/24 [110/20] via 10.255.195.2, ovpns2 onlink, weight 1, 01:07:38
pfsense01.it.somecompany.com.au# show ip route | include 10.24.1
O>* 10.24.194.0/24 [110/20] via 10.255.195.2, ovpns2 onlink, weight 1, 01:07:40
pfsense01.it.somecompany.com.au# show ip route | include 10.24.1
pfsense01.it.somecompany.com.au# show ip route | include 10.24.1
pfsense01.it.somecompany.com.au# show ip route | include 10.24.1
pfsense01.it.somecompany.com.au# show ip route | include 10.24.1
pfsense01.it.somecompany.com.au# show ip route | include 10.24.1
pfsense01.it.somecompany.com.au# show ip route | include 10.24.1
O>* 10.24.194.0/24 [110/20] via 10.255.195.2, ovpns2 onlink, weight 1, 00:00:01
pfsense01.it.somecompany.com.au# show ip route | include 10.24.1
O>* 10.24.194.0/24 [110/20] via 10.255.195.2, ovpns2 onlink, weight 1, 00:00:03
pfsense01.it.somecompany.com.au# show ip route | include 10.24.1
O>* 10.24.194.0/24 [110/20] via 10.255.195.2, ovpns2 onlink, weight 1, 00:00:04
pfsense01.it.somecompany.com.au#
10.255.195.2 is the far end of the primary link (p2p). The backup p2p link re-establishing should not cause this route learned over the primary link to flush and relearn. I'm testing pfSense 2.6.0-RELEASE, which is built on FreeBSD 12.3-STABLE and has FRR version 7.5.1.
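The flap in the capture above can also be picked out mechanically from the route uptime in the last column. The following is a rough sketch, not a polished tool: it assumes one string per poll (empty when the command printed nothing), it only understands the HH:MM:SS uptime format seen in this capture, and the function names are made up.

```python
import re

# Matches an OSPF route line and captures its prefix and HH:MM:SS uptime.
ROUTE = re.compile(r"^O\S*\s+(\S+)\s.*,\s*(\d+):(\d\d):(\d\d)\s*$")

def uptime_seconds(sample):
    """Route uptime in seconds, or None if the poll returned no route line."""
    m = ROUTE.match(sample.strip())
    if not m:
        return None
    hours, minutes, seconds = (int(x) for x in m.groups()[1:])
    return hours * 3600 + minutes * 60 + seconds

def summarize(samples):
    """Count polls where the route was absent and how often its uptime went backwards."""
    ages = [uptime_seconds(s) for s in samples]
    missing = sum(a is None for a in ages)
    seen = [a for a in ages if a is not None]
    resets = sum(later < earlier for earlier, later in zip(seen, seen[1:]))
    return {"missing_polls": missing, "resets": resets}

# The polls from the capture above: 5 with the route, 5 empty, 3 after it returned.
line = "O>* 10.24.194.0/24 [110/20] via 10.255.195.2, ovpns2 onlink, weight 1, {}"
before = [line.format(t) for t in ("01:07:33", "01:07:35", "01:07:37", "01:07:38", "01:07:40")]
after = [line.format(t) for t in ("00:00:01", "00:00:03", "00:00:04")]
samples = before + [""] * 5 + after

result = summarize(samples)
print(result)  # {'missing_polls': 5, 'resets': 1}
```

An uptime that goes backwards means the route was removed and reinstalled even if no poll happened to land inside the gap, so the reset count is the more reliable of the two numbers when polling is slow.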
update: I cloned my lab and updated pfSense to 2.7.0
2.7.0-DEVELOPMENT (amd64)
built on Mon Oct 17 06:04:34 UTC 2022
FreeBSD 14.0-CURRENT
It is still happening on there. The FRR on 2.7 is still only 7.5.1. Why so old? https://frrouting.org/release/ That's from March 7 2021. FRR is up to 8.3.1 now - 5 releases on from that. I'd really like to see what happens with a later version of FRR, and I'm hoping the devs can update the FRR package to the latest release soon.