Updating to pfSense+ 24.3 breaks routing - kernel routes now gone
-
To clarify, is the default route missing from both Zebra and the kernel, or just Zebra?
It crazily thinks that some high-cost OSPF route is a better option than its directly connected default gateway, and it shouldn't.
Is this happening while the lower-cost route exists in the kernel? If so, is that happening with newly established traffic as well (as in not traffic for which states already exist)?
-
Please test this patched frr 9.1 version and let us know if the issue persists.
-
@marcosm I tested that in my production lab by upgrading the lab pfSense Plus 23.x to 24.x and confirming the breakage (K routes disappearing). I then stopped FRR, applied the patched version, and started it again: the kernel routes showed up. After a reboot the K routes are still there.
Is there a bug reference ID you can link to? I'm really curious! I've spent days on this and would love to find out.
Would you recommend I use this in production? Maybe I am best waiting for 24.8 - where perhaps an updated FRR build will have more testing? Then I can skip 24.3 altogether and just go straight to 24.8.
This patched version has this:
configured with: '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-vtysh' '--disable-doc-html' '--sysconfdir=/var/etc/frr' '--localstatedir=/var/run/frr' '--disable-nhrpd' '--disable-pathd' '--disable-ospfclient' '--disable-pimd' '--disable-pbrd' '--with-vtysh-pager=cat' '--enable-backtrace' '--disable-config-rollbacks' '--disable-datacenter' '--enable-fpm' '--disable-ldpd' '--disable-doc' '--without-libpam' '--enable-rpki' '--disable-sharpd' '--disable-shell-access' '--enable-snmp' '--disable-tcmalloc' '--prefix=/usr/local' '--mandir=/usr/local/man' '--disable-silent-rules' '--infodir=/usr/local/share/info/' '--build=amd64-portbld-freebsd15.0' 'build_alias=amd64-portbld-freebsd15.0' 'PKG_CONFIG=pkgconf' 'PKG_CONFIG_LIBDIR=/wrkdirs/usr/ports/net/frr9/work/.pkgconfig:/usr/local/libdata/pkgconfig:/usr/local/share/pkgconfig:/usr/libdata/pkgconfig' 'CC=cc' 'CFLAGS=-O2 -pipe -fstack-protector-strong -fno-strict-aliasing ' 'LDFLAGS= -L/usr/local/lib -L/usr/local/lib -fstack-protector-strong ' 'LIBS=' 'CPPFLAGS=-I/usr/local/include -I/usr/local/include' 'CPP=cpp' 'CXX=c++' 'CXXFLAGS=-O2 -pipe -fstack-protector-strong -fno-strict-aliasing ' 'PYTHON=/usr/local/bin/python3.11'
I compared it to other builds and nothing stands out. SNMP was off in one of the builds (for CE), and one of the other builds had "--mandir=/usr/local/share/man" instead of "--mandir=/usr/local/man", so I'm thinking that the fix was more than just build config.
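For anyone who wants to repeat that comparison without eyeballing two very long lines, here is a minimal Python sketch. It assumes you have pasted each build's "configured with:" line into a string; the two sample strings below are shortened, made-up stand-ins using only the flags mentioned above, and the function names are my own.

```python
import shlex


def configure_flags(line):
    """Split a 'configured with:' line into a set of individual flags."""
    head, sep, rest = line.partition("configured with:")
    text = rest if sep else head
    # shlex handles the single-quoted flags, including ones with spaces
    # inside the quotes such as 'CFLAGS=-O2 -pipe ...'.
    return set(shlex.split(text))


def diff_flags(build_a, build_b):
    """Return (flags only in build_a, flags only in build_b), sorted."""
    a, b = configure_flags(build_a), configure_flags(build_b)
    return sorted(a - b), sorted(b - a)


# Shortened, hypothetical sample lines for illustration only.
patched = "configured with: '--enable-vtysh' '--enable-snmp' '--mandir=/usr/local/man'"
other = "configured with: '--enable-vtysh' '--mandir=/usr/local/share/man'"

only_patched, only_other = diff_flags(patched, other)
print("only in patched:", only_patched)
print("only in other:  ", only_other)
```

If both lists come back with nothing beyond cosmetic differences like these, the change is most likely in the source patch rather than in the build options.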
In case this info is still required, even though the root cause seems to have been identified and fixed:
To clarify, is the default route missing from both Zebra and the kernel, or just Zebra?
The default route was missing just from Zebra. It was in the kernel.
Is this happening while the lower-cost route exists in the kernel?
Yes that's right.
If so, is that happening with newly established traffic as well (as in not traffic for which states already exist)?
Yes. Internet web browsing to new websites was broken. Traffic would go from a workstation to the MikroTik Cloud Core LAN router. The MikroTik could see the default route to the local pfSense, and also a default route over the microwave link to the other site's pfSense. The microwave link has a high OSPF cost, so the LAN router would correctly send the Internet traffic to the local pfSense. But the local pfSense had an OSPF-learned route to the remote site over the microwave link and no K route for the locally connected gateway, so it bounced the traffic back to the LAN router, which then sent it back to the local pfSense. You can see that with traceroutes: traffic oscillates between the firewall and the LAN router until the TTL expires.
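To make the loop easier to picture, here is a rough Python sketch of it. This is a toy model, not anything pfSense or FRR actually runs: each device has a single default next hop, and the node names ("lan_router", "local_pfsense", "wan_gateway") are labels I made up for the devices described above.

```python
def trace(next_hop, start, dest, ttl=6):
    """Follow default next hops from start until dest is reached or TTL runs out.

    Returns (path, delivered).
    """
    path = [start]
    node = start
    for _ in range(ttl):
        node = next_hop[node]
        path.append(node)
        if node == dest:
            return path, True
    return path, False


# Broken state: the firewall has no K route for its connected gateway, so its
# best default is the OSPF-learned one, which points back at the LAN router.
broken = {
    "workstation": "lan_router",
    "lan_router": "local_pfsense",
    "local_pfsense": "lan_router",
}

# Expected state: the firewall prefers its directly connected gateway.
fixed = dict(broken, local_pfsense="wan_gateway")

print(trace(broken, "workstation", "wan_gateway"))
print(trace(fixed, "workstation", "wan_gateway"))
```

In the broken table the path alternates between lan_router and local_pfsense until the hop budget is used up, which matches what the traceroutes showed. In the fixed table it reaches wan_gateway in three hops.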
I don't know how it all works, but my experience suggests that if a route exists in Zebra and is subsequently added to the Zebra FIB, then this is the forwarding that gets used. If Zebra has no RIB/FIB entry, then it falls back to the system RIB/FIB (as given by "netstat -rn") before failing. This layering would make sense so that Zebra can start and stop with the least amount of impact. It's a massive danger though when the kernel routes don't get pushed from system/kernel to Zebra, because an incomplete view can lead to extremely poor routing decisions.
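A quick way to spot that kind of incomplete view is to compare the prefixes the kernel has against the prefixes Zebra knows about. The sketch below is only an illustration in Python and assumes you have already copied the prefixes by hand from "netstat -rn" and from the FRR route table into two lists; it does not parse either tool's output, and the sample prefixes are documentation addresses, not my real network.

```python
import ipaddress


def missing_from_zebra(kernel_prefixes, zebra_prefixes):
    """Return the kernel prefixes that Zebra has no entry for."""
    kernel = {ipaddress.ip_network(p) for p in kernel_prefixes}
    zebra = {ipaddress.ip_network(p) for p in zebra_prefixes}
    missing = kernel - zebra
    # Sort for stable output: by IP version, then address, then prefix length.
    return sorted(
        missing,
        key=lambda n: (n.version, int(n.network_address), n.prefixlen),
    )


# Hand-entered sample data for illustration only.
kernel_routes = ["0.0.0.0/0", "192.0.2.0/24"]
zebra_routes = ["192.0.2.0/24", "198.51.100.0/24"]

for net in missing_from_zebra(kernel_routes, zebra_routes):
    print("in kernel but not in Zebra:", net)
```

With the sample data this reports 0.0.0.0/0, which is the situation I had: the default route present in the kernel but absent from Zebra.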
-
We found what looks to be the root cause - info has been posted to the Redmine report.
The route redistribution issue still needs testing with the patched version, any help with that would be appreciated.
I suggest waiting until we pick back the fix to 24.03 for your production systems.
-
@marcosm said in Updating to pfSense+ 24.3 breaks routing - kernel routes now gone:
Please test this patched frr 9.1 version and let us know if the issue persists.
How do you install this? Sorry, I'm pretty new. Can I just scp this to my Netgate 7100 and use some sort of package manager to install it? Is there any particular process that won't break further releases?
-
@mAineAc See the previous comment.
-
@marcosm Yeah, after installing there is no change. Rebooted, still no change. I don't see the default route in FRR, and it is not redistributing the default route.
-
@marcosm I just tested in my production simulation lab and all looks good. I'll update the actual production firewall this weekend. This is a great result - thanks so much for your efforts - it's really appreciated.
-
@marcosm Will this be coming to 24.08.a.20240702.0600? I am running this build, the package listed does not seem to work, and I am still having the same issue. I have not seen any updated packages.
-
@mAineAc No - you'd have to build/install it manually for the public dev build. I'm not aware of any official bug report for the issue you're experiencing. My suggestion is to treat it like any other bug report: provide steps to reproduce it, and determine if it's a regression by finding the version(s) of the related software when it last worked.