Netgate Discussion Forum

    Updating to pfSense+ 24.3 breaks routing - kernel routes now gone

    Scheduled Pinned Locked Moved FRR
    51 Posts 7 Posters 4.9k Views
    • W
      wellsouz
      last edited by

      I have the same issue here after upgrading from 23.09.1 to 24.03.1. Kernel routes just disappeared from FRR. The main problem is that I have distributed the default route to the network through OSPF. So, I cannot do it anymore and my network is broken.

      • M
        mcury @wellsouz
        last edited by

        @wellsouz Did you guys open a Redmine ticket for this issue?

        dead on arrival, nowhere to be found.

        • W
          wellsouz @mcury
          last edited by

          @mcury Yes. It's here.

          • G
            Gcon @wellsouz
            last edited by

            @wellsouz Thank you for opening a ticket. Up until this point I figured I was the only one. When I'm a lone voice in the wilderness (no forum replies), I tend to just wait in the hope that it's been found and fixed elsewhere, retesting with each subsequent release.

            I am still on 23.09.1 in production (to keep my network redundancy), but have a GNS3 lab mimicking my production network which can replicate the issue. I can do some tests if there's a beta build of pfSense/FRR recommended to re-test with.

            • G
              Gcon @Gcon
              last edited by Gcon

              I have posted on the ticket asking whether the issue is at least confirmed. It doesn't look to be. It is a serious regression which needs to be assigned / confirmed / worked on / remedied. Now Netgate are posting about 24.08, and this serious issue hasn't even been looked at? I really don't think Netgate have enough resources for proper FRR support. I mean, look how many tickets there are for FRR, and how old many of the tickets are. Regular maintenance on FRR tickets isn't even being done.

              • M
                marcosm Netgate
                last edited by

                In case someone experiencing this issue would like to test with the latest FRR release, I've attached frr9.1.1 for amd64 platforms:

                Extract and upload the files using Diagnostics > Command Prompt, stop FRR, then install it like so:

                pkg install -fy /tmp/frr9-pythontools-9.1.1.pkg
                pkg install -fy /tmp/frr9-9.1.1.pkg
                
                • M
                  michmoor LAYER 8 Rebel Alliance @marcosm
                  last edited by

                  @marcosm Thanks Marcos. I will give this a test and see.

                  @Gcon In fairness here, if the problem is in the package then the problem is in upstream FreeBSD, and if the problem is there then it needs to be corrected by the maintainers. Although I understand the frustration, Netgate doesn't seem to be at fault in the regression. They are only pulling down the latest package updates.

                  That said, I do highly recommend postponing pfSense upgrades until this is rectified.

                  Firewall: NetGate,Palo Alto-VM,Juniper SRX
                  Routing: Juniper, Arista, Cisco
                  Switching: Juniper, Arista, Cisco
                  Wireless: Unifi, Aruba IAP
                  JNCIP,CCNP Enterprise

                  • G
                    Gcon @michmoor
                    last edited by Gcon

                    @michmoor I understand that Netgate don't develop the FRR code, but they are responsible for rolling it out to customers. Compare the following:

                    pfSense+ 23.09.1 = FRR package 2.0.2_1 / FRR 9.0.1
                    pfSense CE 2.7.2 = FRR package 2.0.2_1 / FRR 9.0.1 <- Current CE version
                    pfSense+ 24.03   = FRR package 2.0.2_3 / FRR 9.1   <- Current Plus version

                    The 9.0.1 was working fine for me, but then who got to be the crash test dummy for the FRR 9.1 first? Plus users, not CE users!

                    If you are insinuating that since Netgate don't develop FRR (they merely package it), that they are therefore not culpable - I would say this is a logical fallacy. It's like saying the woman who gave Alec Baldwin a loaded gun which fatally shot someone on the set of that Western movie isn't culpable because well - she didn't manufacture the gun and bullets.

                    So yeah Netgate - test more thoroughly before release, and I think even more importantly - let the community edition users shoot themselves in the foot first before paying customers do. Business users prioritise stability/reliability over features and the "latest and greatest". Otherwise - why bother having CE at all? Using Plus as the guinea pigs for CE just seems like a cowboy operation (the "Wild West" metaphor just keeps on giving).

                    Anyway, I will test FRR 9.1.1, and hopefully I don't have to call 911.

                    • M
                      mAineAc @Gcon
                      last edited by

                      @Gcon I am using version frr9-9.1_2 and on pfsense version 24.08.a.20240702.0600 and this is still an issue.

                      • M
                        michmoor LAYER 8 Rebel Alliance @Gcon
                        last edited by

                        @Gcon said in Updating to pfSense+ 24.3 breaks routing - kernel routes now gone:

                        If you are insinuating that since Netgate don't develop FRR (they merely package it), that they are therefore not culpable - I would say this is a logical fallacy. It's like saying the woman who gave Alec Baldwin a loaded gun which fatally shot someone on the set of that Western movie isn't culpable because well - she didn't manufacture the gun and bullets.

                        1. I am not insinuating anything
                        2. Your analogy, honestly, doesn't make any sense at all
                        3. I do agree that care should be taken when upgrading any package. I think it's probably in everyone's interest that the documentation include a warning that third-party packages are not fully tested for compatibility (my words) and could break the security and stability of critical systems. This isn't the first time a third-party package has caused production-level issues. I would even go as far as not including updated binaries of these packages any time there is a base software upgrade (23.09 > 24.03).
                          I don't know if there is an easy solution to this problem, to be honest.

                        • G
                          Gcon @mAineAc
                          last edited by

                          @mAineAc I upgraded my lab 23.09.1 (working fine) to 24.03 and replicated the issue again. I upgraded the two packages as per the above recommendation and I unfortunately still have the issue.

                          Then I thought: what else changed with 24.03? The glaringly obvious answer is the OS itself, moving to "15.0-CURRENT". 14.x was only released late last year (https://en.wikipedia.org/wiki/FreeBSD), and 15.x isn't even "released" yet.

                          Maybe that is OK for apps and features built into the core of the OS. The big issue is when it comes to third-party packages like FRRouting. Why? Because the FRRouting build instructions for FreeBSD 15 simply do not exist:
                          https://docs.frrouting.org/projects/dev-guide/en/latest/building.html

                          Maybe there is more info hiding elsewhere online, but it's conspicuously missing from the page you'd expect to find it on (the building link above). For FreeBSD - it tops out at 14. Did Netgate jump the gun, cross their fingers and hope that all the third-party packages would not break? It certainly seems that way. I don't think there should be any finger pointing at FRRouting if they haven't even certified it to run on FreeBSD 15. Crazy times.

                          The term "kernel" in "kernel routes" is probably a dead giveaway that the issue stems from the OS update to 15-CURRENT. I'm not saying it is, but where there is smoke...

                          So 9.1.1 unfortunately hasn't fixed the issue, and I suspect no version of FRR released to date will fix the issue, when paired with FreeBSD 15-CURRENT. This major OS update has changed something fundamentally in the kernel, and until FRR software gains official build support for FreeBSD 15, we are all probably wasting our time with this.

                          • M
                            mAineAc @Gcon
                            last edited by mAineAc

                            @Gcon What really bothers me about this is in the OSPF settings themselves. If you select the option to always redistribute the default route, the kernel routing table should not make a difference. It should be redistributing the default route whether there is a default route or not.

                            Edit: I was just looking at the /var/etc/frr/frr.conf file, and the OSPF config does not have the redistribute default command in it at all, with or without always.

                            Edit 2: Actually, the originate default is in there if that option is selected, but originate default always does not appear. If you don't have the redistribute default option selected and only the redistribute default always option, it will not appear in the config at all. Not sure if both should be selected, but the originate default always never appears in the config.
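
One way to see which default-route statement actually landed in the generated config is to grep for it. This is a minimal sketch, assuming the config is at /var/etc/frr/frr.conf as mentioned above and that ospfd's statement reads `default-information originate [always]`. The function name `default_originate_mode` is made up for illustration.

```shell
# Hypothetical helper: report how the OSPF default route is originated
# in an frr.conf-style file. Prints one of: always, conditional, missing.
default_originate_mode() {
    conf="$1"
    # "always" may follow other options (metric, metric-type, route-map).
    if grep -Eq '^[[:space:]]*default-information originate([[:space:]].*)?[[:space:]]always([[:space:]]|$)' "$conf"; then
        echo always
    elif grep -Eq '^[[:space:]]*default-information originate' "$conf"; then
        echo conditional
    else
        echo missing
    fi
}
```

Usage would be `default_originate_mode /var/etc/frr/frr.conf`. A result of `conditional` when the "always" box is ticked in the GUI would point at config generation rather than at the kernel routing table.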

                            • M
                              michmoor LAYER 8 Rebel Alliance @marcosm
                              last edited by

                              @marcosm
                              Marcos, I can't find anything upstream that indicates this is an issue (maybe I missed it). Is this an issue acknowledged by any maintainers? FRR is just in an unusable state.

                              • M
                                marcosm Netgate
                                last edited by

                                Let's try to keep the discussion focused on resolving the issue. To recap a bit...

                                Reports indicate that:

                                • The issue does not happen on 23.09.1 with FRR 9.0.2.
                                • The issue is still present on 24.03 with the latest FRR version; ref 1180159.
                                • OSPF-learned default routes do not get redistributed; ref 189051.
                                • These default routes may not get added to the kernel's routing table, though sometimes they do get added; ref 1172657.

                                Some things to try next in no particular order:

                                • Test with the latest FRR version on 23.09.1 (which is on FreeBSD 14).
                                • Diff the FRR and pfSense config between 23.09.1 and 24.03. The package GUI hasn't had any changes that would affect this, but changes in pfSense itself could.
                                • Check the routing, system, and gateway logs when the issue happens (see "Syslog Logging" option in the package's Global Settings).
                                • Check if the issue is reproducible with a floating state policy (see "Firewall State Policy" in System > Advanced > Firewall & NAT). It's unlikely this would affect the issue, but it's an easy test.
                                • M
                                  michmoor LAYER 8 Rebel Alliance @marcosm
                                  last edited by

                                  @marcosm agreed.
                                  I can spin up a VM in the next few hours, but I think the folks here have a working lab now. I'll report back with the findings.

                                  • G
                                    Gcon @marcosm
                                    last edited by Gcon

                                    @marcosm I have done tests. I would have done them sooner but I'm in a different time zone.

                                    In my GNS3 lab I booted up my simulated ISP router then booted just the pfsense+ firewall, but no other ospf peers behind the firewall, which cuts down the amount of routes in the routing table which aren't relevant.

                                    Results:
                                    No issue on 23.09.01 with FRR 9.0.2.
                                    The issue did occur running 23.09.01 with FRR 9.1.1

                                    Note the versions and kernel routes missing on the right (in pink).
                                    [Screenshot: pfSense+ 23.09.01 tests with different FRR versions]

                                    I can boot up the pfsense+ firewall multiple times, and each time the routes are consistent as shown above. 9.0.2 always has the K routes. 9.1.1 never does - at least on boot up and with no other instability or config changes.

                                    I diffed the /var/etc/frr/frr.conf from 9.0.2 and 9.1.1 and for me - exactly the same. I haven't checked config against 24.3 but might not need to, now that I see that even 23.09.1 is affected with the package update.
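
For anyone wanting to repeat the config comparison, here is a small sketch. It assumes you saved a copy of /var/etc/frr/frr.conf under each FRR version. The function name and the file names in the usage line are hypothetical.

```shell
# Hypothetical helper: compare two saved copies of frr.conf.
# Prints "identical" when they match, otherwise a unified diff.
compare_frr_conf() {
    if diff -u "$1" "$2"; then
        echo "identical"
    fi
}
```

For example: `compare_frr_conf /root/frr.conf.9.0.2 /root/frr.conf.9.1.1`.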

                                    When I reboot the virtualized Cisco 7200 I'm using to represent my ISP in my GNS3 simulation lab, the K route shows up:
                                    [Screenshot]

                                    You can see from the timestamps that I had the lab up for 20 minutes before I rebooted the virtual ISP, and the default K route shows up after that (in green).

                                    It looks like some race condition on boot up, because if you shake things up post-boot, you can coax some K routes out of it, as I have shown. Another example: under System > General Setup, if I change two of the DNS servers to point to "none" and save, then change them back to pointing at WAN1 as before, two more K routes show up. I could probably toggle something else to get the fourth and final one to show up.

                                    Obviously this is not a workable solution. As 24.3 has FRR 9.1, it does seem to me like a race condition was introduced into FRR (at least on FreeBSD 14 / FreeBSD 15) going from 9.0.2 to 9.1.

                                    As for system logs and policy state impact - I will have a look at that next.

                                    @mAineAc I am definitely interested in analysing default-route origination and/or redistribution at some point (as that affects me too) but maybe that's a different issue not related to this apparent race condition? Not sure. I just want my K routes back like they were in 9.0.2, and when those are again reliable - I can look at the other things.


                                    EDIT: I changed "Firewall State Policy" in "System / Advanced / Firewall & NAT" from Floating States to Interface Bound States and it made no difference, so changed it back.

                                    Also it doesn't seem to be a race condition on boot up, because post-boot, if I restart FRR by either unticking then reticking it in "Services / FRR / Global Settings" or by clicking "Force Service Restart" at the bottom of that page - the issues are still there.

                                    If I tickle the things that make the K routes show up, and then restart the process, the K routes disappear again.

                                    With the fault condition in place and the firewall having two WAN links (WAN1 and WAN2), if I log into the ISP1 router and shut the link to the firewall's WAN1, then after about 20 seconds a K route (default route) shows up on the firewall pointing to WAN2. When I unshut the interface on the ISP end, the default K route swaps back to WAN1 after 60 seconds.

                                    I have the two WANs in a Gateway Group with WAN1 as Tier1, and WAN2 as Tier2.
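
To tell whether the default route is missing from the kernel itself or only from zebra's view, the two tables can be compared side by side. This is a rough sketch, assuming the output of `netstat -rn -f inet` and `vtysh -c 'show ip route'` has been saved to files, and that FreeBSD's netstat labels the default route `default`. The function name and the paths in the usage line are hypothetical.

```shell
# Hypothetical helper: given saved output of `netstat -rn -f inet` ($1)
# and of `vtysh -c 'show ip route'` ($2), report where the IPv4 default
# route is visible.
default_route_status() {
    kernel=no
    frr=no
    # Kernel table: first column is the literal word "default".
    awk '$1 == "default" { found = 1 } END { exit !found }' "$1" && kernel=yes
    # FRR table: a line with route code K for 0.0.0.0/0.
    grep -Eq '^K[>* ]*0\.0\.0\.0/0 ' "$2" && frr=yes
    echo "kernel=$kernel frr=$frr"
}
```

For example: `netstat -rn -f inet > /tmp/kernel.txt; vtysh -c 'show ip route' > /tmp/frr.txt; default_route_status /tmp/kernel.txt /tmp/frr.txt`. Running it before and after each "tickle" would show whether only FRR's view changes.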

                                    I enabled syslog logging for FRR, with package logging level "extended". Nothing stood out to me in the logs. Tailing the logs when I reload FRR just gives me this:

                                    Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: /pkg_edit.php: Configuration Change: 
                                    Aug 13 02:05:18 GTpfsense01 check_reload_status[457]: Syncing firewall
                                    Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR BGPd: No config data found.
                                    Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR OSPF6d: No config data found.
                                    Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR RIPd: No config data found.
                                    Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR BFDd: No config data found.
                                    Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR: Rebuild configuration.
                                    Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR: Daemon state: zebra: running | mgmtd: running | staticd: running | ospfd: running
                                    Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR: Service restart forced.
                                    Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR: Restarting services.
                                    Aug 13 02:05:19 GTpfsense01 staticd[16920]: [MRN6F-AYZC4] Terminating on signal
                                    Aug 13 02:05:19 GTpfsense01 mgmtd[16272]: [X3G8F-PM93W] BE-adapter: mgmt_msg_read: got EOF/disconnect
                                    Aug 13 02:05:19 GTpfsense01 mgmtd[16272]: [J2RAS-MZ95C] Terminating on signal
                                    Aug 13 02:05:20 GTpfsense01 mgmtd[87833]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
                                    Aug 13 02:05:20 GTpfsense01 staticd[88655]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
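
To see at a glance which processes logged anything during the restart, the syslog lines can be tallied by source. This is a small sketch, assuming the standard syslog layout shown above (month, day, time, host, then `tag[pid]:`). `log_sources` is a made-up name.

```shell
# Hypothetical helper: count syslog lines per source tag (reads stdin).
log_sources() {
    awk '{
        tag = $5
        sub(/\[.*$/, "", tag)   # drop "[pid]:" if present
        sub(/:$/, "", tag)      # drop trailing ":" for tags without a pid
        n[tag]++
    }
    END { for (t in n) print t, n[t] }' | sort
}
```

Pipe whichever log you were tailing into `log_sources`. On the excerpt above it would list php-fpm, check_reload_status, staticd and mgmtd. zebra and ospfd would only appear if they logged something themselves in that window.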
                                    
                                    • M
                                      marcosm Netgate
                                      last edited by

                                      Ruling out changes between pfSense versions is very helpful. Before we can say it's an issue in FRR or even a bug at all, let's try to understand what exactly is happening. I'm hoping the logs will have a clue.

                                      • G
                                        Gcon @marcosm
                                        last edited by Gcon

                                        @marcosm I just edited my note with some extra info. Is it possible to get packages built for 10.1 and I can test those? https://www.frrouting.org/release/. Or even 9.0.3, just to confirm even more that the issue stems from the 9.0.x jump to 9.1.x.

                                        • M
                                          marcosm Netgate
                                          last edited by

                                          Here's frr9.0.3.

                                          • G
                                            Gcon @marcosm
                                            last edited by Gcon

                                            @marcosm Thanks. 9.0.3 is still fine. Straight after boot up:

                                            [23.09.1-RELEASE][root@GTpfsense01.<<hidden>>]/root: vtysh
                                            
                                            Hello, this is FRRouting (version 9.0.3).
                                            Copyright 1996-2005 Kunihiro Ishiguro, et al.
                                            
                                            GTpfsense01.<<hidden>># show ip route
                                            Codes: K - kernel route, C - connected, S - static, R - RIP,
                                                   O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
                                                   v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
                                                   > - selected route, * - FIB route, q - queued, r - rejected, b - backup
                                                   t - trapped, o - offload failure
                                            
                                            K>* 0.0.0.0/0 [0/0] via <<hidden>>, vmx2, 00:00:53
                                            C>* 10.27.10.0/24 [0/1] is directly connected, vmx1.10, 00:00:53
                                            C>* 10.27.194.0/24 [0/1] is directly connected, ovpns1, 00:00:53
                                            C>* 10.30.20.0/24 [0/1] is directly connected, vmx1.20, 00:00:53
                                            C>* 10.254.40.0/28 [0/1] is directly connected, vmx1.40, 00:00:53
                                            C>* 10.254.100.0/24 [0/1] is directly connected, vmx1.100, 00:00:53
                                            C>* 10.255.195.2/32 [0/1] is directly connected, ovpns2, 00:00:53
                                            C>* 10.255.196.2/32 [0/1] is directly connected, ovpns3, 00:00:53
                                            C>* 10.255.197.2/32 [0/1] is directly connected, ovpns4, 00:00:53
                                            K>* <<hidden>>/32 [0/0] via <<hidden>>, vmx2, 00:00:53
                                            C>* <<hidden>>/29 [0/1] is directly connected, vmx2, 00:00:53
                                            C>* <<hidden>>/22 [0/1] is directly connected, vmx3, 00:00:53
                                            C>* 172.16.27.1/32 [0/1] is directly connected, lo0, 00:00:53
                                            C>* 192.168.57.0/24 [0/1] is directly connected, vmx0, 00:00:53
                                            K>* 203.12.160.35/32 [0/0] via <<hidden>>, vmx2, 00:00:53
                                            K>* 203.12.160.36/32 [0/0] via <<hidden>>, vmx2, 00:00:53
                                            GTpfsense01.<<hidden>># 
                                            

                                            This was about a minute after boot up. No issues with K routes. All 4 expected ones - most crucially the default - are all there.
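
For repeated boot tests, the check for K routes can be scripted rather than eyeballed. This is a minimal sketch that reads `show ip route` output on stdin. The function names are made up for illustration, and it assumes the route-code column looks like the output above.

```shell
# Hypothetical helpers: both read `show ip route` output on stdin.

# Print the prefix of every kernel (K) route, one per line.
kernel_routes() {
    sed -n -E 's/^K[>* ]*([^ ]+) .*/\1/p'
}

# Succeed only if a kernel default route (0.0.0.0/0) is present.
has_kernel_default() {
    kernel_routes | grep -qxF '0.0.0.0/0'
}
```

For example: `vtysh -c 'show ip route' | has_kernel_default || echo "default K route missing"`.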

                                            So something in the 9.1.x series broke K routes on FreeBSD 14/15 when starting the FRR service. If you "tickle" things while the service is running, like a remote interface shutdown and unshut, or a config change and change-back, the K routes can be coaxed out, but obviously this is not workable or practical.

                                            https://frrouting.org/release/9.1/
                                            "FRR 9.1 brings a long list of enhancements and fixes with 941 commits from 73 developers."

                                            I scanned the CI
                                            https://ci1.netdef.org/browse/FRR-FRR121/

                                            Then scanned the tests for FreeBSD
                                            https://ci1.netdef.org/browse/FRR-FRR121-FBSD14AMD-101/test

                                            That all seems to be just BGP specific. There doesn't seem to be any CI tests specifically for FreeBSD and this functionality of K routes. No wonder regressions come in like this - if no one is testing for it. Geez, what a nightmare - trying to find out which of the 941 commits to 9.1 broke it on FreeBSD.

                                            Maybe Alexander Skorichenko askorichenko@netgate.com can provide some input, as he signed off on one of the changes backported to 9.1 https://ci1.netdef.org/browse/FRR-FRR121-54

                                            Do you think it could be related to my NIC type? I am using VMware vmxnet3 for both production and lab. I can rebuild in my GNS3 lab with igbX NICs to see if that changes anything. Mind you, in production I am tied to vmxnet3, as anything other than the paravirtualized vmxnet3 NICs gives comparatively poor performance (the alternative being e1000e, but that does not perform well at scale). So it would only be for information gathering. vmxnet3 works wonderfully with FRR 9.0.x, and I shouldn't have to change virtualized NIC types because someone broke vmxnet3. But I'll test anyway and see what I get.

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.