Netgate Discussion Forum

    Updating to pfSense+ 24.3 breaks routing - kernel routes now gone

    Scheduled Pinned Locked Moved FRR
    51 Posts 7 Posters 13.6k Views 7 Watching
    • G Offline
      Gcon @marcosm
      last edited by Gcon

      @marcosm I have done tests. I would have done them sooner but I'm in a different time zone.

      In my GNS3 lab I booted my simulated ISP router and then just the pfSense+ firewall, with no other OSPF peers behind the firewall. That keeps irrelevant routes out of the routing table.

      Results:
      No issue on 23.09.1 with FRR 9.0.2.
      The issue did occur on 23.09.1 with FRR 9.1.1.

      Note the versions and kernel routes missing on the right (in pink).
      84d6581c-bf67-479e-a758-dc9249e1b528-pfsplus 23.09.01 tests with different FRR.png

      I can boot up the pfsense+ firewall multiple times, and each time the routes are consistent as shown above. 9.0.2 always has the K routes. 9.1.1 never does - at least on boot up and with no other instability or config changes.

      I diffed the /var/etc/frr/frr.conf from 9.0.2 and 9.1.1 and for me - exactly the same. I haven't checked config against 24.3 but might not need to, now that I see that even 23.09.1 is affected with the package update.

      When I reboot the virtualized Cisco 7200 I'm using to represent my ISP in my GNS3 simulation lab, the K route shows up:
      e79f3960-9847-4646-94c9-78d5ab3f2858-image.png

      You can see from the timestamps that I had the lab up for 20 minutes before I rebooted the virtual ISP, and that the default K route shows up after that (in green).

      It looks like some race condition on boot up, because if you shake things up post-boot, you can coax some K routes out of it, as I have shown. Another example: under System > General Setup, if I set two of the DNS servers to point to "none" and save, then set them back to point down WAN1 as before, two more K routes show up. I could probably toggle something else to get the fourth and final one to show up.

      Obviously this is not a workable solution. As 24.3 has FRR 9.1, it does seem to me that a race condition was introduced into FRR (at least on FreeBSD14 / FreeBSD15) going from 9.0.2 to 9.1.

      As for system logs and policy state impact - I will have a look at that next.

      @mAineAc I am definitely interested in analysing default-route origination and/or redistribution at some point (as that affects me too) but maybe that's a different issue not related to this apparent race condition? Not sure. I just want my K routes back like they were in 9.0.2, and when those are again reliable - I can look at the other things.


      EDIT: I changed "Firewall State Policy" in "System / Advanced / Firewall & NAT" from Floating States to Interface Bound States and it made no difference, so changed it back.

      Also it doesn't seem to be a race condition on boot up, because post-boot, if I restart FRR by either unticking then reticking it in "Services / FRR / Global Settings" or by clicking "Force Service Restart" at the bottom of that page - the issues are still there.

      If I tickle the things that make the K routes show up, and then restart the process, the K routes disappear again.

      With the fault condition in place, the firewall has two WAN links (one to WAN1 and one to WAN2). If I log into the ISP1 router and shut the link to the firewall's WAN1, after about 20 seconds a K route (default route) shows up on the firewall pointing to WAN2. When I unshut the interface on the ISP end, after 60 seconds the default K route swaps over to WAN1.

      I have the two WANs in a Gateway Group with WAN1 as Tier1, and WAN2 as Tier2.

      I enabled syslog logging for FRR, with package logging level "extended". Nothing stood out to me in the logs. Tailing the logs when I reload FRR just gives me this:

      Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: /pkg_edit.php: Configuration Change: 
      Aug 13 02:05:18 GTpfsense01 check_reload_status[457]: Syncing firewall
      Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR BGPd: No config data found.
      Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR OSPF6d: No config data found.
      Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR RIPd: No config data found.
      Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR BFDd: No config data found.
      Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR: Rebuild configuration.
      Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR: Daemon state: zebra: running | mgmtd: running | staticd: running | ospfd: running
      Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR: Service restart forced.
      Aug 13 02:05:18 GTpfsense01 php-fpm[16405]: FRR Package: FRR: Restarting services.
      Aug 13 02:05:19 GTpfsense01 staticd[16920]: [MRN6F-AYZC4] Terminating on signal
      Aug 13 02:05:19 GTpfsense01 mgmtd[16272]: [X3G8F-PM93W] BE-adapter: mgmt_msg_read: got EOF/disconnect
      Aug 13 02:05:19 GTpfsense01 mgmtd[16272]: [J2RAS-MZ95C] Terminating on signal
      Aug 13 02:05:20 GTpfsense01 mgmtd[87833]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
      Aug 13 02:05:20 GTpfsense01 staticd[88655]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
      
      • M Offline
        marcosm Netgate
        last edited by

        Ruling out changes between pfSense versions is very helpful. Before we can say it's an issue in FRR or even a bug at all, let's try to understand what exactly is happening. I'm hoping the logs will have a clue.

        • G Offline
          Gcon @marcosm
          last edited by Gcon

          @marcosm I just edited my note with some extra info. Is it possible to get packages built for 10.1 and I can test those? https://www.frrouting.org/release/. Or even 9.0.3, just to confirm even more that the issue stems from the 9.0.x jump to 9.1.x.

          • M Offline
            marcosm Netgate
            last edited by

            Here's frr9.0.3.

            • G Offline
              Gcon @marcosm
              last edited by Gcon

              @marcosm Thanks. 9.0.3 is still fine. Straight after boot up:

              [23.09.1-RELEASE][root@GTpfsense01.<<hidden>>]/root: vtysh
              
              Hello, this is FRRouting (version 9.0.3).
              Copyright 1996-2005 Kunihiro Ishiguro, et al.
              
              GTpfsense01.<<hidden>># show ip route
              Codes: K - kernel route, C - connected, S - static, R - RIP,
                     O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
                     v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
                     > - selected route, * - FIB route, q - queued, r - rejected, b - backup
                     t - trapped, o - offload failure
              
              K>* 0.0.0.0/0 [0/0] via <<hidden>>, vmx2, 00:00:53
              C>* 10.27.10.0/24 [0/1] is directly connected, vmx1.10, 00:00:53
              C>* 10.27.194.0/24 [0/1] is directly connected, ovpns1, 00:00:53
              C>* 10.30.20.0/24 [0/1] is directly connected, vmx1.20, 00:00:53
              C>* 10.254.40.0/28 [0/1] is directly connected, vmx1.40, 00:00:53
              C>* 10.254.100.0/24 [0/1] is directly connected, vmx1.100, 00:00:53
              C>* 10.255.195.2/32 [0/1] is directly connected, ovpns2, 00:00:53
              C>* 10.255.196.2/32 [0/1] is directly connected, ovpns3, 00:00:53
              C>* 10.255.197.2/32 [0/1] is directly connected, ovpns4, 00:00:53
              K>* <<hidden>>/32 [0/0] via <<hidden>>, vmx2, 00:00:53
              C>* <<hidden>>/29 [0/1] is directly connected, vmx2, 00:00:53
              C>* <<hidden>>/22 [0/1] is directly connected, vmx3, 00:00:53
              C>* 172.16.27.1/32 [0/1] is directly connected, lo0, 00:00:53
              C>* 192.168.57.0/24 [0/1] is directly connected, vmx0, 00:00:53
              K>* 203.12.160.35/32 [0/0] via <<hidden>>, vmx2, 00:00:53
              K>* 203.12.160.36/32 [0/0] via <<hidden>>, vmx2, 00:00:53
              GTpfsense01.<<hidden>># 
              

              This was about a minute after boot up. No issues with K routes. All 4 expected ones - most crucially the default - are all there.

              So something in the 9.1.x series broke K routes on FreeBSD14/15 when starting the FRR service. If you "tickle" things while the service is running, such as a remote interface shutdown and unshut, or a config change and change-back, the K routes can be coaxed out, but obviously this is not workable or practical.

              https://frrouting.org/release/9.1/
              "FRR 9.1 brings a long list of enhancements and fixes with 941 commits from 73 developers."

              I scanned the CI
              https://ci1.netdef.org/browse/FRR-FRR121/

              Then scanned the tests for FreeBSD
              https://ci1.netdef.org/browse/FRR-FRR121-FBSD14AMD-101/test

              That all seems to be just BGP specific. There doesn't seem to be any CI tests specifically for FreeBSD and this functionality of K routes. No wonder regressions come in like this - if no one is testing for it. Geez, what a nightmare - trying to find out which of the 941 commits to 9.1 broke it on FreeBSD.

              Maybe Alexander Skorichenko askorichenko@netgate.com can provide some input, as he signed off on one of the changes backported to 9.1 https://ci1.netdef.org/browse/FRR-FRR121-54

              Do you think it could be related to my NIC type? I am using VMware vmxnet3 for both production and lab. I can rebuild in my GNS3 lab with igbX NICs to see if that changes anything. Mind you, in production I am tied to vmxnet3, as anything other than the paravirtualized vmxnet3 NICs gives comparatively poor performance (the alternative being e1000e, but that does not perform well at scale). So it would only be for information gathering. vmxnet3 works wonderfully with FRR 9.0.x, and I shouldn't have to change virtualized NIC types because someone broke vmxnet3. But I'll test anyway and see what I get.

              • M Offline
                marcosm Netgate
                last edited by marcosm

                Thank you for testing. Let's try our luck with frr10.1 then. We can determine the next steps after that.

                • G Offline
                  Gcon @marcosm
                  last edited by Gcon

                  @marcosm Thank you so much for your package builds.

                  9.0.3 is on the left, and 10.1 is on the right:
                  dd8ba9f6-4a06-4640-b70f-61f0ac30c9a1-image.png

                  In 10.1, these L routes (local) show up. First time I've seen those in FRR. They are some interface IPs. A few OSPF routes show up in that as well. Unfortunately no K routes.

                  The "tickle techniques" I have mentioned previously can still be used to coax the K routes back.

                  My current suspicion is that something changed from 9.0.x to 9.1 in relation to vmxnet3 NICs. I tried rebuilding with e1000e "emX" NICs but that didn't go so well: the remapping took about half an hour to process, and then after reboot things seemed to get stuck on loading VLAN interfaces.

                  I wouldn't want to have to change NIC types in production. Especially when vmxnet3 should work fine, and is the recommended type for VMware/ESXi https://docs.netgate.com/pfsense/en/latest/recipes/virtualize-esxi.html

                  I have always had Hardware TSO and LRO disabled. I tried also disabling hardware checksum offload but that didn't help.

                  EDIT: In the same lab I have a pfSense 2.7.2 CE box emulating Intel 82545EM NIC (similar to e1000e) and I upgraded that to FRR 10.1, and the kernel routes were still there afterwards. I am suspecting that FRR from 9.1 onwards doesn't play well with vmxnet3 paravirtualized NICs on FreeBSD14/15.

                  • M Offline
                    marcosm Netgate
                    last edited by marcosm

                    I wouldn't think the interface driver would matter in this case, but I suppose it's possible. Anyway, we'll try to get FRR10 in for 24.08. Given that testing shows this is more likely to be an issue with the package, I suggest looking through the upstream issues and opening an issue report there (and link it on the redmine).

                    For now, I will attach frr9.0.3 to the redmine to serve as a temporary workaround for those that need it.

                    • G Offline
                      Gcon @marcosm
                      last edited by

                      @marcosm I did some more testing, and you were right in thinking that the interface driver wouldn't matter in this case. I wanted confirmation though, so I made a completely separate GNS3 lab so that I wouldn't have to keep blacking out my WAN addresses and FQDNs. This is the lab:
                      5006bc8d-e377-4cc4-808e-5aa2cde9109a-image.png

                      The clouds are just so I can get HTTP access to the devices, and cloud1 also provides Internet to the lab, for the initial FRR package installs.

                      Both firewalls are pfSense CE 2.7.2 fresh installs.
                      They only differ in hardware for NIC types.
                      Firewall 1 emulates Intel 82576 (igb#)
                      Firewall 2 emulates VMware vmxnet3 (vmx#)

                      I checked the route tables in FRR when running the default FRR 9.0.2, and there was a single K route on both. I would assume that FRR 9.0.3 also has this, given my previous testing:

                      [2.7.2-RELEASE][root@firewall1.home.arpa]/root: vtysh
                      
                      Hello, this is FRRouting (version 9.0.2).
                      Copyright 1996-2005 Kunihiro Ishiguro, et al.
                      
                      firewall1.home.arpa# show ip route
                      Codes: K - kernel route, C - connected, S - static, R - RIP,
                             O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
                             v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
                             > - selected route, * - FIB route, q - queued, r - rejected, b - backup
                             t - trapped, o - offload failure
                      
                      K>* 0.0.0.0/0 [0/0] via 192.0.2.1, igb0, 00:00:21
                      C>* 192.0.2.0/29 [0/1] is directly connected, igb0, 00:00:21
                      C>* 192.168.41.0/24 [0/1] is directly connected, igb1, 00:00:21
                      C>* 192.168.57.0/24 [0/1] is directly connected, igb5, 00:00:21
                      firewall1.home.arpa# 
                      
                      
                      [2.7.2-RELEASE][root@firewall2.home.arpa]/root: vtysh
                      
                      Hello, this is FRRouting (version 9.0.2).
                      Copyright 1996-2005 Kunihiro Ishiguro, et al.
                      
                      firewall2.home.arpa# show ip route
                      Codes: K - kernel route, C - connected, S - static, R - RIP,
                             O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
                             v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
                             > - selected route, * - FIB route, q - queued, r - rejected, b - backup
                             t - trapped, o - offload failure
                      
                      K>* 0.0.0.0/0 [0/0] via 192.0.2.9, vmx0, 00:00:36
                      C>* 192.0.2.8/29 [0/1] is directly connected, vmx0, 00:00:36
                      C>* 192.168.42.0/24 [0/1] is directly connected, vmx1, 00:00:36
                      C>* 192.168.57.0/24 [0/1] is directly connected, vmx5, 00:00:36
                      firewall2.home.arpa# 
                      

                      I then stopped FRR, and installed FRR 9.1.1. Results:

                      [2.7.2-RELEASE][root@firewall1.home.arpa]/tmp: vtysh
                      
                      Hello, this is FRRouting (version 9.1.1).
                      Copyright 1996-2005 Kunihiro Ishiguro, et al.
                      
                      firewall1.home.arpa# show ip route
                      Codes: K - kernel route, C - connected, S - static, R - RIP,
                             O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
                             v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
                             > - selected route, * - FIB route, q - queued, r - rejected, b - backup
                             t - trapped, o - offload failure
                      
                      C>* 192.0.2.0/29 [0/1] is directly connected, igb0, 00:00:10
                      C>* 192.168.41.0/24 [0/1] is directly connected, igb1, 00:00:10
                      C>* 192.168.57.0/24 [0/1] is directly connected, igb5, 00:00:10
                      firewall1.home.arpa# 
                      
                      
                      [2.7.2-RELEASE][root@firewall2.home.arpa]/tmp: vtysh
                      
                      Hello, this is FRRouting (version 9.1.1).
                      Copyright 1996-2005 Kunihiro Ishiguro, et al.
                      
                      firewall2.home.arpa# show ip route
                      Codes: K - kernel route, C - connected, S - static, R - RIP,
                             O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
                             v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
                             > - selected route, * - FIB route, q - queued, r - rejected, b - backup
                             t - trapped, o - offload failure
                      
                      C>* 192.0.2.8/29 [0/1] is directly connected, vmx0, 00:01:19
                      C>* 192.168.42.0/24 [0/1] is directly connected, vmx1, 00:01:19
                      C>* 192.168.57.0/24 [0/1] is directly connected, vmx5, 00:01:19
                      firewall2.home.arpa# 
                      

                      So yes - NIC type doesn't seem to matter. There are over 20 NIC types to choose from in GNS3, but I'll stop at 2 for now. We know it is at least not limited to vmxnet3. It is also not related to OSPF, as I am not running it in this lab. I am not running any routing protocols. I also checked the routing tables in FRR 9.1.1 about 20 minutes afterwards, and there were still no K routes. I wasn't even able to coax the K routes out by rebooting ISP1 and ISP2. In this lab I only have a single WAN on each firewall, unlike my production lab, where I have dual WAN links.
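To make the before/after comparison less eyeball-driven, here is a minimal Python sketch that diffs two saved `show ip route` captures and lists which routes disappeared. The function names and the regular expression are my own illustration, not anything shipped with FRR. It assumes every route line starts with an upper-case code letter, two flag characters and a prefix, as in the outputs above.

```python
import re

# One route line, e.g. "K>* 0.0.0.0/0 [0/0] via 192.0.2.1, igb0, 00:00:21".
# Assumption: an upper-case code letter, two flag characters, then the prefix.
ROUTE_LINE = re.compile(r"^([A-Z])[>* ]{2}\s+(\S+/\d+)\s")


def routes(show_ip_route_text):
    """Return the set of (code, prefix) pairs found in one capture."""
    found = set()
    for line in show_ip_route_text.splitlines():
        match = ROUTE_LINE.match(line.strip())
        if match:
            found.add((match.group(1), match.group(2)))
    return found


def missing_routes(before_text, after_text):
    """Routes present in the first capture but absent from the second."""
    return sorted(routes(before_text) - routes(after_text))


# Shortened captures from firewall1: FRR 9.0.2 first, FRR 9.1.1 second.
BEFORE = """\
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
K>* 0.0.0.0/0 [0/0] via 192.0.2.1, igb0, 00:00:21
C>* 192.0.2.0/29 [0/1] is directly connected, igb0, 00:00:21
"""

AFTER = """\
C>* 192.0.2.0/29 [0/1] is directly connected, igb0, 00:00:10
"""

if __name__ == "__main__":
    for code, prefix in missing_routes(BEFORE, AFTER):
        print(f"missing after upgrade: {code} {prefix}")
```

In practice the two strings would come from files saved with something like `vtysh -c "show ip route"` before and after the package swap.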

                      Thanks for the link to the upstream issues. I'll go through them and see if it's known. If not then I'll log the issue.

                      • M Offline
                        marcosm Netgate
                        last edited by

                        If you're not running any dynamic routing on that lab and the default route is still missing - where is it supposed to come from? Is it just defined as a static route in Zebra?

                        • M Offline
                          michmoor LAYER 8 Rebel Alliance @marcosm
                          last edited by

                          @marcosm I have updated 15623 with my findings. 9.0.3 works without issue for displaying 0/0 route

                          Firewall: NetGate,Palo Alto-VM,Juniper SRX
                          Routing: Juniper, Arista, Cisco
                          Switching: Juniper, Arista, Cisco
                          Wireless: Unifi, Aruba IAP
                          JNCIP,CCNP Enterprise

                          • G Offline
                            Gcon @marcosm
                            last edited by Gcon

                            @marcosm I don't know why the default route still works when it goes missing in FRR. Perhaps it falls back to the system/kernel routing table for any destinations not found in the FRR RIB/FIB?

                            [2.7.2-RELEASE][root@firewall1.home.arpa]/root: vtysh
                            
                            Hello, this is FRRouting (version 9.1.1).
                            Copyright 1996-2005 Kunihiro Ishiguro, et al.
                            
                            firewall1.home.arpa# show ip route
                            Codes: K - kernel route, C - connected, S - static, R - RIP,
                                   O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
                                   v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
                                   > - selected route, * - FIB route, q - queued, r - rejected, b - backup
                                   t - trapped, o - offload failure
                            
                            C>* 192.0.2.0/29 [0/1] is directly connected, igb0, 00:12:30
                            C>* 192.168.41.0/24 [0/1] is directly connected, igb1, 00:12:30
                            C>* 192.168.57.0/24 [0/1] is directly connected, igb5, 00:12:30
                            firewall1.home.arpa# show ip fib
                            Codes: K - kernel route, C - connected, S - static, R - RIP,
                                   O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
                                   v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
                                   > - selected route, * - FIB route, q - queued, r - rejected, b - backup
                                   t - trapped, o - offload failure
                            
                            C>* 192.0.2.0/29 [0/1] is directly connected, igb0, 00:12:33
                            C>* 192.168.41.0/24 [0/1] is directly connected, igb1, 00:12:33
                            C>* 192.168.57.0/24 [0/1] is directly connected, igb5, 00:12:33
                            firewall1.home.arpa# exit
                            [2.7.2-RELEASE][root@firewall1.home.arpa]/root: netstat -rn4
                            Routing tables
                            
                            Internet:
                            Destination        Gateway            Flags     Netif Expire
                            default            192.0.2.1          UGS        igb0
                            127.0.0.1          link#8             UH          lo0
                            192.0.2.0/29       link#1             U          igb0
                            192.0.2.2          link#8             UHS         lo0
                            192.168.41.0/24    link#2             U          igb1
                            192.168.41.1       link#8             UHS         lo0
                            192.168.57.0/24    link#6             U          igb5
                            192.168.57.204     link#8             UHS         lo0
                            [2.7.2-RELEASE][root@firewall1.home.arpa]/root: traceroute 8.8.8.8
                            traceroute to 8.8.8.8 (8.8.8.8), 64 hops max, 40 byte packets
                             1  192.0.2.1 (192.0.2.1)  1.736 ms  1.638 ms  1.182 ms
                             2  192.168.1.1 (192.168.1.1)  1.958 ms  2.197 ms  1.552 ms
                             3  192.168.57.1 (192.168.57.1)  2.678 ms  2.682 ms  2.554 ms
                             4  10.231.48.10 (10.231.48.10)  23.731 ms  22.466 ms  21.816 ms
                             5  ae10.chw-ice301.sydney.telstra.net (203.50.61.65)  22.058 ms  22.454 ms
                                ae10.ken-ice301.sydney.telstra.net (203.50.61.81)  24.173 ms
                             6  bundle-ether25.hay-core30.sydney.telstra.net (203.50.61.80)  22.530 ms
                                bundle-ether25.stl-core30.sydney.telstra.net (203.50.61.64)  24.026 ms
                                bundle-ether25.hay-core30.sydney.telstra.net (203.50.61.80)  23.712 ms
                             7  bundle-ether1.chw-edge903.sydney.telstra.net (203.50.11.177)  22.088 ms  22.215 ms  22.276 ms
                             8  goo2503144.lnk.telstra.net (58.163.91.202)  23.689 ms
                                goo2503069.lnk.telstra.net (58.163.91.194)  22.960 ms
                                72.14.212.22 (72.14.212.22)  23.076 ms
                             9  192.178.97.87 (192.178.97.87)  24.997 ms
                                192.178.98.33 (192.178.98.33)  23.710 ms
                                192.178.98.21 (192.178.98.21)  23.375 ms
                            10  142.251.64.177 (142.251.64.177)  23.976 ms
                                142.251.64.179 (142.251.64.179)  23.778 ms
                                216.239.56.69 (216.239.56.69)  23.841 ms
                            11  dns.google (8.8.8.8)  27.750 ms  23.449 ms  23.632 ms
                            

                            In my basic lab here, I just have the gateway configured like so:
                            e2d815e3-a1ca-4382-97ed-80e9d1ff95d8-image.png

                            Just to recap the issue in my production network: since the default gateway (default route) doesn't show up as a kernel route in FRR (from FRR 9.1 onwards), things go wrong when the affected pfSense firewall learns a default route via OSPF from another router behind it (on the firewall "LAN"). All of the network's Internet traffic still arrives at the pfSense firewall, because it is still advertising the default route into the network with the lowest cost, but pfSense then sends the traffic back to the LAN router, back to where it came from. It crazily thinks that some high-cost OSPF route is a better option than its directly connected default gateway, and it shouldn't. This was working fine in 9.0.2 / 9.0.3.
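A rough sketch of why the missing K route matters for selection, assuming zebra prefers the candidate with the lowest administrative distance and then the lowest metric. The `[0/0]` values match the K route in the `show ip route` output, and 110 is the usual OSPF distance. The class name, the OSPF metric of 1000 and the next-hop labels are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    protocol: str   # e.g. "kernel" or "ospf"
    distance: int   # administrative distance, the first number in [d/m]
    metric: int     # the second number in [d/m]
    via: str        # next hop, a descriptive label only in this sketch


def select_best(candidates: List[Candidate]) -> Optional[Candidate]:
    """Pick the lowest (distance, metric) pair, or None if nothing is offered."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: (c.distance, c.metric))


# Hypothetical candidates for 0.0.0.0/0 on the main-site firewall.
KERNEL_DEFAULT = Candidate("kernel", 0, 0, "WAN gateway")
OSPF_DEFAULT = Candidate("ospf", 110, 1000, "LAN router")

if __name__ == "__main__":
    # With the kernel route visible (FRR 9.0.x) the WAN gateway is chosen.
    print(select_best([KERNEL_DEFAULT, OSPF_DEFAULT]).via)
    # With the kernel route missing (FRR 9.1.x) only the OSPF default is left.
    print(select_best([OSPF_DEFAULT]).via)
```

The point is only that once the kernel candidate is not in the list at all, the OSPF default wins by default, however high its cost.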

                            I guess not too many people run OSPF on their network with a competing default route? Otherwise people would be screaming about this issue all over the place. It would be broken for everyone right now on the new code, but the breakage only affects certain topologies.

                            The secondary default route I have is across a 700Mbps microwave link to a remote site that also has a pfSense firewall with a fairly low-speed Internet link. If the main site's Internet goes down, the customer traffic goes to the LAN router (a Mikrotik Cloud Core) as normal, then takes the high-cost OSPF route across the microwave link to the remote site and still has Internet (vs the low-cost OSPF route to the local pfSense router).

                            I fixed this issue on the day by turning off the remote firewall's advertisement of OSPF (and put in a temporary static route where needed). Then after hours I used the Netgate Installer to reinstall the previous version of pfSense Plus, to get the earlier FRR back. Now both are advertising their default routes and redundancy is fine. I just want to get this issue solved so I can upgrade the main site to pfSense Plus 24.3, 24.8, etc.

                            • M Offline
                              marcosm Netgate
                              last edited by

                              To clarify, is the default route missing from both Zebra and the kernel, or just Zebra?

                              It crazily thinks that some high-cost OSPF route is a better option than its directly connected default gateway, and it shouldn't.

                              Is this happening while the lower-cost route exists in the kernel? If so, is that happening with newly established traffic as well (as in not traffic for which states already exist)?

                              • M Offline
                                marcosm Netgate
                                last edited by

                                Please test this patched frr 9.1 version and let us know if the issue persists.

                                • G Offline
                                  Gcon @marcosm
                                  last edited by

                                  @marcosm I tested that in my production lab by upgrading the lab pfSense Plus 23.x to 24.x and seeing the breakage (K routes disappearing). Then I stopped FRR, applied that patched version, and started it again: the kernel routes showed up. I rebooted and still have the K routes.

                                  Is there a bug reference ID you can link to? I'm really curious! I've spent days on this and would love to find out.

                                  Would you recommend I use this in production? Maybe I am best waiting for 24.8 - where perhaps an updated FRR build will have more testing? Then I can skip 24.3 altogether and just go straight to 24.8.

                                  This patched version has this:

                                  configured with:
                                      '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-vtysh' '--disable-doc-html' '--sysconfdir=/var/etc/frr' '--localstatedir=/var/run/frr' '--disable-nhrpd' '--disable-pathd' '--disable-ospfclient' '--disable-pimd' '--disable-pbrd' '--with-vtysh-pager=cat' '--enable-backtrace' '--disable-config-rollbacks' '--disable-datacenter' '--enable-fpm' '--disable-ldpd' '--disable-doc' '--without-libpam' '--enable-rpki' '--disable-sharpd' '--disable-shell-access' '--enable-snmp' '--disable-tcmalloc' '--prefix=/usr/local' '--mandir=/usr/local/man' '--disable-silent-rules' '--infodir=/usr/local/share/info/' '--build=amd64-portbld-freebsd15.0' 'build_alias=amd64-portbld-freebsd15.0' 'PKG_CONFIG=pkgconf' 'PKG_CONFIG_LIBDIR=/wrkdirs/usr/ports/net/frr9/work/.pkgconfig:/usr/local/libdata/pkgconfig:/usr/local/share/pkgconfig:/usr/libdata/pkgconfig' 'CC=cc' 'CFLAGS=-O2 -pipe -fstack-protector-strong -fno-strict-aliasing ' 'LDFLAGS= -L/usr/local/lib -L/usr/local/lib -fstack-protector-strong ' 'LIBS=' 'CPPFLAGS=-I/usr/local/include -I/usr/local/include' 'CPP=cpp' 'CXX=c++' 'CXXFLAGS=-O2 -pipe -fstack-protector-strong -fno-strict-aliasing ' 'PYTHON=/usr/local/bin/python3.11'
                                  

                                  I compared it to other builds and nothing stands out. SNMP was off in one of the builds (for CE), and one of the other builds had "--mandir=/usr/local/share/man" instead of "--mandir=/usr/local/man", so I am thinking that the fix was more than just build config.
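For anyone who wants to repeat that comparison, a minimal Python sketch that diffs two "configured with" strings. The helper name is my own, and the two sample strings are cut down to a handful of flags taken from the text above. They are not the full configure lines of any real build.

```python
import shlex


def flag_diff(first, second):
    """Return (flags only in first, flags only in second), each sorted."""
    first_flags = set(shlex.split(first))
    second_flags = set(shlex.split(second))
    return sorted(first_flags - second_flags), sorted(second_flags - first_flags)


# Shortened, illustrative inputs in the same quoting style as the build output.
PATCHED = "'--enable-vtysh' '--enable-snmp' '--mandir=/usr/local/man'"
OTHER = "'--enable-vtysh' '--mandir=/usr/local/share/man'"

if __name__ == "__main__":
    only_patched, only_other = flag_diff(PATCHED, OTHER)
    print("only in patched build:", only_patched)
    print("only in other build:  ", only_other)
```

`shlex.split` handles the single-quoted flags, including ones with spaces inside the quotes such as the `CFLAGS=` entry.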

                                  In case this info is still required.... even though the root cause seems to have been identified/fixed....

                                  To clarify, is the default route missing from both Zebra and the kernel, or just Zebra?

                                  The default route was missing just from Zebra. It was in the kernel.

                                  Is this happening while the lower-cost route exists in the kernel?

                                  Yes that's right.

                                  If so, is that happening with newly established traffic as well (as in not traffic for which states already exist)?

                                  Yes. Internet web browsing to new websites was broken. Traffic would go from a workstation to the Mikrotik cloud core LAN router. The Mikrotik could see the default route to the local pfSense, and also a default route over the microwave link to the other site's pfSense. The microwave link has a high OSPF cost, so the LAN router would correctly send the Internet traffic to the local pfSense. But the local pfSense had an OSPF-learned default route to the remote site over the microwave link and no K route for the locally connected gateway, so it bounced the traffic back to the LAN router, which then sent it back to the local pfSense. You can see that with traceroutes: traffic oscillates between the firewall and the LAN router until the TTL expires.
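                                  As a rough illustration of how that ping-pong shows up in a traceroute, the Python sketch below flags any hop address that appears more than once. The function name and the addresses (documentation ranges) are invented for the example. It also assumes each router answers from a single address, which is not always true in practice.

```python
def repeated_hops(hops: list[str]) -> list[str]:
    """Return hop addresses seen more than once, in order of first repeat.

    On a loop-free path every router should appear only once, so any
    repeat suggests traffic is being bounced back and forth.
    """
    seen: set[str] = set()
    repeated: list[str] = []
    for hop in hops:
        if hop == "*":  # traceroute timeout, carries no address
            continue
        if hop in seen and hop not in repeated:
            repeated.append(hop)
        seen.add(hop)
    return repeated


# Hypothetical hop lists: LAN router = 192.0.2.1, local firewall = 192.0.2.2
looping = ["192.0.2.1", "192.0.2.2", "192.0.2.1", "192.0.2.2", "192.0.2.1"]
healthy = ["192.0.2.1", "192.0.2.2", "198.51.100.1"]

print("looping path repeats:", repeated_hops(looping))
print("healthy path repeats:", repeated_hops(healthy))
```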

                                  I don't know how it all works, but my experience suggests that if a route exists in Zebra and is subsequently added to the Zebra FIB, then this is the forwarding that gets used. If Zebra has no RIB/FIB entry, then it falls back to the system RIB/FIB (as given by "netstat -rn") before failing. This layering would make sense so that Zebra can start and stop with the least amount of impact. It's a massive danger though when the kernel routes don't get pushed from system/kernel to Zebra, because an incomplete view can lead to extremely poor routing decisions.
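                                  A quick way to spot that kind of mismatch is to compare what the kernel has with what Zebra reports as K routes. The Python sketch below does that for the IPv4 default route only. It assumes you have already saved the text of "netstat -rn" and of FRR's "show ip route" into strings. The sample text is illustrative and was not captured from my systems, and the function names are made up.

```python
import re


def kernel_has_default(netstat_output: str) -> bool:
    """True if any line of netstat -rn style output starts with 'default'."""
    return any(
        line.split()[:1] == ["default"] for line in netstat_output.splitlines()
    )


def zebra_kernel_prefixes(show_ip_route_output: str) -> set[str]:
    """Collect prefixes from lines whose route code starts with K (kernel)."""
    prefixes = set()
    for line in show_ip_route_output.splitlines():
        match = re.match(r"^K\S*\s+(\S+)", line)
        if match:
            prefixes.add(match.group(1))
    return prefixes


def default_missing_from_zebra(netstat_output: str, zebra_output: str) -> bool:
    """True when the kernel has a default route but Zebra shows no K 0.0.0.0/0."""
    return (
        kernel_has_default(netstat_output)
        and "0.0.0.0/0" not in zebra_kernel_prefixes(zebra_output)
    )


# Illustrative sample text (assumption: simplified, IPv4 only).
KERNEL = """Destination        Gateway            Flags     Netif Expire
default            203.0.113.1        UGS         em0
10.0.0.0/24        link#2             U           em1
"""
ZEBRA_BROKEN = """O>* 0.0.0.0/0 [110/1010] via 10.0.0.2, em1, 00:05:00
C>* 10.0.0.0/24 is directly connected, em1, 00:20:00
"""
ZEBRA_OK = """K>* 0.0.0.0/0 [0/0] via 203.0.113.1, em0, 00:20:00
C>* 10.0.0.0/24 is directly connected, em1, 00:20:00
"""

print("broken case flagged:", default_missing_from_zebra(KERNEL, ZEBRA_BROKEN))
print("healthy case flagged:", default_missing_from_zebra(KERNEL, ZEBRA_OK))
```

                                  In the broken sample, Zebra only knows the OSPF-learned default, which matches what I saw: the kernel default was there, but Zebra had no K entry for it.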

                                  • M Offline
                                    marcosm Netgate
                                    last edited by

                                    We found what looks to be the root cause - info has been posted to the Redmine report.

                                    The route redistribution issue still needs testing with the patched version, any help with that would be appreciated.

                                    I suggest waiting until we pick back the fix to 24.03 for your production systems.

                                    • M Offline
                                      mAineAc @marcosm
                                      last edited by

                                      @marcosm said in Updating to pfSense+ 24.3 breaks routing - kernel routes now gone:

                                      Please test this patched frr 9.1 version and let us know if the issue persists.

                                      How do you install this? Sorry, I'm pretty new. Can I just scp this to my Netgate 7100 and use some sort of package manager to install it? Is there any particular process that won't break further releases?

                                      • M Offline
                                        marcosm Netgate @mAineAc
                                        last edited by

                                        @mAineAc See the previous comment.

                                        • M Offline
                                          mAineAc @marcosm
                                          last edited by

                                          @marcosm Yeah, after installing, no change. Rebooted, still no change. I don't see the default route in FRR and it is not redistributing the default route.

                                          • M Offline
                                            marcosm Netgate @mAineAc
                                            last edited by marcosm

                                            @mAineAc Try to rule out configuration issues by verifying what version it last worked on.

                                            @Gcon The updated frr9 package is now available in 24.03. You can pull in the update by running pfSense-upgrade in the CLI. Please let us know if it works on your system(s).

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.