Updating to pfSense+ 24.3 breaks routing - kernel routes now gone
- 
 @marcosm I did some more testing, and you were right in thinking that the interface driver wouldn't matter in this case. I wanted confirmation though, so I made a completely separate GNS3 lab so that I wouldn't have to keep blacking out my WAN addresses and FQDNs. This is the lab: 
  The clouds are just so I can get HTTP access to the devices, and cloud1 also provides Internet to the lab, for the initial FRR package installs. Both firewalls are pfSense CE 2.7.2 fresh installs. 
 They only differ in hardware for NIC types.
 Firewall 1 emulates Intel 82576 (igb#)
 Firewall 2 emulates VMware vmxnet3 (vmx#)

 I checked the route tables in FRR when running the default FRR 9.0.2, and there was a single K route on both firewalls. I would assume that FRR 9.0.3 also has this, given my previous testing:

    [2.7.2-RELEASE][root@firewall1.home.arpa]/root: vtysh

    Hello, this is FRRouting (version 9.0.2).
    Copyright 1996-2005 Kunihiro Ishiguro, et al.

    firewall1.home.arpa# show ip route
    Codes: K - kernel route, C - connected, S - static, R - RIP,
           O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
           v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
           > - selected route, * - FIB route, q - queued, r - rejected, b - backup
           t - trapped, o - offload failure

    K>* 0.0.0.0/0 [0/0] via 192.0.2.1, igb0, 00:00:21
    C>* 192.0.2.0/29 [0/1] is directly connected, igb0, 00:00:21
    C>* 192.168.41.0/24 [0/1] is directly connected, igb1, 00:00:21
    C>* 192.168.57.0/24 [0/1] is directly connected, igb5, 00:00:21
    firewall1.home.arpa#

    [2.7.2-RELEASE][root@firewall2.home.arpa]/root: vtysh

    Hello, this is FRRouting (version 9.0.2).
    Copyright 1996-2005 Kunihiro Ishiguro, et al.

    firewall2.home.arpa# show ip route
    Codes: K - kernel route, C - connected, S - static, R - RIP,
           O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
           v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
           > - selected route, * - FIB route, q - queued, r - rejected, b - backup
           t - trapped, o - offload failure

    K>* 0.0.0.0/0 [0/0] via 192.0.2.9, vmx0, 00:00:36
    C>* 192.0.2.8/29 [0/1] is directly connected, vmx0, 00:00:36
    C>* 192.168.42.0/24 [0/1] is directly connected, vmx1, 00:00:36
    C>* 192.168.57.0/24 [0/1] is directly connected, vmx5, 00:00:36
    firewall2.home.arpa#

 I then stopped FRR and installed FRR 9.1.1. Results:

    [2.7.2-RELEASE][root@firewall1.home.arpa]/tmp: vtysh

    Hello, this is FRRouting (version 9.1.1).
    Copyright 1996-2005 Kunihiro Ishiguro, et al.

    firewall1.home.arpa# show ip route
    Codes: K - kernel route, C - connected, S - static, R - RIP,
           O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
           v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
           > - selected route, * - FIB route, q - queued, r - rejected, b - backup
           t - trapped, o - offload failure

    C>* 192.0.2.0/29 [0/1] is directly connected, igb0, 00:00:10
    C>* 192.168.41.0/24 [0/1] is directly connected, igb1, 00:00:10
    C>* 192.168.57.0/24 [0/1] is directly connected, igb5, 00:00:10
    firewall1.home.arpa#

    [2.7.2-RELEASE][root@firewall2.home.arpa]/tmp: vtysh

    Hello, this is FRRouting (version 9.1.1).
    Copyright 1996-2005 Kunihiro Ishiguro, et al.

    firewall2.home.arpa# show ip route
    Codes: K - kernel route, C - connected, S - static, R - RIP,
           O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
           v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
           > - selected route, * - FIB route, q - queued, r - rejected, b - backup
           t - trapped, o - offload failure

    C>* 192.0.2.8/29 [0/1] is directly connected, vmx0, 00:01:19
    C>* 192.168.42.0/24 [0/1] is directly connected, vmx1, 00:01:19
    C>* 192.168.57.0/24 [0/1] is directly connected, vmx5, 00:01:19
    firewall2.home.arpa#

 So yes, NIC type doesn't seem to matter. There are over 20 NIC types to choose from in GNS3, but I'll stop at 2 for now. We know it is at least not limited to vmxnet3. It is also not related to OSPF, as I am not running it, or any other routing protocol, in this lab. I also checked the routing tables in FRR 9.1.1 about 20 minutes afterwards, and there were still no K routes. I wasn't even able to coax the K routes out by rebooting ISP1 and ISP2. In this lab, I only have a single WAN on each firewall, unlike my production lab, where I have dual WAN links. Thanks for the link to the upstream issues. I'll go through them and see if it's known. If not, I'll log the issue. 
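 For anyone repeating this comparison across several FRR versions, the check can be scripted. The following is only a minimal sketch: it assumes the `show ip route` text has already been captured into a string (for example from `vtysh`), and the function names `routes_by_code` and `has_kernel_default` are made up for illustration. The sample data is the firewall1 output quoted above.

```python
import re

# Matches route lines such as "K>* 0.0.0.0/0 [0/0] via 192.0.2.1, igb0, 00:00:21".
# Legend lines ("Codes: K - kernel route, ...") do not match because no IPv4
# prefix follows the leading letter.
ROUTE_LINE = re.compile(r"^([A-Za-z])[>*\s]*(\d+\.\d+\.\d+\.\d+/\d+)\s")


def routes_by_code(show_ip_route_output):
    """Group IPv4 prefixes from `show ip route` text by one-letter route code."""
    routes = {}
    for line in show_ip_route_output.splitlines():
        match = ROUTE_LINE.match(line.strip())
        if match:
            code, prefix = match.groups()
            routes.setdefault(code, []).append(prefix)
    return routes


def has_kernel_default(show_ip_route_output):
    """Return True if a kernel (K) route for 0.0.0.0/0 is listed."""
    return "0.0.0.0/0" in routes_by_code(show_ip_route_output).get("K", [])


# Sample data copied from the firewall1 output in this thread.
FRR_9_0_2_OUTPUT = """
Codes: K - kernel route, C - connected, S - static, R - RIP,
K>* 0.0.0.0/0 [0/0] via 192.0.2.1, igb0, 00:00:21
C>* 192.0.2.0/29 [0/1] is directly connected, igb0, 00:00:21
C>* 192.168.41.0/24 [0/1] is directly connected, igb1, 00:00:21
C>* 192.168.57.0/24 [0/1] is directly connected, igb5, 00:00:21
"""

FRR_9_1_1_OUTPUT = """
Codes: K - kernel route, C - connected, S - static, R - RIP,
C>* 192.0.2.0/29 [0/1] is directly connected, igb0, 00:00:10
C>* 192.168.41.0/24 [0/1] is directly connected, igb1, 00:00:10
C>* 192.168.57.0/24 [0/1] is directly connected, igb5, 00:00:10
"""

if __name__ == "__main__":
    print("9.0.2 has kernel default:", has_kernel_default(FRR_9_0_2_OUTPUT))
    print("9.1.1 has kernel default:", has_kernel_default(FRR_9_1_1_OUTPUT))
```

 The parser only looks at IPv4 prefixes and the first letter of each route line, which is enough to spot a missing K route but not a general parser for FRR output.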
- 
 If you're not running any dynamic routing on that lab and the default route is still missing - where is it supposed to come from? Is it just defined as a static route in Zebra? 
- 
 @marcosm I have updated 15623 with my findings. FRR 9.0.3 displays the 0/0 route without issue. 
- 
 @marcosm I don't know why the default route still works when it goes missing in FRR. Perhaps it falls back to the system/kernel routing table for any destinations not found in the FRR RIB/FIB?

    [2.7.2-RELEASE][root@firewall1.home.arpa]/root: vtysh

    Hello, this is FRRouting (version 9.1.1).
    Copyright 1996-2005 Kunihiro Ishiguro, et al.

    firewall1.home.arpa# show ip route
    Codes: K - kernel route, C - connected, S - static, R - RIP,
           O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
           v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
           > - selected route, * - FIB route, q - queued, r - rejected, b - backup
           t - trapped, o - offload failure

    C>* 192.0.2.0/29 [0/1] is directly connected, igb0, 00:12:30
    C>* 192.168.41.0/24 [0/1] is directly connected, igb1, 00:12:30
    C>* 192.168.57.0/24 [0/1] is directly connected, igb5, 00:12:30
    firewall1.home.arpa# show ip fib
    Codes: K - kernel route, C - connected, S - static, R - RIP,
           O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
           v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
           > - selected route, * - FIB route, q - queued, r - rejected, b - backup
           t - trapped, o - offload failure

    C>* 192.0.2.0/29 [0/1] is directly connected, igb0, 00:12:33
    C>* 192.168.41.0/24 [0/1] is directly connected, igb1, 00:12:33
    C>* 192.168.57.0/24 [0/1] is directly connected, igb5, 00:12:33
    firewall1.home.arpa# exit

    [2.7.2-RELEASE][root@firewall1.home.arpa]/root: netstat -rn4
    Routing tables

    Internet:
    Destination        Gateway            Flags     Netif Expire
    default            192.0.2.1          UGS        igb0
    127.0.0.1          link#8             UH          lo0
    192.0.2.0/29       link#1             U          igb0
    192.0.2.2          link#8             UHS         lo0
    192.168.41.0/24    link#2             U          igb1
    192.168.41.1       link#8             UHS         lo0
    192.168.57.0/24    link#6             U          igb5
    192.168.57.204     link#8             UHS         lo0

    [2.7.2-RELEASE][root@firewall1.home.arpa]/root: traceroute 8.8.8.8
    traceroute to 8.8.8.8 (8.8.8.8), 64 hops max, 40 byte packets
     1  192.0.2.1 (192.0.2.1)  1.736 ms  1.638 ms  1.182 ms
     2  192.168.1.1 (192.168.1.1)  1.958 ms  2.197 ms  1.552 ms
     3  192.168.57.1 (192.168.57.1)  2.678 ms  2.682 ms  2.554 ms
     4  10.231.48.10 (10.231.48.10)  23.731 ms  22.466 ms  21.816 ms
     5  ae10.chw-ice301.sydney.telstra.net (203.50.61.65)  22.058 ms  22.454 ms
        ae10.ken-ice301.sydney.telstra.net (203.50.61.81)  24.173 ms
     6  bundle-ether25.hay-core30.sydney.telstra.net (203.50.61.80)  22.530 ms
        bundle-ether25.stl-core30.sydney.telstra.net (203.50.61.64)  24.026 ms
        bundle-ether25.hay-core30.sydney.telstra.net (203.50.61.80)  23.712 ms
     7  bundle-ether1.chw-edge903.sydney.telstra.net (203.50.11.177)  22.088 ms  22.215 ms  22.276 ms
     8  goo2503144.lnk.telstra.net (58.163.91.202)  23.689 ms
        goo2503069.lnk.telstra.net (58.163.91.194)  22.960 ms
        72.14.212.22 (72.14.212.22)  23.076 ms
     9  192.178.97.87 (192.178.97.87)  24.997 ms
        192.178.98.33 (192.178.98.33)  23.710 ms
        192.178.98.21 (192.178.98.21)  23.375 ms
    10  142.251.64.177 (142.251.64.177)  23.976 ms
        142.251.64.179 (142.251.64.179)  23.778 ms
        216.239.56.69 (216.239.56.69)  23.841 ms
    11  dns.google (8.8.8.8)  27.750 ms  23.449 ms  23.632 ms

 In my basic lab here, I just have the gateway configured like so: 
  Just to recap the issue in my production network: from FRR 9.1 onwards, the default gateway (default route) doesn't show up as a kernel route in FRR. When the affected pfSense firewall learns a default route via OSPF from another router behind it (on the firewall "LAN"), all of the network's Internet traffic still arrives at the pfSense firewall, because it is still advertising the default route into the network and has the lowest cost. But pfSense then decides to send the traffic back to the LAN router, back to where it came from. It crazily thinks that some high-cost OSPF route is a better option than its directly connected default gateway, and it shouldn't. It was working fine before that, in 9.0.2 / 9.0.3.

 I guess not too many people run OSPF on their network with a competing default route? Otherwise people would be screaming about this issue all over the place. It would be broken for everyone right now on the new code, but the breakage only affects certain topologies.

 The secondary default route I have is across a 700Mbps microwave link to a remote site that also has a pfSense firewall with a fairly low-speed Internet link. If the main site's Internet goes down, the customer traffic goes to the LAN router (a Mikrotik Cloud Core router) as normal, then takes the high-cost OSPF route across the microwave link to the remote site and still has Internet (instead of taking the low-cost OSPF route to the local pfSense router).

 I fixed this issue on the day by turning off the remote firewall's OSPF advertisement (and putting in a temporary static route where needed). Then, after hours, I used the Netgate Installer to reinstall the previous version of pfSense Plus to get the earlier FRR back. Now both are advertising their default routes and redundancy is fine. I just want to get this issue solved so I can upgrade the main site to pfSense Plus 24.3... 24.8 etc. 
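 The symptom is consistent with ordinary administrative-distance selection working on an incomplete set of candidates. The following is a simplified sketch of that idea, not FRR's actual code: the `Route` class and `select_best` function are invented for illustration. The kernel route's `[0/0]` comes from the `show ip route` output earlier in the thread, and 110 is FRR's default distance for OSPF. The OSPF metric and the LAN router next hop are made-up values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    source: str    # where the route was learned, e.g. "kernel" or "ospf"
    distance: int  # administrative distance; lower is preferred
    metric: int    # tie-breaker within the same distance; lower is preferred
    next_hop: str


def select_best(candidates):
    """Return the candidate with the lowest (distance, metric), or None."""
    return min(candidates, key=lambda r: (r.distance, r.metric), default=None)


# [0/0] via 192.0.2.1 is taken from the thread's FRR 9.0.2 output.
KERNEL_DEFAULT = Route("0.0.0.0/0", "kernel", 0, 0, "192.0.2.1")
# Hypothetical OSPF-learned default via the LAN router (address and metric assumed).
OSPF_DEFAULT = Route("0.0.0.0/0", "ospf", 110, 1000, "192.168.41.2")

if __name__ == "__main__":
    # Both candidates known: the kernel route wins on distance.
    print(select_best([KERNEL_DEFAULT, OSPF_DEFAULT]).source)
    # Kernel route never imported: the OSPF route is the only candidate left.
    print(select_best([OSPF_DEFAULT]).source)
```

 In this model the high OSPF cost never gets a chance to matter: once the kernel route is absent from the candidate list, the OSPF default wins by default rather than on merit.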
- 
 To clarify, is the default route missing from both Zebra and the kernel, or just Zebra? "It crazily thinks that some high-cost OSPF route is a better option than its directly connected default gateway, and it shouldn't." Is this happening while the lower-cost route exists in the kernel? If so, is that happening with newly established traffic as well (as in not traffic for which states already exist)? 
- 
 Please test this patched frr 9.1 version and let us know if the issue persists. 
- 
 @marcosm I tested that in my production lab by upgrading the lab pfSense Plus 23.x to 24.x and seeing the breakage (K routes disappearing). Then I stopped FRR, applied that patched version, and started it again: kernel routes showing up. Rebooted, and I still have the K routes. Is there a bug reference ID you can link to? I'm really curious! I've spent days on this and would love to find out.

 Would you recommend I use this in production? Maybe I am best waiting for 24.8, where perhaps an updated FRR build will have more testing? Then I can skip 24.3 altogether and go straight to 24.8.

 This patched version has this:

    configured with: '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-vtysh' '--disable-doc-html' '--sysconfdir=/var/etc/frr' '--localstatedir=/var/run/frr' '--disable-nhrpd' '--disable-pathd' '--disable-ospfclient' '--disable-pimd' '--disable-pbrd' '--with-vtysh-pager=cat' '--enable-backtrace' '--disable-config-rollbacks' '--disable-datacenter' '--enable-fpm' '--disable-ldpd' '--disable-doc' '--without-libpam' '--enable-rpki' '--disable-sharpd' '--disable-shell-access' '--enable-snmp' '--disable-tcmalloc' '--prefix=/usr/local' '--mandir=/usr/local/man' '--disable-silent-rules' '--infodir=/usr/local/share/info/' '--build=amd64-portbld-freebsd15.0' 'build_alias=amd64-portbld-freebsd15.0' 'PKG_CONFIG=pkgconf' 'PKG_CONFIG_LIBDIR=/wrkdirs/usr/ports/net/frr9/work/.pkgconfig:/usr/local/libdata/pkgconfig:/usr/local/share/pkgconfig:/usr/libdata/pkgconfig' 'CC=cc' 'CFLAGS=-O2 -pipe -fstack-protector-strong -fno-strict-aliasing ' 'LDFLAGS= -L/usr/local/lib -L/usr/local/lib -fstack-protector-strong ' 'LIBS=' 'CPPFLAGS=-I/usr/local/include -I/usr/local/include' 'CPP=cpp' 'CXX=c++' 'CXXFLAGS=-O2 -pipe -fstack-protector-strong -fno-strict-aliasing ' 'PYTHON=/usr/local/bin/python3.11'

 I compared it to other builds and nothing stands out. SNMP was off in one of the builds (for CE), and one of the other builds had "--mandir=/usr/local/share/man" instead of "--mandir=/usr/local/man", so I am thinking that the fix was more than just build config.

 In case this info is still required, even though the root cause seems to have been identified/fixed:

 "To clarify, is the default route missing from both Zebra and the kernel, or just Zebra?"
 The default route was missing just from Zebra. It was in the kernel.

 "Is this happening while the lower-cost route exists in the kernel?"
 Yes, that's right.

 "If so, is that happening with newly established traffic as well (as in not traffic for which states already exist)?"
 Yes. Internet web browsing to new websites was broken. Traffic would go from a workstation to the Mikrotik Cloud Core LAN router. The Mikrotik could see the default route to the local pfSense, and also a default route over the microwave link to the other site's pfSense. The microwave link has a high OSPF cost, so the LAN router would correctly send the Internet traffic to the local pfSense. But the local pfSense had an OSPF-learned route to the remote site over the microwave link and no K route for the locally connected gateway, so it bounced the traffic back to the LAN router, which then sent it back to the local pfSense. You can see that in traceroutes: traffic oscillates between the firewall and the LAN router until the TTL expires.

 I don't know how it all works, but my experience suggests that if a route exists in Zebra and is subsequently added to the Zebra FIB, then this is the forwarding that gets used. If Zebra has no RIB/FIB entry, then it falls back to the system RIB/FIB (as given by "netstat -rn") before failing. This layering would make sense so that Zebra can start and stop with the least amount of impact. It's a massive danger, though, when the kernel routes don't get pushed from the system/kernel to Zebra, because an incomplete view can lead to extremely poor routing decisions. 
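 The bounce between the firewall and the LAN router can be illustrated with a toy hop-by-hop model. This is only a sketch under stated assumptions: the node labels (`workstation`, `lan-router`, `pfsense-local`, `isp-gateway`, `internet`) are invented stand-ins for the devices described above, each node has exactly one next hop for Internet-bound traffic, and `max_ttl` plays the role of the packet's TTL.

```python
def trace(next_hop_by_node, start, destination, max_ttl=8):
    """Follow next hops from start until destination is reached or TTL runs out.

    Returns the list of nodes visited, including the starting node.
    """
    path = [start]
    node = start
    for _ in range(max_ttl):
        node = next_hop_by_node[node]
        path.append(node)
        if node == destination:
            break
    return path


# Healthy state: the local firewall forwards to its directly connected gateway.
HEALTHY = {
    "workstation": "lan-router",
    "lan-router": "pfsense-local",
    "pfsense-local": "isp-gateway",
    "isp-gateway": "internet",
}

# Broken state: the local firewall prefers the OSPF-learned default, whose
# next hop is the LAN router, which in turn still prefers the local firewall.
BROKEN = dict(HEALTHY, **{"pfsense-local": "lan-router"})

if __name__ == "__main__":
    print(" -> ".join(trace(HEALTHY, "workstation", "internet")))
    print(" -> ".join(trace(BROKEN, "workstation", "internet", max_ttl=6)))
```

 In the broken state the path alternates between the two routers and never reaches the destination, which matches the traceroute pattern of traffic oscillating until the TTL expires.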
- 
 We found what looks to be the root cause; info has been posted to the Redmine report. The route redistribution issue still needs testing with the patched version, and any help with that would be appreciated. I suggest waiting until we pick back the fix to 24.03 for your production systems. 
- 
 @marcosm said in Updating to pfSense+ 24.3 breaks routing - kernel routes now gone: "Please test this patched frr 9.1 version and let us know if the issue persists."
 How do you install this? Sorry, I'm pretty new. Can I just scp this to my Netgate 7100 and use some sort of package manager to install it? Is there any particular process that won't break further releases? 
- 
 @mAineAc See the previous comment. 
- 
 @marcosm Yeah, after installing, no change. Rebooted, no change. I don't see the default route in FRR and it is not redistributing the default route. 
- 
 @marcosm I just tested in my production simulation lab and all looks good. I'll update the actual production firewall this weekend. This is a great result - thanks so much for your efforts - it's really appreciated. 
- 
 @marcosm Will this be coming to 24.08.a.20240702.0600? I am running this build, the package listed does not seem to work, and I am still having the same issue. I have not seen any updated packages. 
- 
 @mAineAc No - you'd have to build/install it manually for the public dev build. I'm not aware of any official bug report for the issue you're experiencing. My suggestion is to treat it like any other bug report: provide steps to reproduce it, and determine if it's a regression by finding the version(s) of the related software when it last worked. 
- 
 I'm just following up on this. We tried upgrading from PFS 22.05/FRR 7.5.1 to PFS 24.11/FRR 9.1.2. We found that traffic was spotty and simply wouldn't route properly. If we turned down one of the 2 peers, traffic would work perfectly, but as long as both peers were up, traffic was spotty and would drop. We would like to stick with a Netgate router, but at this point we are looking to switch over to a Cisco ASR instead. 22.05 would be fine for us to stay on, but unfortunately we can't downgrade a router and install the older FRR anymore due to a PHP error. 
- 
 @Kevin-S-Pare Out of curiosity, do you have a high level diagram of how the pfsense is routing? Is a pfsense box with 2x upstream peers terminated on the same firewall? Is this OSPF or BGP? 
- 
 You got it. Two peers, advertising 2 /24s with BGP. Nothing fancy and quite basic. 
- 
 @Kevin-S-Pare Yeah, pretty basic, I agree. 
 So when you advertise your routes to both peers, what happens? I take it your upstream imports the routes and sends it out to their peers.
 What specifically is happening? So say you have Upstream1 and Upstream2. You are advertising your routes to both Upstreams and return traffic comes back on Upstream2 (don't know how you are steering traffic into your AS). What is spotty?
- 
 @michmoor What ends up happening is that traffic is either not going out or not getting back. Traceroutes show as OK, and so do pings, but when we try to get out to websites, only certain ones work. They will work for a period, then the route is lost and we are unable to hit the site again. I was upgrading from an HP server to a Netgate 8200, so we just went back to the old box and all works perfectly fine. Here's a cleansed version of my config.
 ##################### DO NOT EDIT THIS FILE! ######################
 ###################################################################
 # This file was created by an automatic configuration generator.  #
 # The contents of this file will be overwritten without warning!  #
 ###################################################################
 !
 frr defaults traditional
 hostname hostname
 password password
 ip nht resolve-via-default
 service integrated-vtysh-config
 !
 router bgp 3
  bgp log-neighbor-changes
  bgp router-id 192.168.1.2
  no bgp network import-check
  bgp deterministic-med
  bgp always-compare-med
  bgp bestpath as-path multipath-relax
  neighbor 192.168.1.1 remote-as 1
  neighbor 192.168.1.1 description Peer1
  neighbor 192.168.1.1 timers 20 60
  neighbor 192.168.2.1 remote-as 2
  neighbor 192.168.2.1 description Peer2
  neighbor 192.168.2.1 timers 20 90
  !
  address-family ipv4 unicast
   network 192.168.10.0/24
   network 192.168.11.0/24
   neighbor 192.168.1.1 activate
   neighbor 192.168.2.1 activate
   no neighbor 192.168.1.1 send-community
   neighbor 192.168.1.1 next-hop-self
   neighbor 192.168.1.1 prefix-list PEER1-IN in
   neighbor 192.168.1.1 prefix-list PEER1-OUT out
   no neighbor 192.168.2.1 send-community
   neighbor 192.168.2.1 next-hop-self
   neighbor 192.168.2.1 prefix-list PEER2-IN in
   neighbor 192.168.2.1 prefix-list PEER2-OUT out
  exit-address-family
 !
 !
 ip prefix-list PEER1-IN seq 10 deny 0.0.0.0/8 le 32
 ip prefix-list PEER1-IN seq 20 deny 10.0.0.0/8 le 32
 ip prefix-list PEER1-IN seq 30 deny 127.0.0.0/8 le 32
 ip prefix-list PEER1-IN seq 40 deny 169.254.0.0/16 le 32
 ip prefix-list PEER1-IN seq 50 deny 172.16.0.0/12 le 32
 ip prefix-list PEER1-IN seq 60 deny 192.0.0.0/24 le 32
 ip prefix-list PEER1-IN seq 70 deny 192.0.2.0/24 le 32
 ip prefix-list PEER1-IN seq 80 deny 192.168.0.0/16 le 32
 ip prefix-list PEER1-IN seq 90 deny 198.18.0.0/15 le 32
 ip prefix-list PEER1-IN seq 100 deny 198.51.100.0/24 le 32
 ip prefix-list PEER1-IN seq 110 deny 203.0.113.0/24 le 32
 ip prefix-list PEER1-IN seq 120 deny 224.0.0.0/4 le 32
 ip prefix-list PEER1-IN seq 130 permit 0.0.0.0/0 le 32
 ip prefix-list PEER1-OUT seq 10 permit 192.168.10.0/24
 ip prefix-list PEER1-OUT seq 11 permit 192.168.11.0/24
 ip prefix-list PEER2-IN seq 10 deny 0.0.0.0/8 le 32
 ip prefix-list PEER2-IN seq 20 deny 10.0.0.0/8 le 32
 ip prefix-list PEER2-IN seq 30 deny 127.0.0.0/8 le 32
 ip prefix-list PEER2-IN seq 40 deny 169.254.0.0/16 le 32
 ip prefix-list PEER2-IN seq 50 deny 172.16.0.0/12 le 32
 ip prefix-list PEER2-IN seq 60 deny 192.0.0.0/24 le 32
 ip prefix-list PEER2-IN seq 70 deny 192.0.2.0/24 le 32
 ip prefix-list PEER2-IN seq 80 deny 192.168.0.0/16 le 32
 ip prefix-list PEER2-IN seq 90 deny 198.18.0.0/15 le 32
 ip prefix-list PEER2-IN seq 100 deny 198.51.100.0/24 le 32
 ip prefix-list PEER2-IN seq 110 deny 203.0.113.0/24 le 32
 ip prefix-list PEER2-IN seq 120 deny 224.0.0.0/4 le 32
 ip prefix-list PEER2-IN seq 130 permit 0.0.0.0/0 le 32
 ip prefix-list PEER2-OUT seq 10 permit 192.168.11.0/24
 ip prefix-list PEER2-OUT seq 11 permit 192.168.10.0/24
 !
 route-map ALLOW-ALL permit 100
 !
 line vty
 !
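 To sanity-check what the prefix lists in a config like this accept, the first-match logic can be modelled offline. The sketch below is an approximation written for illustration, not FRR's implementation: `parse_prefix_list` and `evaluate` are hypothetical helper names, only IPv4 and the `ge`/`le` options are handled, and the sample lines are a subset of the sanitized config above (so the addresses are placeholders, not the real advertised space).

```python
import ipaddress


def parse_prefix_list(config_text, name):
    """Collect 'ip prefix-list NAME seq N permit|deny PREFIX [ge X] [le Y]' entries."""
    entries = []
    for line in config_text.splitlines():
        tokens = line.split()
        if len(tokens) < 7 or tokens[:3] != ["ip", "prefix-list", name] or tokens[3] != "seq":
            continue
        seq, action = int(tokens[4]), tokens[5]
        network = ipaddress.ip_network(tokens[6])
        # Remaining tokens come in pairs such as "le 32".
        options = dict(zip(tokens[7::2], map(int, tokens[8::2])))
        entries.append((seq, action, network, options.get("ge"), options.get("le")))
    return sorted(entries, key=lambda entry: entry[0])


def evaluate(entries, prefix):
    """Return (action, seq) of the first matching entry, or ("deny", None)."""
    candidate = ipaddress.ip_network(prefix)
    for seq, action, network, ge, le in entries:
        if not candidate.subnet_of(network):
            continue
        if ge is None and le is None:
            # Without ge/le, only the exact prefix length matches.
            matched = candidate.prefixlen == network.prefixlen
        else:
            low = ge if ge is not None else network.prefixlen
            high = le if le is not None else 32
            matched = low <= candidate.prefixlen <= high
        if matched:
            return action, seq
    return "deny", None  # implicit deny when nothing matches


SAMPLE_CONFIG = """
ip prefix-list PEER1-IN seq 20 deny 10.0.0.0/8 le 32
ip prefix-list PEER1-IN seq 80 deny 192.168.0.0/16 le 32
ip prefix-list PEER1-IN seq 130 permit 0.0.0.0/0 le 32
ip prefix-list PEER1-OUT seq 10 permit 192.168.10.0/24
ip prefix-list PEER1-OUT seq 11 permit 192.168.11.0/24
"""

PEER1_IN = parse_prefix_list(SAMPLE_CONFIG, "PEER1-IN")
PEER1_OUT = parse_prefix_list(SAMPLE_CONFIG, "PEER1-OUT")

if __name__ == "__main__":
    for prefix in ("0.0.0.0/0", "8.8.8.0/24", "10.1.0.0/16"):
        print("PEER1-IN ", prefix, evaluate(PEER1_IN, prefix))
    for prefix in ("192.168.10.0/24", "192.168.10.0/25"):
        print("PEER1-OUT", prefix, evaluate(PEER1_OUT, prefix))
```

 Under this model the inbound lists accept a default route as well as more specific prefixes (seq 130 permits 0.0.0.0/0 with `le 32`), while the outbound lists permit only the two exact /24s.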
 
