2.5.1: missing route to localhost (no joke)
-
Here's a "funny" one:
Upgraded one node to 2.5.1 from 2.5.0 and my DNS resolver at localhost stopped working. In fact, it's IPv4 localhost that stopped working!
2.5.0:
[2.5.0-RELEASE][xx]/root: netstat -rn | grep ^127
127.0.0.1          link#3             UH          lo0
[2.5.0-RELEASE][xx]/root: ifconfig lo0
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
    options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    inet6 ::1 prefixlen 128
    inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
    inet6 xx prefixlen 128
    inet6 xx prefixlen 128
    inet 127.0.0.1 netmask 0xff000000
    inet xx netmask 0xffffffff
    inet xx netmask 0xffffffff
    groups: lo
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
[2.5.0-RELEASE][xx]/root: ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=4.759 ms
2.5.1:
[2.5.1-RELEASE][xx]/root: netstat -rn | grep ^127 | wc -l
0
[2.5.1-RELEASE][xx]/root: ifconfig lo0
lo0: flags=8149<UP,LOOPBACK,RUNNING,PROMISC,MULTICAST> metric 0 mtu 16384
    options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    inet6 ::1 prefixlen 128
    inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
    inet6 xx prefixlen 128
    inet6 xx prefixlen 128
    inet 127.0.0.1 netmask 0xff000000
    inet xx netmask 0xffffffff
    inet xx netmask 0xffffffff
    groups: lo
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
[2.5.1-RELEASE][xx]/root: ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1): 56 data bytes
ping: sendto: Can't assign requested address
How is it even possible for a directly connected route not to show in the routing table, unless it is specifically removed... Looks like some odd side effect where something went wrong while applying some specific config. Obviously there was no such issue with 2.5.0. Note that I do have config that touches lo0, and that's multiple secondary IPs (using the VIP functionality) that I use for routing.
Any clues?
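For anyone wanting to check their own nodes, here is a minimal sketch of the same test as a reusable function. The function name has_host_route is made up for illustration; it only inspects text in the netstat -rn layout shown above and does not change anything on the system.

```shell
#!/bin/sh
# has_host_route ADDR (hypothetical helper name):
# reads `netstat -rn` style output on stdin and succeeds only if some line
# has ADDR as its exact first column (the Destination column).
has_host_route() {
    awk -v dest="$1" '$1 == dest { found = 1 } END { exit !found }'
}

# Usage on the firewall (not executed here):
#   netstat -rn -f inet | has_host_route 127.0.0.1 || echo "loopback route missing"
```

Matching on the first column only avoids false positives from routes that merely use 127.0.0.1 as their gateway, which a plain grep for 127 would also match.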
-
Yeah, a couple of weeks ago, since 2.5.1, commands like dig, ping and others showed error messages saying that localhost (127.0.0.1) was absent.
To get it back: reboot.
I guess there is already a redmine issue for it.
-
Unable to reproduce, but it may be related to https://redmine.pfsense.org/issues/11806
You can try this patch: 221.diff
-
@gertjan Rebooted n times, no change.
-
@viktor_g said in 2.5.1: missing route to localhost (no joke):
Unable to reproduce
I was referring to these forum posts.
The issue should pop up when you declare something like this:
-
@viktor_g OK, that patch works, albeit not completely. I now have the route to 127.0.0.1 in the table, but another route for localhost (a secondary 172.16.x.x/32) has disappeared after rebooting, meaning my routing is now completely broken until I manually add that route pointing it to localhost, because I rely on this for BGP etc. Interestingly, there is another /32 on lo0 from the same range that I use as GRE source, and that was unaffected.
-
@612brokeaf said in 2.5.1: missing route to localhost (no joke):
@viktor_g OK, that patch works, albeit not completely. I now have the route to 127.0.0.1 in the table, but another route for localhost (a secondary 172.16.x.x/32) has disappeared after rebooting, meaning my routing is now completely broken until I manually add that route pointing it to localhost, because I rely on this for BGP etc. Interestingly, there is another /32 on lo0 from the same range that I use as GRE source, and that was unaffected.
Could you show your complete routing config?
-
@viktor_g Not unless you have a config sanitisation tool where I could securely paste the XML and hide config details.
I can describe what I have though.
I have a hub and spoke type setup with multiple pfSense hosts in different regions as the hub(s). Spokes are a mix of pfSense and traditional big name hardware vendors.
On each hub:
- 10+ pairs of IPSec tunnels over GRE (several spokes + full mesh between hubs), meaning 10+ GRE interfaces, times two - for each location there is v4 and v6 (IPSec tunnel + GRE for each)
- 4 x extra secondary IPs on lo0 (Firewall -> VIPs -> type: alias): two IPv4 (172.16.x.x/32) and two IPv6 (fd00:xx::xx/128). For both v4 and v6, one is used as GRE source and this -> remote is what the IPSec tunnels cover, and the other is a general loopback for services/router IDs/BGP peering.
- Running FRR with OSPF + OSPF3 to distribute v4 and v6 loopbacks, and BGP via those loopbacks, hubs run a route reflector.
- A single WAN interface with a primary static v4 IP and multiple secondary v4 IPs. Secondaries / extra v4 IPs are /32s, gateway is the primary gateway. Also a /56 public IPv6 on each, ND-RA/SLAAC with /64 PDs.
I think the issue possibly triggers when setting up aliases on lo0. After the upgrade from 2.5.0 to 2.5.1, the 127.0.0.1 route was gone from the routing table, even though the IP was configured correctly. After the patch you suggested, the route for 127.0.0.1 was in, but the route for another v4 alias on lo0 was not, while the third one was in. This broke most of my VPN tunnels, because some spokes have dynamic IPs and DNS is used to resolve the IPSec tunnel endpoints. Dnsmasq listens on 127.0.0.1 and this is what indirectly broke things. Before the patch I changed the local DNS server to 172.16.x.x as a workaround, but with the patch, that IP didn't make it into the routing table, resulting in the same issue.
For now I added manual shellcmds to install the missing lo0 routes on boot.
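A rough sketch of what such a boot-time workaround could look like follows. The exact commands used on this node were not posted, so everything here is an assumption: the function name, the address list variable, and the choice of `route add -host ADDR -iface lo0` as the route form. The function only prints the commands it would run, so the output can be reviewed before piping it to sh.

```shell
#!/bin/sh
# Assumed address list: 127.0.0.1 only. Append your own lo0 alias
# addresses (space separated); none are filled in here on purpose.
LO0_ADDRS="127.0.0.1"

# missing_lo0_route_cmds "ADDR ..." (hypothetical helper name):
# reads `netstat -rn` style output on stdin and prints one `route add`
# command for each listed address that has no entry in the Destination column.
missing_lo0_route_cmds() {
    table=$(cat)
    for addr in $1; do
        if ! printf '%s\n' "$table" |
            awk -v d="$addr" '$1 == d { f = 1 } END { exit !f }'; then
            # Assumption: an interface route on lo0 is the desired form.
            echo "route add -host $addr -iface lo0"
        fi
    done
}

# Usage on the firewall (not executed here):
#   netstat -rn -f inet | missing_lo0_route_cmds "$LO0_ADDRS"        # review
#   netstat -rn -f inet | missing_lo0_route_cmds "$LO0_ADDRS" | sh   # apply
```

Printing instead of executing keeps the script idempotent and safe to rerun: addresses that already have a route produce no output.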
For completeness: I have another manual modification in place, in /etc/inc/config.lib.inc, and that is changing alias_make_table(); to alias_make_table($config);, because otherwise I kept getting crash reports / PHP errors complaining about alias_make_table being called with zero arguments and expecting one. This was being triggered from the ACME cert renewal cron job. There is also another bug in ACME, complaining about the function getarraybyref() not found. Even though all PHP include chains look fine, I can't find another way to fix this than pasting that function into the same scope in ACME. This is for another topic though - this issue looked fixed in 2.5.0, but maybe I fixed it by hand and forgot about it until 2.5.1.
-
Instead of shellcmds, I added manual static routes to 127/8 and the two other /32s I have on lo0 in the GUI. The node now survives reboot entirely intact.
Side note: when adding static routes, the gateway / interface selection lists lo0 as "null4" and "null6" respectively. This naming is a little confusing: to a network engineer it looks like a blackhole route, and it is probably meant to be exactly that, but it also hints that there may be additional rules in place that actually drop the traffic rather than just push it to the CPU, much like the dedicated null interface found in other network OSes.
-
Correction: setting those missing loopback routes as static routes apparently only fixed it on one node and only temporarily.
@viktor_g looks like the patch did not change much - I ran debug on that function and it didn't seem to be touching the v4 loopback, so this may be elsewhere - possibly IPSec scripts, since there were so many fixes in 2.5.1? Probably a good test would be to stop/start IPSec and see if this breaks the loopback again, at least it would narrow this down somewhat.
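One way to run that stop/start test is to snapshot the routing table before and after and list what disappeared. This is only a sketch: the function name lost_routes and the /tmp file names are made up, and how IPSec gets restarted is left to whatever method you normally use.

```shell
#!/bin/sh
# lost_routes BEFORE_FILE AFTER_FILE (hypothetical helper name):
# prints every Destination (first column) that appears in BEFORE_FILE
# but not in AFTER_FILE. Both files hold `netstat -rn` style output.
lost_routes() {
    awk -v after="$2" '
        BEGIN {
            # Load the destinations that are still present afterwards.
            while ((getline line < after) > 0) {
                n = split(line, f, " ")
                if (n > 0) seen[f[1]] = 1
            }
        }
        NF && !($1 in seen) { print $1 }
    ' "$1"
}

# Usage on the firewall (not executed here):
#   netstat -rn -f inet > /tmp/routes.before
#   ... stop and start IPSec ...
#   netstat -rn -f inet > /tmp/routes.after
#   lost_routes /tmp/routes.before /tmp/routes.after
```

Header lines from netstat appear in both snapshots, so they are never reported as lost.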
Anyhow, I added shellcmds (regular, not early) adding routes to the various lo0 addresses, and that seems to have worked so far, over 10+ reboots. It's an ugly fix but I'm not touching it until some proper resolution comes up. I've run out of downtime credits for now so can't test much for the next few days.
-
@612brokeaf said in 2.5.1: missing route to localhost (no joke):
@viktor_g OK, that patch works, albeit not completely. I now have the route to 127.0.0.1 in the table, but another route for localhost (a secondary 172.16.x.x/32) has disappeared after rebooting, meaning my routing is now completely broken until I manually add that route pointing it to localhost, because I rely on this for BGP etc. Interestingly, there is another /32 on lo0 from the same range that I use as GRE source, and that was unaffected.
Unable to reproduce on the latest dev snapshot:
all OK after rebooting:
# netstat -rn | grep 127
5.5.5.0/24         127.0.0.1          UGSB        lo0
6.6.6.6/32         127.0.0.1          UGSB        lo0
127.0.0.1          link#5             UH          lo0
-
@612brokeaf said in 2.5.1: missing route to localhost (no joke):
For completeness: I have another manual modification in place, in /etc/inc/config.lib.inc, and that is changing alias_make_table(); to alias_make_table($config);, because otherwise I kept getting crash reports / PHP errors complaining about alias_make_table being called with zero arguments and expecting one. This was being triggered from the ACME cert renewal cron job. There is also another bug in ACME, complaining about the function getarraybyref() not found. Even though all PHP include chains look fine, I can't find another way to fix this than pasting that function into the same scope in ACME. This is for another topic though - this issue looked fixed in 2.5.0, but maybe I fixed it by hand and forgot about it until 2.5.1.
Please create a bug report about this issue:
https://docs.netgate.com/pfsense/en/latest/development/bug-reports.html