2.5.1: missing route to localhost (no joke)

612brokeaf

Here's a "funny" one:

Upgraded one node to 2.5.1 from 2.5.0 and my DNS resolver at localhost stopped working. In fact, it's IPv4 localhost that stopped working!

2.5.0:

[2.5.0-RELEASE][xx]/root: netstat -rn | grep ^127
127.0.0.1          link#3             UH          lo0
[2.5.0-RELEASE][xx]/root: ifconfig lo0
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
	options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
	inet6 ::1 prefixlen 128
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
	inet6 xx prefixlen 128
	inet6 xx prefixlen 128
	inet 127.0.0.1 netmask 0xff000000
	inet xx netmask 0xffffffff
	inet xx netmask 0xffffffff
	groups: lo
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
[2.5.0-RELEASE][xx]/root: ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=4.759 ms

2.5.1:

[2.5.1-RELEASE][xx]/root: netstat -rn | grep ^127 | wc -l
       0
[2.5.1-RELEASE][xx]/root: ifconfig lo0
lo0: flags=8149<UP,LOOPBACK,RUNNING,PROMISC,MULTICAST> metric 0 mtu 16384
	options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
	inet6 ::1 prefixlen 128
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
	inet6 xx prefixlen 128
	inet6 xx prefixlen 128
	inet 127.0.0.1 netmask 0xff000000
	inet xx netmask 0xffffffff
	inet xx netmask 0xffffffff
	groups: lo
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
[2.5.1-RELEASE][xx]/root: ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1): 56 data bytes
ping: sendto: Can't assign requested address

How is it even possible for a directly connected route not to show in the routing table, unless it is specifically removed... Looks like some odd side effect where something went wrong while applying some specific config. Obviously there was no such issue with 2.5.0. Note that I do have config that touches lo0, and that's multiple secondary IPs (using the VIP functionality) that I use for routing.
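
A quick way to confirm this kind of mismatch is to cross-check the addresses on lo0 against the routing table. The following is only a minimal sketch, not anything pfSense ships: the function name `unrouted_lo0_addrs` and the use of saved text files are assumptions made so the logic can be checked offline. On a live box the two files would be filled from `ifconfig lo0` and `netstat -rn -f inet`.

```shell
#!/bin/sh
# Sketch: report IPv4 addresses configured on lo0 that do not appear as a
# destination in the routing table. The function name is hypothetical.
#   $1 = file holding `ifconfig lo0` output
#   $2 = file holding `netstat -rn -f inet` output
unrouted_lo0_addrs() {
    ifconfig_out=$1
    netstat_out=$2
    awk '
        # Routing table file (first argument to awk): remember column 1,
        # which is the destination.
        FILENAME == ARGV[1] { routed[$1] = 1; next }
        # ifconfig file: "inet ADDR netmask ..." lines only (inet6 is skipped).
        $1 == "inet" && !($2 in routed) { print $2 }
    ' "$netstat_out" "$ifconfig_out"
}
```

With the 2.5.1 output above, this would print 127.0.0.1, since the address is configured but no route for it exists.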

Any clues?

Gertjan

@612brokeaf

Yeah, a couple of weeks ago, since 2.5.1, commands like dig, ping and others started showing error messages saying that localhost (127.0.0.1) was absent.

To get it back: reboot.

I guess there is already a redmine issue for it.

viktor_g

Unable to reproduce, but it may be related to https://redmine.pfsense.org/issues/11806

You can try this patch: 221.diff

612brokeaf

@gertjan Rebooted n times, no change.

Gertjan

@viktor_g said in 2.5.1: missing route to localhost (no joke):

Unable to reproduce

I was referring to these forum posts.

The issue should pop up when you declare something like this:

612brokeaf

@viktor_g OK, that patch works, albeit not completely. I now have the route to 127.0.0.1 in the table, but another route for localhost (a secondary 172.16.x.x/32) has disappeared after rebooting, meaning my routing is now completely broken until I manually add that route pointing it to localhost, because I rely on this for BGP etc. Interestingly, there is another /32 on lo0 from the same range that I use as GRE source, and that was unaffected.

viktor_g

@612brokeaf said in 2.5.1: missing route to localhost (no joke):

@viktor_g OK, that patch works, albeit not completely. I now have the route to 127.0.0.1 in the table, but another route for localhost (a secondary 172.16.x.x/32) has disappeared after rebooting, meaning my routing is now completely broken until I manually add that route pointing it to localhost, because I rely on this for BGP etc. Interestingly, there is another /32 on lo0 from the same range that I use as GRE source, and that was unaffected.

Could you show your complete routing config?

612brokeaf

@viktor_g Not unless you have a config sanitisation tool where I could securely paste the XML and hide config details.

I can describe what I have though.

I have a hub and spoke type setup with multiple pfSense hosts in different regions as the hub(s). Spokes are a mix of pfSense and traditional big name hardware vendors.

On each hub:

10+ pairs of IPSec tunnels over GRE (several spokes + full mesh between hubs), meaning 10+ GRE interfaces, times two - for each location there is v4 and v6 (IPSec tunnel + GRE for each)
4 x extra secondary IPs on lo0 (Firewall -> VIPs -> type: alias): two IPv4 (172.16.x.x/32) and two IPv6 (fd00:xx::xx/128). For both v4 and v6, one is used as GRE source and this -> remote is what the IPSec tunnels cover, and the other is a general loopback for services/router IDs/BGP peering.
Running FRR with OSPF + OSPF3 to distribute v4 and v6 loopbacks, and BGP via those loopbacks, hubs run a route reflector.
A single WAN interface with a primary static v4 IP and multiple secondary v4 IPs. Secondaries / extra v4 IPs are /32s, gateway is the primary gateway. Also a /56 public IPv6 on each, ND-RA/SLAAC with /64 PDs.

I think possibly the issue triggers when setting up aliases on lo0. After the upgrade from 2.5.0 to 2.5.1, the 127.0.0.1 route was gone from routing table, even though the IP was configured correctly. After the patch you suggested, the route for 127.0.0.1 was in, but the route for another v4 alias for lo0 was not, while the third one was in. This broke most of my VPN tunnels, because some spokes have dynamic IPs and DNS is used to resolve the IPSec tunnel endpoints. Dnsmasq listens on 127.0.0.1 and this is what indirectly broke things. Before the patch I changed the local DNS server to 172.16.x.x as a workaround, but with the patch, that IP didn't make it into the routing table, resulting in the same issue.

For now I added manual shellcmds to install the missing lo0 routes on boot.

For completeness: I have another manual modification in place, in /etc/inc/config.lib.inc, and that is changing alias_make_table(); to alias_make_table($config);, because otherwise I kept getting crash reports / PHP errors complaining about alias_make_table being called with zero arguments and expecting one. This was being triggered from the ACME cert renewal cron job. There is also another bug in ACME, complaining about the function getarraybyref() not found. Even though all PHP include chains look fine, I can't find another way to fix this than pasting that function into the same scope in ACME. This is for another topic though - this issue looked fixed in 2.5.0, but maybe I fixed it by hand and forgot about it until 2.5.1.

612brokeaf

Instead of shellcmds, I added manual static routes to 127/8 and the two other /32s I have on lo0 in the GUI. The node now survives reboot entirely intact.

Side note: when adding static routes, the gateway / interface selection lists lo0 as "null4" and "null6" respectively. This naming is a little confusing: to a network engineer these look like blackhole routes, and that is probably the intent, but it hints that there may be additional rules in place that actually drop the traffic rather than just pushing it to the CPU, much like the dedicated null interface in other network OSes.

612brokeaf

Correction: setting those missing loopback routes as static routes apparently only fixed it on one node and only temporarily.

@viktor_g looks like the patch did not change much - I ran debug on that function and it didn't seem to be touching the v4 loopback, so this may be elsewhere - possibly IPSec scripts, since there were so many fixes in 2.5.1? Probably a good test would be to stop/start IPSec and see if this breaks the loopback again, at least it would narrow this down somewhat.

Anyhow, I added shellcmds (regular, not early) adding routes to the various lo0 addresses, and that seems to have worked so far, 10+ reboots. It's an ugly fix but I'm not touching it until some proper resolution comes up. I've run out of downtime credits for now so can't test much for the next few days.
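
For anyone wanting to try the same workaround, here is a rough sketch of what such a shellcmd could look like. The exact commands used on the affected nodes were not posted, so everything here is an assumption: the function name `lo0_route_cmds` is made up, 172.16.0.1 stands in for the redacted alias address, and `route add -host ADDR -iface lo0` is the usual FreeBSD form for an interface route. The function only prints the commands, so they can be reviewed before being run.

```shell
#!/bin/sh
# Sketch: print a `route add` command for every expected lo0 address that is
# missing from the routing table. Hypothetical helper, not part of pfSense.
#   stdin = output of `netstat -rn -f inet`
#   args  = addresses that should have a host route via lo0
lo0_route_cmds() {
    table=$(cat)   # keep the routing table text so it can be scanned per address
    for addr in "$@"; do
        # The route exists if some line has the address as its first column.
        if ! printf '%s\n' "$table" |
            awk -v a="$addr" '$1 == a { found = 1 } END { exit !found }'; then
            printf 'route add -host %s -iface lo0\n' "$addr"
        fi
    done
}
```

On the firewall this would be fed from the live table, for example `netstat -rn -f inet | lo0_route_cmds 127.0.0.1 172.16.0.1`, and the printed lines piped to `sh` once they look right. Because it only adds what is missing, it should be safe to run on every boot.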

viktor_g

@612brokeaf said in 2.5.1: missing route to localhost (no joke):

@viktor_g OK, that patch works, albeit not completely. I now have the route to 127.0.0.1 in the table, but another route for localhost (a secondary 172.16.x.x/32) has disappeared after rebooting, meaning my routing is now completely broken until I manually add that route pointing it to localhost, because I rely on this for BGP etc. Interestingly, there is another /32 on lo0 from the same range that I use as GRE source, and that was unaffected.

Unable to reproduce on the latest dev snapshot:
Screenshot from 2021-05-09 16-04-01.png

all OK after rebooting:

# netstat -rn | grep 127
5.5.5.0/24         127.0.0.1          UGSB        lo0
6.6.6.6/32         127.0.0.1          UGSB        lo0
127.0.0.1          link#5             UH          lo0

viktor_g

@612brokeaf said in 2.5.1: missing route to localhost (no joke):

For completeness: I have another manual modification in place, in /etc/inc/config.lib.inc, and that is changing alias_make_table(); to alias_make_table($config);, because otherwise I kept getting crash reports / PHP errors complaining about alias_make_table being called with zero arguments and expecting one. This was being triggered from the ACME cert renewal cron job. There is also another bug in ACME, complaining about the function getarraybyref() not found. Even though all PHP include chains look fine, I can't find another way to fix this than pasting that function into the same scope in ACME. This is for another topic though - this issue looked fixed in 2.5.0, but maybe I fixed it by hand and forgot about it until 2.5.1.

Please create a bug report about this issue:
https://docs.netgate.com/pfsense/en/latest/development/bug-reports.html