Policy-based routing not using correct gateway/not populating gateway following gateway flap
Some background info:
pfSense version: 22.05-RELEASE
Hardware: Netgate SG-3100
Connections:
- Starlink on mvneta2
- VDSL via A&A (aa.net.uk) over PPPoE (via mvneta0)
- L2TP tunnel via A&A (aa.net.uk) over Starlink (via mvneta2)
Output of system aliases in /tmp/rules.debug:
#System aliases
loopback = "{ lo0 }"
STARLINK = "{ mvneta2 }"
LAN = "{ mvneta1 }"
AAISP = "{ pppoe0 }"
GUEST = "{ mvneta1.102 }"
IOT = "{ mvneta1.103 }"
AAISP_STARLINK = "{ l2tp1 }"
AAISP_MODEM = "{ mvneta0 }"
Tailscale = "{ Tailscale }"
Output of gateways in /tmp/rules.debug when everything is working:
# Gateways
GWSTARLINK_DHCP = " route-to ( mvneta2 100.64.0.1 ) "
GWAAISP_PPPOE = " route-to ( pppoe0 81.187.81.187 ) "
GWAAISP_STARLINK_L2TP = " route-to ( l2tp1 81.187.81.187 ) "
GWSTARLINK_AAISP_FAILOVER = " route-to { ( mvneta2 100.64.0.1 ) } "
GWAAISP_STARLINK_FAILOVER = " route-to { ( pppoe0 81.187.81.187 ) } "
GWL2TP_STARLINK_AAISP_FAILOVER = " route-to { ( l2tp1 81.187.81.187 ) } "
GWAAISP_L2TP_STARLINK_FAILOVER = " route-to { ( pppoe0 81.187.81.187 ) } "
The relevant user alias from /tmp/rules.debug:
table <AAISP_Outbound> { 172.25.1.221 172.25.1.222 172.25.1.223 172.25.1.224 172.25.1.225 }
AAISP_Outbound = "<AAISP_Outbound>"
(That gateway IP is not sensitive; it's the PPP endpoint which is published on the A&A website: https://support.aa.net.uk/Server_List)
PBR is set to always route a set of internal IPv4 addresses out of GWAAISP_PPPOE with the following rule:
pass in quick on $LAN $GWAAISP_PPPOE inet from $AAISP_Outbound to any ridentifier 1653729485 keep state label "USER_RULE" label "id:1653729485" label "gw:AAISP_PPPOE"
As both my VDSL and L2TP connections are provided by the same ISP, the gateway address is identical (81.187.81.187), but this routes out the right interface after a fresh pfSense reboot.
The issue is that after a random number of gateway flaps on AAISP (pppoe0), pfSense loses the gateway for GWAAISP_PPPOE, even though it shows as up in the UI and the metrics for ping responses are updating. This could be after one gateway flap or after ten; there doesn't seem to be a common scenario, but eventually pfSense does lose the gateway. When it loses the gateway, the output of gateways in /tmp/rules.debug looks like this:
# Gateways
GWSTARLINK_DHCP = " route-to ( mvneta2 100.64.0.1 ) "
GWAAISP_PPPOE = " "
GWAAISP_STARLINK_L2TP = " route-to ( l2tp1 81.187.81.187 ) "
GWSTARLINK_AAISP_FAILOVER = " route-to { ( mvneta2 100.64.0.1 ) } "
GWAAISP_STARLINK_FAILOVER = " route-to { ( mvneta2 100.64.0.1 ) } "
GWL2TP_STARLINK_AAISP_FAILOVER = " route-to { ( l2tp1 81.187.81.187 ) } "
GWAAISP_L2TP_STARLINK_FAILOVER = " route-to { ( l2tp1 81.187.81.187 ) } "
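To catch the moment this happens without eyeballing the file each time, something like the following could be run against a copy of /tmp/rules.debug after each flap. It is only a rough sketch: the function names are made up for illustration, the SAMPLE string is trimmed from the broken output above, and it assumes gateway macros always start with "GW" and have the NAME = "value" shape shown in this post.

```python
import re

# Matches macro definitions such as:
#   GWAAISP_PPPOE = " route-to ( pppoe0 81.187.81.187 ) "
MACRO_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def parse_gateway_macros(text):
    """Return {macro_name: stripped_value} for every macro whose name starts with GW."""
    return {
        name: value.strip()
        for name, value in MACRO_RE.findall(text)
        if name.startswith("GW")  # assumption: gateway macros carry a GW prefix
    }


def empty_gateways(text):
    """Return the sorted names of gateway macros that have no route-to target."""
    macros = parse_gateway_macros(text)
    return sorted(name for name, value in macros.items() if not value)


# Trimmed copy of the broken output from this post, used as sample input.
SAMPLE = '''
# Gateways
GWSTARLINK_DHCP = " route-to ( mvneta2 100.64.0.1 ) "
GWAAISP_PPPOE = " "
GWAAISP_STARLINK_L2TP = " route-to ( l2tp1 81.187.81.187 ) "
AAISP_Outbound = "<AAISP_Outbound>"
'''

print(empty_gateways(SAMPLE))
```

To check the live file instead of the sample, read it with open("/tmp/rules.debug").read() and pass the result to empty_gateways(); any name it returns is a gateway macro that a policy rule would reference with no route-to at all.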
As you can see, pfSense doesn't re-populate the GWAAISP_PPPOE gateway after the connection comes back up. I've tried restarting dpinger and manually reconnecting the PPPoE interface (via Status > Interfaces), and neither works; the only thing that gets traffic routing back out of GWAAISP_PPPOE is rebooting pfSense entirely.
I'm kind of at a loss; everything seems to be recovering correctly, it's just that pfSense doesn't seem to pick up the gateway again after a random number of gateway flaps.
I'd appreciate any pointers you folks can give, as I've tried everything I can think of.
Thank you in advance!