GW status green, but gateway gone
We have two gateways configured for fail-over, and our setup worked really well for quite a long time. A few weeks ago we updated to 2.4.4-RELEASE-p2. Since then we have been observing strange things.
This is an example of what looks like a normal "gateway gone/back again" message pair:
16:09:03 MONITOR: GW_WAN2 is down, omitting from routing group gw_group_wan
2:03:00 MONITOR: GW_WAN2 is available now, adding to routing group gw_group_wan
Sometimes I see that the gateway is gone for a few seconds only and then recovers again.
Last week we saw these messages several hundred times per day (GW was gone for a few seconds and then recovered) and I thought that the ISP had problems. I contacted them and they said that there were no problems. We rebooted pfSense and everything was ok again. Hmmm...
Yesterday (this is where the example above is from) we got the "gone" message again, but no "back again" message. We route one VLAN through the gateway that was gone, and that VLAN was disconnected, even though pfSense is configured to route it through the other interface when this gateway is gone. I checked the UI and the GW status was green the whole time. So I thought it was a "pfSense burp" and that the problem was with the connected clients, or that I had missed the "back again" mail.
We have set up a script that reacts to the GW "gone" and "back again" messages. As it is run via cron, it is triggered only once per minute, and because the earlier outages lasted only a few seconds, they always fell between two cron runs.
Yesterday the outage lasted several hours and the cron job still did not fire. That means the Web UI did not see the outage (the gateway was green the whole time), the script did not see the outage, the VLAN was disconnected, and the GW did not switch. In the morning we got the "back again" message.
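A minimal sketch, in Python, of how such a check could work from the log lines instead of polling the current state once per minute, so that even outages of a few seconds are counted. The pattern is modelled only on the two MONITOR messages quoted above; the exact log layout on a given system is an assumption, and the function names are hypothetical.

```python
import re

# Pattern modelled on the two messages quoted in this post (assumption:
# the real log lines may carry extra fields before "MONITOR:").
EVENT_RE = re.compile(
    r"MONITOR: (?P<gw>\S+) is (?P<state>down|available now), "
    r"(?:omitting from|adding to) routing group (?P<group>\S+)"
)


def parse_events(lines):
    """Yield (gateway, group, is_up) for every MONITOR line found."""
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            yield m.group("gw"), m.group("group"), m.group("state") != "down"


def open_outages(lines):
    """Return the (gateway, group) pairs whose most recent event was 'down'."""
    down = set()
    for gw, group, is_up in parse_events(lines):
        if is_up:
            down.discard((gw, group))
        else:
            down.add((gw, group))
    return down


def count_flaps(lines):
    """Count 'down' events per gateway, including very short outages."""
    counts = {}
    for gw, _group, is_up in parse_events(lines):
        if not is_up:
            counts[gw] = counts.get(gw, 0) + 1
    return counts
```

Fed with the log lines collected since the last run, `open_outages` reports gateways that went down and have not come back, and `count_flaps` shows how often a gateway dropped even if each drop lasted only seconds.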
I saw in the release notes that something in the GW handling has changed, but I was sure it was not relevant for our simple case. We use the second GW for fail-over, and we route this low-traffic VLAN via that fail-over GW. If the fail-over GW is gone, this VLAN is routed through the default GW. Yesterday this did not happen either. The setup looks like this:
GW1 : default for LAN (fail-over for VLAN)
GW2 : fail-over for LAN (default for that VLAN)
GW-GROUP-1 : GW1 -> Tier1, GW2 -> Tier2
GW-GROUP-2 : GW1 -> Tier2, GW2 -> Tier1
All routed via GW-GROUP-1
VLAN routed via GW-GROUP-2
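To make the expected behaviour concrete, here is a minimal sketch in Python of the tier logic as we understand it: a group uses the gateways in its lowest-numbered tier that are still up. This is only a model of the configuration listed above, not pfSense code; the `GROUPS` table and the function name are hypothetical.

```python
# Tier assignments copied from the setup above (lower tier = preferred).
GROUPS = {
    "GW-GROUP-1": {"GW1": 1, "GW2": 2},
    "GW-GROUP-2": {"GW1": 2, "GW2": 1},
}


def active_gateways(group, status):
    """Return the gateways a group should use, given up/down status.

    status maps gateway name -> True (up) or False (down); a gateway
    missing from status is treated as down.
    """
    tiers = GROUPS[group]
    up = [gw for gw in tiers if status.get(gw, False)]
    if not up:
        return []
    best_tier = min(tiers[gw] for gw in up)
    return sorted(gw for gw in up if tiers[gw] == best_tier)
```

With GW2 marked down, `active_gateways("GW-GROUP-2", {"GW1": True, "GW2": False})` yields `["GW1"]`, which is the switch we expected for the VLAN. If the monitor still reports GW2 as up, the same call with both gateways up yields `["GW2"]`, so the VLAN stays on the dead gateway.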
To make a long story short: GW2 was gone, the Web UI did not see that and stayed green, the pfSense GW switching logic did not fire, and our cron job did not see it either. Why? What went wrong?
Can someone please explain what was happening or what has changed in the new pfSense version that could cause this behaviour?