Failover didn't fall back to Tier1 after downtime

heper

- Tier1 came back online
- Gateway group showed both as 'online'
- Default route = Tier1
- Default gateway switching = enabled

Still, for whatever reason, it kept sending "some" clients out through Tier2; this was still happening 10 hours after the last gateway event.

To fix it, I clicked "reset all states" in the GUI.
(The states can't have been left over from before the gateway event, because nobody was around at 2 a.m.)


May 2 13:38:35 	dpinger 		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 40% dest_addr 195.130.130.11 bind_addr 81.82.213.131 identifier "WAN_TELENET0 "
May 2 13:38:35 	dpinger 		send_interval 500ms loss_interval 10000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 40% dest_addr 195.238.2.21 bind_addr 192.168.5.2 identifier "WAN_SCARLETGW "
May 1 02:14:26 	dpinger 		WAN_TELENET0 195.130.130.11: Clear latency 9788us stddev 1395us loss 29%
May 1 02:13:26 	dpinger 		WAN_TELENET0 195.130.130.11: Alarm latency 9607us stddev 1263us loss 41%
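To line up dpinger's Alarm/Clear events with when traffic actually moved, a small filter can pull just the transitions out of the gateway log. This is a sketch that assumes lines shaped like the ones pasted above (gateway name, target IP, then `Alarm` or `Clear`, later `loss N%`); the function name `dpinger_events` is hypothetical, not something pfSense ships, and the raw log on the box may have a different prefix (hostname, PID) before the gateway name.

```shell
#!/bin/sh
# dpinger_events: hypothetical helper, not part of pfSense.
# Reads dpinger log lines on stdin and prints "<gateway> <Alarm|Clear> <loss>"
# for each alarm/clear transition. Assumed line format (as pasted in this thread):
#   May 1 02:14:26  dpinger  WAN_TELENET0 195.130.130.11: Clear latency ... loss 29%
# Lines without an Alarm/Clear keyword (e.g. the startup parameter lines) are skipped.
dpinger_events() {
    awk '
    /dpinger/ {
        for (i = 3; i <= NF; i++) {
            if ($i == "Alarm" || $i == "Clear") {
                gw = $(i - 2)              # gateway name sits two fields before the keyword
                loss = "?"
                for (j = i + 1; j < NF; j++)
                    if ($j == "loss") loss = $(j + 1)
                print gw, $i, loss
                break
            }
        }
    }'
}
```

Output keeps the input order, so when pasting from the GUI (newest first) the most recent transition is on the first line.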

The dpinger values are the ones that made apinger work somewhat reliably.
Perhaps they should be set back to sane values, now that we have a good pinger?

That said, the system has been up for 20 days and a couple of failover events took place in that time; this is the first time it didn't fall back.

Suggestions?

heper

Since my first post it has happened again, three times to be exact.
I've reset all dpinger values to their defaults via the GUI.

dpinger clears the alarm, but pfSense keeps sending traffic toward the Tier2 gateway (identical to the first post).
I suspect the 'clear' isn't always picked up by the backend code.

Today I changed the trigger level from 'member down' to 'packet loss or high latency'.
I'll update this thread over the next couple of days/weeks.

maverick_slo

Maybe you're hitting this? https://redmine.pfsense.org/issues/6110

heper

Perhaps, but there's no PPP(oE) involved. Default gateway switching may still be enabled (from back when a transparent proxy was running); I'll check whether disabling it makes a difference.

cmb

Maybe it's already-established connections that are staying there? That would be expected.
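One way to test that is to look at the state table directly. The sketch below counts pf state lines that mention a given address; you supply the Tier2 WAN address yourself, and `count_states_via` is a hypothetical helper name. It assumes `pfctl -ss` output as input (pf's state listing); `pfctl -F states` flushes the whole table, which is likely what the GUI's state reset does, but that equivalence is an assumption.

```shell
#!/bin/sh
# count_states_via: hypothetical helper, not part of pfSense.
# Counts state-table lines that mention the given IPv4 address.
# Intended use on the firewall:  pfctl -ss | count_states_via 192.0.2.10
count_states_via() {
    # Escape the dots so they match literally in the regex.
    escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
    # Require a non-digit/non-dot on both sides so 192.168.5.2
    # does not also match 192.168.5.20; -c prints the match count.
    grep -c -E '(^|[^0-9.])'"$escaped"'([^0-9.]|$)'
}
```

A count that keeps growing for the Tier2 address long after the alarm cleared would point at new connections, not leftovers.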

Two things influence traffic routing. First, I'm guessing your clients are routed via a gateway group, which you can verify on the back end with:

grep route-to /tmp/rules.debug

The other is the default gateway, which applies to traffic matching firewall rules whose gateway is set to "default" rather than a gateway group. Check Diag>Routes to verify that.
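Both checks can be wrapped in a small script to run from a shell on the firewall the next time it happens. This is a sketch: the function names and the `RULES_FILE` override are hypothetical; `/tmp/rules.debug` is the path given above, and `route -n get default` is FreeBSD route(8) syntax, assumed here as a CLI counterpart of Diag>Routes.

```shell
#!/bin/sh
# Hypothetical wrapper around the two routing checks; not part of pfSense.
RULES_FILE="${RULES_FILE:-/tmp/rules.debug}"

# 1) Rules that policy-route traffic (route-to), e.g. via a gateway group.
show_route_to() {
    if [ ! -r "$RULES_FILE" ]; then
        echo "cannot read $RULES_FILE" >&2
        return 1
    fi
    grep 'route-to' "$RULES_FILE"
}

# 2) The system default route, used by rules whose gateway is "default".
show_default_route() {
    route -n get default
}
```

If the `route-to` lines for the affected clients still list only the Tier2 gateway after dpinger has cleared, that points at the rule generation rather than at leftover states.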

heper

@cmb:

Guessing it's already-established connections that are staying there maybe? That'd be expected.

It kept sending new clients toward Tier2 for days after the gateway event, so it can't have been (all) established connections.

@cmb:

Two things influence traffic routing. Guessing your clients are being routed via a gateway group, which you can verify on the back end with:
grep route-to /tmp/rules.debug
The other thing would be the default gateway, for traffic matching firewall rules set to "default" rather than a gateway group. Check Diag>Routes to verify that.

I'll check rules.debug if/when it happens again.