Multi-WAN - One of two WAN in failover drops ~1-2 min. for unknown reason

wm408

Hi,

Note: the log sample below shows newer records at the top, older at the bottom.

I keep getting log messages about one of my gateways going down intermittently. Sometimes the event repeats within minutes; other times, hours go by without any activity:

Sep 23 17:11:37	php-fpm	330	/rc.dyndns.update: MONITOR: WAN2GW is available now, adding to routing group WAN2failtoWAN 8.8.4.4|74.51.222.14|WAN2GW|15.522ms|4.223ms|0.0%|none
Sep 23 17:11:36	check_reload_status		Reloading filter
Sep 23 17:11:36	check_reload_status		Restarting OpenVPN tunnels/interfaces
Sep 23 17:11:36	check_reload_status		Restarting ipsec tunnels
Sep 23 17:11:36	check_reload_status		updating dyndns WAN2GW
Sep 23 17:10:20	php-fpm	328	/rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN2GW.
Sep 23 17:10:20	php-fpm	328	/rc.dyndns.update: MONITOR: WAN2GW is down, omitting from routing group WAN2failtoWAN 8.8.4.4|74.51.222.14|WAN2GW|1341.057ms|2554.332ms|0.0%|down
Sep 23 17:10:19	check_reload_status		Reloading filter
Sep 23 17:10:19	check_reload_status		Restarting OpenVPN tunnels/interfaces
Sep 23 17:10:19	check_reload_status		Restarting ipsec tunnels
Sep 23 17:10:19	check_reload_status		updating dyndns WAN2GW
Sep 23 17:06:32	php-fpm	329	/rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN2GW.
Sep 23 17:06:32	php-fpm	329	/rc.dyndns.update: MONITOR: WAN2GW is available now, adding to routing group WAN2failtoWAN 8.8.4.4|74.51.222.14|WAN2GW|69.166ms|119.861ms|0.0%|none
Sep 23 17:06:31	check_reload_status		Reloading filter
Sep 23 17:06:31	check_reload_status		Restarting OpenVPN tunnels/interfaces
Sep 23 17:06:31	check_reload_status		Restarting ipsec tunnels
Sep 23 17:06:31	check_reload_status		updating dyndns WAN2GW

This only seems to happen when check_reload_status runs.
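
To put a number on each drop, a rough Python sketch like the one below can pair the "is down" and "is available now" MONITOR events and report how long the gateway was out. The function name `outage_durations` and the `year` parameter are made up for illustration (the log timestamps carry no year), and the sample lines are shortened copies of the log above.

```python
from datetime import datetime

# Shortened copies of three MONITOR lines from the log above (newest first).
LOG = """\
Sep 23 17:11:37 php-fpm 330 /rc.dyndns.update: MONITOR: WAN2GW is available now, adding to routing group WAN2failtoWAN
Sep 23 17:10:20 php-fpm 328 /rc.dyndns.update: MONITOR: WAN2GW is down, omitting from routing group WAN2failtoWAN
Sep 23 17:06:32 php-fpm 329 /rc.dyndns.update: MONITOR: WAN2GW is available now, adding to routing group WAN2failtoWAN
"""


def outage_durations(log_text, gateway, year=2000):
    """Return a list of (down_time, seconds_down) for one gateway.

    `year` is an assumption: the log lines do not include one.
    """
    events = []
    for line in log_text.splitlines():
        if "MONITOR: " + gateway not in line:
            continue
        # The first three whitespace-separated fields are month, day, time.
        stamp_text = f"{year} " + " ".join(line.split()[:3])
        stamp = datetime.strptime(stamp_text, "%Y %b %d %H:%M:%S")
        if "is down" in line:
            events.append((stamp, "down"))
        elif "is available now" in line:
            events.append((stamp, "up"))

    events.sort()  # the log is newest-first, so put it in time order

    outages = []
    down_since = None
    for stamp, state in events:
        if state == "down":
            if down_since is None:
                down_since = stamp
        elif down_since is not None:
            outages.append((down_since, (stamp - down_since).total_seconds()))
            down_since = None
    return outages


for started, seconds in outage_durations(LOG, "WAN2GW"):
    print(f"{started:%b %d %H:%M:%S} down for {seconds:.0f} s")
```

An "available" event with no earlier "down" event in the sample (the 17:06:32 line here) is ignored, since its outage started before the excerpt begins.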

Any thoughts or comments?

Thanks.

wm408

Bump.

Jimp? or cmb?

Any thoughts from you guys? 8)

heper

CMB has gone to where the grass appears greener.

Check your gateway logs. They should give more insight into why the gateway goes down.

wm408

Hi,

I am going to test "Set ping payload size" on the problematic gateway. cmb advised this for "…buggy upstream devices...", which I may be dealing with here. Refer to this post: https://forum.pfsense.org/index.php?topic=110043.0

This is a sample from my Gateways log:

Oct 1 10:45:23 dpinger WAN2GW 8.8.4.4: Clear latency 209224us stddev 412652us loss 0%
Oct 1 10:43:40 dpinger WAN2GW 8.8.4.4: Alarm latency 762289us stddev 1626523us loss 0%
Oct 1 10:29:15 dpinger WAN2GW 8.8.4.4: Clear latency 42259us stddev 123654us loss 0%
Oct 1 10:27:10 dpinger WAN2GW 8.8.4.4: Alarm latency 755434us stddev 2017180us loss 4%
Oct 1 10:14:02 dpinger WAN2GW 8.8.4.4: Clear latency 443760us stddev 978113us loss 0%
Oct 1 10:13:35 dpinger WAN2GW 8.8.4.4: Alarm latency 504314us stddev 971819us loss 0%

I didn't know until now that Chris had moved on. I saw his post. Makes sense! Thanks for the heads up.

Derelict

Well, there you go. dpinger is doing its job.

If you have gateway monitoring on WAN (the default setting), the system is automatically keeping track of two pings per second in Status > Monitoring.

From there, select Settings and change the left axis to Quality / WANGW (or the local equivalent).

A good place to start is Options: 8 hours, Resolution: 1 minute.

Another place to check is in Status > System Logs, Gateways. Any events there with "Alarm" in them are times when the ping monitor had excessive loss or latency.

A failure will look something like this: Jan 7 15:05:31 dpinger WANGW 8.8.8.8: Alarm latency 0us stddev 0us loss 100%

Lines like this are just the dpinger process starting or reloading and are normal:

dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 8.8.4.4 bind_addr 198.51.0.16 identifier "DSLGW "
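
As a rough illustration of how to read that startup line, the Python sketch below splits it into key/value pairs and derives the probe rate from `send_interval` (500 ms works out to the two pings per second mentioned above). The helper names `parse_startup` and `to_ms` are made up for this example, and the "probes per averaging window" figure is only an approximation based on `time_period / send_interval`.

```python
import shlex

# The message part of the startup line quoted above.
STARTUP = ('dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms '
           'report_interval 0ms data_len 0 alert_interval 1000ms '
           'latency_alarm 500ms loss_alarm 20% dest_addr 8.8.4.4 '
           'bind_addr 198.51.0.16 identifier "DSLGW "')


def parse_startup(line):
    """Return the dpinger startup parameters as a dict of strings."""
    tokens = shlex.split(line)            # keeps the quoted identifier intact
    start = tokens.index("send_interval")  # skip "dpinger" and any prefix
    tokens = tokens[start:]
    return dict(zip(tokens[0::2], tokens[1::2]))


def to_ms(value):
    """Convert a value such as '500ms' to an integer number of milliseconds."""
    return int(value[:-2])


params = parse_startup(STARTUP)
send_ms = to_ms(params["send_interval"])
window_ms = to_ms(params["time_period"])

probes_per_second = 1000 / send_ms
probes_per_window = window_ms // send_ms  # approximate count per averaging period

print(params["latency_alarm"], params["loss_alarm"])
print(probes_per_second, probes_per_window)
```

With the values in that line, the alarm thresholds are 500 ms latency and 20% loss, averaged over a 60-second period.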

Sometimes it is beneficial to change your monitoring address to something further out. In that example you can see that I am monitoring a Google DNS server. In general, monitoring the ISP gateway is fine if it reliably responds to pings. The monitor IP address can be changed by editing the appropriate gateway under System > Routing.

wm408

Hi Derelict,

Typically for the Monitor IP, I choose the ISP gateway or one hop past it (as observed with traceroute). But lately, at least for testing, I've set the problematic gateway's Monitor IP to a Google DNS server as well, since that's been a popular choice throughout the forums.

Thanks for your other tips. I will circle back and review each of your points after I look at the results with the topic I mentioned in an earlier post, re: ping payload size.

wm408

Hi,

After reviewing the ping payload size, and also your recommendations, I still have the same issue.
Let me know if any other suggestions come to mind. Thx.

Oct 7 15:31:19	dpinger		WAN2GW 8.8.4.4: duplicate echo reply received
Oct 7 15:31:19	dpinger		WAN2GW 8.8.4.4: duplicate echo reply received
Oct 7 15:29:46	dpinger		WAN2GW 8.8.4.4: Alarm latency 46725667us stddev 0us loss 95%
Oct 7 15:28:14	dpinger		WAN2GW 8.8.4.4: Alarm latency 15032us stddev 3426us loss 25%
Oct 7 15:26:44	dpinger		WAN2GW 8.8.4.4: Clear latency 15014us stddev 2740us loss 0%
