WAN2 goes down for packet loss, doesn't come back up until gateways page viewed

SteveITS

Have a client's SG-2440 with two WANs (it's the same one as https://forum.netgate.com/topic/147889/member-down-triggering-with-0-loss actually, though that's probably not relevant as that's on WAN1). It's on 2.4.4-p3. On multiple occasions this year, when WAN2 goes down it stays down until I log in to the router and view the gateways page, at which point pfSense suddenly realizes the connection is up again. Logs:

Jul 6 18:01:32 	php-cgi 		notify_monitor.php: Message sent to support@example.com OK
Jul 6 18:01:30 	php-fpm 	77236 	/system_gateways.php: 77236MONITOR: WAN2_DHCP is available now, adding to routing group GWGROUP 8.8.8.8|172.16.0.51|WAN2_DHCP|18.194ms|0.461ms|0.0%|none
Jul 6 18:01:22 	php-fpm 	3179 	/index.php: Successful login for user 'admin' from: 173.x.x.x (Local Database)
Jul 6 17:06:12 	check_reload_status 		Reloading filter
Jul 6 17:06:12 	check_reload_status 		Restarting OpenVPN tunnels/interfaces
Jul 6 17:06:12 	check_reload_status 		Restarting ipsec tunnels
Jul 6 17:06:12 	check_reload_status 		updating dyndns WAN2_DHCP
Jul 6 17:06:12 	rc.gateway_alarm 	99608 	>>> Gateway alarm: WAN2_DHCP (Addr:8.8.8.8 Alarm:0 RTT:18.124ms RTTsd:.421ms Loss:13%)
Jul 6 17:03:26 	php-cgi 		notify_monitor.php: Message sent to support@example.com OK
Jul 6 17:03:26 	php-fpm 	3179 	/rc.openvpn: MONITOR: WAN2_DHCP is down, omitting from routing group GWGROUP 8.8.8.8|172.16.0.51|WAN2_DHCP|18.312ms|0.485ms|22%|down
Jul 6 17:03:25 	check_reload_status 		Reloading filter
Jul 6 17:03:25 	check_reload_status 		Restarting OpenVPN tunnels/interfaces
Jul 6 17:03:25 	check_reload_status 		Restarting ipsec tunnels
Jul 6 17:03:25 	check_reload_status 		updating dyndns WAN2_DHCP
Jul 6 17:03:25 	rc.gateway_alarm 	79942 	>>> Gateway alarm: WAN2_DHCP (Addr:8.8.8.8 Alarm:1 RTT:18.312ms RTTsd:.460ms Loss:21%)

It doesn't matter whether I log in an hour later or a couple of days later: the recovery is immediate upon viewing the system_gateways.php page. Is there some way to get it to realize WAN2 is online again?

WAN1 doesn't seem to have this problem.

serbus

Hello!

Could be related to:

https://redmine.pfsense.org/issues/9450

John

SteveITS

Hmm, sounds similar. dpinger logged:

Jul 6 17:06:12 dpinger WAN2_DHCP 8.8.8.8: Clear latency 18124us stddev 421us loss 13%
Jul 6 17:03:25 dpinger WAN2_DHCP 8.8.8.8: Alarm latency 18312us stddev 460us loss 21%

So that cleared the gateway-down alarm because loss was back under 20%?
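
To make that reading concrete, here is a minimal sketch that parses the two dpinger lines above and compares the reported loss with a 20% threshold. The 20% figure is an assumption (it is the value guessed at in the question, not read from this box's config), and parse_dpinger_line() is a hypothetical helper, not a pfSense function.

```php
<?php
// Sketch only: parse dpinger syslog lines like the two above and compare
// the reported loss with an ASSUMED alarm threshold of 20%.
const ASSUMED_LOSS_THRESHOLD = 20; // percent; assumption, not read from config

/**
 * Hypothetical helper: returns gateway, monitor IP, state and loss for a
 * dpinger log line, or null if the line doesn't match the expected shape.
 */
function parse_dpinger_line(string $line): ?array
{
    $re = '/dpinger (\S+) (\S+): (Alarm|Clear) latency (\d+)us stddev (\d+)us loss (\d+)%/';
    if (!preg_match($re, $line, $m)) {
        return null;
    }
    return [
        'gateway' => $m[1],
        'monitor' => $m[2],
        'state'   => $m[3],
        'loss'    => (int)$m[6],
    ];
}

$lines = [
    'Jul 6 17:03:25 dpinger WAN2_DHCP 8.8.8.8: Alarm latency 18312us stddev 460us loss 21%',
    'Jul 6 17:06:12 dpinger WAN2_DHCP 8.8.8.8: Clear latency 18124us stddev 421us loss 13%',
];

foreach ($lines as $line) {
    $e = parse_dpinger_line($line);
    if ($e === null) {
        continue; // not a dpinger alarm/clear line
    }
    $side = $e['loss'] > ASSUMED_LOSS_THRESHOLD ? 'above' : 'at or below';
    printf("%s %s: loss %d%% is %s %d%%\n",
        $e['gateway'], $e['state'], $e['loss'], $side, ASSUMED_LOSS_THRESHOLD);
}
```

If that assumption holds, the Alarm line sits above the threshold and the Clear line below it, which would mean dpinger itself recovered at 17:06:12 and the stale "down" state lives elsewhere in pfSense.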

I definitely do not have to save the gateway, but I have clicked the edit button to open the gateway. Next time I can try just sitting on the system_gateways.php page for a bit to see if it sends the email.

serbus

Hello!

If you want to go the kludge route, you could run some code like this when the gw status is out of sync:

/***********************************************************************/
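#!/usr/local/bin/php-cgi -q
<?php
// Look up the members of a gateway group. The lookup itself may prod
// pfSense into refreshing a stale member status.
require_once("gwlb.inc");

// -g <group>: name of the gateway group to query
$options = getopt("g:");

$members = [];
$gwgroup = "";

// Only use -g if it was actually passed and is non-empty
if (!empty($options['g'])) {
	$gwgroup = $options['g'];
}

if (!empty($gwgroup)) {
	$members = get_gwgroup_members($gwgroup);
}

var_dump($members);

?>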
/***********************************************************************/

Run...

php /saved/here/named_this.php -g="GWGRP_Name"

...from a shell/cron/DiagCommandPrompt/etc...

This might prod get_gwgroup_members_inner() to reactivate the member.

John

netblues

I wouldn't trust pinging Google DNS for gateway availability. I have seen Google rate-limit pings, so the pings fail while everything else still works.
You can always find something closer within your ISP for such checks.

As for the redmine bug, just hitting edit certainly doesn't do anything until you save.
I don't see this in other multi-WAN setups though.

SteveITS

I edited that a bit and ran this from Diagnostics/Command Prompt:

require_once("gwlb.inc");
$members = [];
$gwgroup = 'GWGROUP';
if (!empty($gwgroup)) {
	$members = get_gwgroup_members($gwgroup);
}
var_dump($members);

That reconnected the gateway as you theorized. In practice of course just viewing the gateways page is easier. :)

Re: which IP to ping, I've tried picking an ISP's router partway up the chain, and over time those can change. Since this is at a client's site, that would be difficult to correct if the link goes down, though in this case both WANs would likely not drop together. Pinging the ISP's router at the other end of the patch cable is of course not that helpful, though I've seen people leave the monitoring IP empty, which does that. :)

serbus

Hello!

You could schedule the command to run every so often, and then you wouldn't have to log in to refresh the group manually.
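
A minimal sketch of such a schedule, assuming the script was saved at the placeholder path from earlier in the thread and the group is named GWGROUP as in the logs. The five-minute interval, the root user, and the /usr/local/bin/php path are assumptions; on pfSense it is probably safest to add the entry through the Cron package rather than editing /etc/crontab by hand.

```
# minute  hour  mday  month  wday  who   command
*/5       *     *     *      *     root  /usr/local/bin/php /saved/here/named_this.php -g=GWGROUP > /dev/null 2>&1
```

Output is discarded here because the point is only to trigger the group lookup; drop the redirect while testing to see the var_dump result.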

John

SteveITS

I set up a cron job. Before that, some interesting notes for posterity:

Twice in the last couple of weeks the WAN2 gateway status reset by itself at 1:01 am. There is a cron job that runs /etc/rc.dyndns.update at that time. No, we don't have DDNS set up. There were, however, other days in the last few months when it did not reset itself at that time. It's unclear why.

I found by accident this morning that if I edit/add a firewall rule and save/reload, that also updates the gateway status.

SteveITS

@netblues For what it's worth, switching the gateway monitor target away from Google DNS didn't "prevent" the packet loss.

SteveITS

I noticed this was fixed in 2.5.0/21.02:
https://redmine.pfsense.org/issues/10546
"In this case, pfsense will consider a gateway down when it has actually returned to a normal state, necessitating administrator action to return it back to a proper state."